* [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps
@ 2022-11-09 18:15 Darrick J. Wong
  2022-11-09 18:15 ` [PATCH 01/14] xfs: write page faults in iomap are not buffered writes Darrick J. Wong
                   ` (15 more replies)
  0 siblings, 16 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:15 UTC (permalink / raw)
  To: djwong
  Cc: Christoph Hellwig, Dave Chinner, linux-xfs, linux-fsdevel, david, hch

Hi all,

This is my adaptation of Dave's last RFC.  Dave's patches are unchanged
except for exporting mapping_seek_hole_data to fix a compilation error.

The last seven patches of the series are where I change things up.  The
first two of those refactor ->iomap_begin and ->iomap_end to receive a
const pointer to the iterator, which reduces the argument count and
makes it possible for ->iomap_begin to access the iter->private pointer.
The third new patch changes the iomap pagecache write functions to
enable the filesystem to set iter->private, similar to iomap_dio_rw.

Having done that, I converted the xfs code to stuff the data/cow
sequence counters in an iter->private object instead of bit-stuffing
them into the iomap->private pointer.  Maybe it would've been smarter to
make filesystems tell iomap about their notions of how large struct
iomap objects should be (thereby enabling each fs to cram extra data
along in the iomap) but that seemed like more work.

So having replaced the iomap sequence counters with an explicit object,
I then made the validator check the cow fork.  Not sure if it's really
necessary, but paranoia on my part.  I /think/ it's the case that
updates to the cow fork don't affect writing to the page cache, but I've
wondered if the same validation rules might apply to other things (like
directio writes).
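
As a rough illustration, the arrangement described above might look
something like this (patches 8-14 aren't quoted here, so the structure
and field names below are mine, not the actual code):

	/* hypothetical per-write context hung off iter->private */
	struct xfs_buffered_write_ctx {
		int	data_seq;	/* ip->i_df.if_seq when mapped */
		int	cow_seq;	/* ip->i_cowfp->if_seq when mapped */
	};

	/* ->iomap_begin samples the counters while holding the ILOCK... */
	ctx->data_seq = READ_ONCE(ip->i_df.if_seq);
	if (xfs_inode_has_cow_data(ip))
		ctx->cow_seq = READ_ONCE(ip->i_cowfp->if_seq);

	/* ...and ->iomap_valid recompares them once the folio is locked */
	if (ctx->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;
	if (xfs_inode_has_cow_data(ip) &&
	    ctx->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
		return false;
	return true;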

Lastly, I added a couple of write/writeback delay knobs so that I could
write some tests that simulate race conditions and check that slow
threads encounter iomap invalidation midway through an operation.

I haven't gotten to analyzing Brian's eofblock truncate fixes yet, but I
wanted to push this out for comments since it's now survived an
overnight fstests run.

NOTE: I don't have RH's original reproducer, so I have no idea if this
series really fixes that corruption problem.
----
Recently a customer workload encountered a data corruption in a
specific multi-threaded write operation. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.

The result of this was random partial blocks containing zeroes
instead of the correct data.  The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.

The short story is that an iomap can span multiple folios, and so
under low memory conditions writeback can be cleaning folios that the
write() overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk.  This is how we get random zero ranges in the file
instead of the correct data.

The gory details of the race condition can be found here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

Fixing the problem has two aspects. The first is ensuring that iomap
can detect a stale cached iomap during a write in a race-free
manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.

In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.

If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations are relatively simple....

However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.

This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure handling in the filesystem needs to account for the
fact that the write() in progress doesn't necessarily own the data in
the page cache over the range of the delalloc extent it just allocated.

As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch out the entire delalloc extent. We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.

That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases arising from conversion and
rounding between filesystem block size, folio size and multi-page
folio size, combined with unaligned write offsets, kept breaking my
brain.

Eventually, I realised that if the XFS code tracked the processed
write regions by byte ranges instead of filesystem block or page
cache index, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
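
Condensed from the patch 4 code below (locking, error handling and the
dirty-folio scan elided), the walk ends up looking like this:

	while (start_byte < end_byte) {
		loff_t data_end;

		/* find the start of the next cached data region */
		start_byte = mapping_seek_hole_data(inode->i_mapping,
				start_byte, end_byte, SEEK_DATA);
		if (start_byte == -ENXIO || start_byte == end_byte)
			break;	/* no more data; punch the rest */

		/* find where that contiguous data region ends */
		data_end = mapping_seek_hole_data(inode->i_mapping,
				start_byte, end_byte, SEEK_HOLE);

		/* scan [start_byte, data_end) for dirty folios and punch
		 * out the clean ranges, then start the next search here */
		start_byte = data_end;
	}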

Of course, now xfs/196 fails. This is an error injection test that is
supposed to exercise the delalloc extent recovery code that the above
fixes just completely reworked. The error injection assumes that it
can just truncate the page cache over the write and then punch out
the delalloc extent completely. This is fundamentally broken, and
has only been working by chance - the chance being that writes are
page aligned, and page aligned writes don't install large folios in
the page cache.

IOWs, with sub-folio block size, and not knowing what size folios are
in the cache, we can't actually guarantee that we can remove the
cached dirty folios from the cache via truncation, and hence the new
code will not remove the delalloc extents under those dirty folios.
As a result, the error injection ends up writing zeroes to disk
rather than removing the delalloc extents from memory. I can't make
this error injection work the way it was intended, so I removed
it. The code that it is supposed to exercise is now exercised every time we
detect a stale iomap, so we have much better coverage of the failed
write error handling than the error injection provides us with,
anyway....

So, this passes fstests on 1kb and 4kb block sizes and the data
corruption reproducer does not detect data corruption, so this set
of fixes is /finally/ something I'd consider ready for merge.
Comments and testing welcome!

-Dave.

Version 2:
- refactor iomap code a lot, track data/cow sequence counters separately,
  add debugging knobs so we can test the revalidation [djwong]

Version 1:
- complete rework of iomap stale detection
- complete rework of XFS partial delalloc write error handling.

Original RFC:
- https://lore.kernel.org/linux-xfs/20220921082959.1411675-1-david@fromorbit.com/

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=iomap-write-races-6.2
---
 fs/btrfs/inode.c             |   18 +-
 fs/erofs/data.c              |   12 +
 fs/erofs/zmap.c              |    6 -
 fs/ext2/inode.c              |   16 +-
 fs/ext4/extents.c            |    5 -
 fs/ext4/inode.c              |   38 +++-
 fs/f2fs/data.c               |    9 +
 fs/fuse/dax.c                |   14 +-
 fs/gfs2/bmap.c               |   28 ++-
 fs/gfs2/file.c               |    2 
 fs/hpfs/file.c               |    8 +
 fs/iomap/buffered-io.c       |   67 ++++++--
 fs/iomap/iter.c              |   27 ++-
 fs/xfs/libxfs/xfs_bmap.c     |    4 
 fs/xfs/libxfs/xfs_errortag.h |   18 +-
 fs/xfs/xfs_aops.c            |   12 +
 fs/xfs/xfs_error.c           |   43 ++++-
 fs/xfs/xfs_error.h           |   22 ++
 fs/xfs/xfs_file.c            |    5 -
 fs/xfs/xfs_iomap.c           |  371 ++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_iomap.h           |    7 +
 fs/xfs/xfs_reflink.c         |    3 
 fs/zonefs/super.c            |   27 ++-
 include/linux/iomap.h        |   34 +++-
 mm/filemap.c                 |    1 
 25 files changed, 610 insertions(+), 187 deletions(-)



* [PATCH 01/14] xfs: write page faults in iomap are not buffered writes
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
@ 2022-11-09 18:15 ` Darrick J. Wong
  2022-11-09 18:15 ` [PATCH 02/14] xfs: punching delalloc extents on write failure is racy Darrick J. Wong
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:15 UTC (permalink / raw)
  To: djwong
  Cc: Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel, david, hch

From: Dave Chinner <dchinner@redhat.com>

When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
we mark the iomap as IOMAP_F_NEW so that the write context
understands that it allocated the delalloc region.

If we then fail that buffered write, xfs_buffered_write_iomap_end()
checks for the IOMAP_F_NEW flag and if it is set, it punches out
the unused delalloc region that was allocated for the write.

The assumption this code makes is that all buffered write operations
that can allocate space are run under an exclusive lock (i_rwsem).
This is an invalid assumption: page faults in mmap()d regions call
through this same function pair to map the file range being faulted
and this runs only holding the inode->i_mapping->invalidate_lock in
shared mode.

IOWs, we can have races between page faults and write() calls that
fail the nested page cache write operation that result in data loss.
That is, the failing iomap_end call will punch out the data that
the other racing iomap iteration brought into the page cache. This
can be reproduced with generic/34[46] if we arbitrarily fail page
cache copy-in operations from write() syscalls.

Code analysis tells us that the iomap_page_mkwrite() function holds
the already instantiated and uptodate folio locked across the iomap
mapping iterations. Hence the folio cannot be removed from memory
whilst we are mapping the range it covers, and as such we do not
care if the mapping changes state underneath the iomap iteration
loop:

1. if the folio is not already dirty, there are no writeback races
   possible.
2. if we allocated the mapping (delalloc or unwritten), the folio
   cannot already be dirty. See #1.
3. If the folio is already dirty, it must be up to date. As we hold
   it locked, it cannot be reclaimed from memory. Hence we always
   have valid data in the page cache while iterating the mapping.
4. Valid data in the page cache can exist when the underlying
   mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
   change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
   change the data in the page - it only affects actions if we are
   initialising a new page. Hence #3 applies and we don't care
   about these extent map transitions racing with
   iomap_page_mkwrite().
5. iomap_page_mkwrite() checks for page invalidation races
   (truncate, hole punch, etc) after it locks the folio. We also
   hold the mapping->invalidate_lock here, and hence the mapping
   cannot change due to extent removal operations while we are
   iterating the folio.

As such, filesystems that don't use bufferheads will never fail
the iomap_folio_mkwrite_iter() operation on the current mapping,
regardless of whether the iomap should be considered stale.

Further, the range we are asked to iterate is limited to the range
inside EOF that the folio spans. Hence, for XFS, we will only map
the exact range we are asked for, and we will only do speculative
preallocation with delalloc if we are mapping a hole at the EOF
page. The iterator will consume the entire range of the folio that
is within EOF, and anything beyond the EOF block cannot be accessed.
We never need to truncate this post-EOF speculative prealloc away in
the context of the iomap_page_mkwrite() iterator because if it
remains unused we'll remove it when the last reference to the inode
goes away.

Hence we don't actually need an .iomap_end() cleanup/error handling
path at all for iomap_page_mkwrite() for XFS. This means we can
separate the page fault processing from the complexity of the
.iomap_end() processing in the buffered write path. This also means
that the buffered write path will be able to take the
mapping->invalidate_lock as necessary.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c  |    2 +-
 fs/xfs/xfs_iomap.c |    9 +++++++++
 fs/xfs/xfs_iomap.h |    1 +
 3 files changed, 11 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e462d39c840e..595a5bcf46b9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1325,7 +1325,7 @@ __xfs_filemap_fault(
 		if (write_fault) {
 			xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 			ret = iomap_page_mkwrite(vmf,
-					&xfs_buffered_write_iomap_ops);
+					&xfs_page_mkwrite_iomap_ops);
 			xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 		} else {
 			ret = filemap_fault(vmf);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 07da03976ec1..5cea069a38b4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1187,6 +1187,15 @@ const struct iomap_ops xfs_buffered_write_iomap_ops = {
 	.iomap_end		= xfs_buffered_write_iomap_end,
 };
 
+/*
+ * iomap_page_mkwrite() will never fail in a way that requires delalloc extents
+ * that it allocated to be revoked. Hence we do not need an .iomap_end method
+ * for this operation.
+ */
+const struct iomap_ops xfs_page_mkwrite_iomap_ops = {
+	.iomap_begin		= xfs_buffered_write_iomap_begin,
+};
+
 static int
 xfs_read_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index c782e8c0479c..0f62ab633040 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -47,6 +47,7 @@ xfs_aligned_fsb_count(
 }
 
 extern const struct iomap_ops xfs_buffered_write_iomap_ops;
+extern const struct iomap_ops xfs_page_mkwrite_iomap_ops;
 extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;



* [PATCH 02/14] xfs: punching delalloc extents on write failure is racy
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
  2022-11-09 18:15 ` [PATCH 01/14] xfs: write page faults in iomap are not buffered writes Darrick J. Wong
@ 2022-11-09 18:15 ` Darrick J. Wong
  2022-11-09 18:16 ` [PATCH 03/14] xfs: use byte ranges for write cleanup ranges Darrick J. Wong
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:15 UTC (permalink / raw)
  To: djwong
  Cc: Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel, david, hch

From: Dave Chinner <dchinner@redhat.com>

xfs_buffered_write_iomap_end() has a comment about the safety of
punching delalloc extents based on holding the IOLOCK_EXCL. This
comment is wrong, and punching delalloc extents is not race free.

When we punch out a delalloc extent after a write failure in
xfs_buffered_write_iomap_end(), we punch out the page cache with
truncate_pagecache_range() before we punch out the delalloc extents.
At this point, we only hold the IOLOCK_EXCL, so there is nothing
stopping mmap() write faults racing with this cleanup operation,
reinstantiating a folio over the range we are about to punch and
hence requiring the delalloc extent to be kept.

If this race condition is hit, we can end up with a dirty page in
the page cache that has no delalloc extent or space reservation
backing it. This leads to bad things happening at writeback time.

To avoid this race condition, we need the page cache truncation to
be atomic w.r.t. the extent manipulation. We can do this by holding
the mapping->invalidate_lock exclusively across this operation -
this will prevent new pages from being inserted into the page cache
whilst we are removing the pages and the backing extent and space
reservation.

Taking the mapping->invalidate_lock exclusively in the buffered
write IO path is safe - it naturally nests inside the IOLOCK (see
truncate and fallocate paths). iomap_zero_range() can be called from
under the mapping->invalidate_lock (from the truncate path via
either xfs_zero_eof() or xfs_truncate_page()), but iomap_zero_iter()
will not instantiate new delalloc pages (because it skips holes) and
hence will not ever need to punch out delalloc extents on failure.

Fix the locking issue, and clean up the code logic a little to avoid
unnecessary work if we didn't allocate the delalloc extent or wrote
the entire region we allocated.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c |   39 ++++++++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5cea069a38b4..a2e45ea1b0cb 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1147,6 +1147,10 @@ xfs_buffered_write_iomap_end(
 		written = 0;
 	}
 
+	/* If we didn't reserve the blocks, we're not allowed to punch them. */
+	if (!(iomap->flags & IOMAP_F_NEW))
+		return 0;
+
 	/*
 	 * start_fsb refers to the first unused block after a short write. If
 	 * nothing was written, round offset down to point at the first block in
@@ -1158,27 +1162,28 @@ xfs_buffered_write_iomap_end(
 		start_fsb = XFS_B_TO_FSB(mp, offset + written);
 	end_fsb = XFS_B_TO_FSB(mp, offset + length);
 
+	/* Nothing to do if we've written the entire delalloc extent */
+	if (start_fsb >= end_fsb)
+		return 0;
+
 	/*
-	 * Trim delalloc blocks if they were allocated by this write and we
-	 * didn't manage to write the whole range.
-	 *
-	 * We don't need to care about racing delalloc as we hold i_mutex
-	 * across the reserve/allocate/unreserve calls. If there are delalloc
-	 * blocks in the range, they are ours.
+	 * Lock the mapping to avoid races with page faults re-instantiating
+	 * folios and dirtying them via ->page_mkwrite between the page cache
+	 * truncation and the delalloc extent removal. Failing to do this can
+	 * leave dirty pages with no space reservation in the cache.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && start_fsb < end_fsb) {
-		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
-					 XFS_FSB_TO_B(mp, end_fsb) - 1);
+	filemap_invalidate_lock(inode->i_mapping);
+	truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
+				 XFS_FSB_TO_B(mp, end_fsb) - 1);
 
-		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-					       end_fsb - start_fsb);
-		if (error && !xfs_is_shutdown(mp)) {
-			xfs_alert(mp, "%s: unable to clean up ino %lld",
-				__func__, ip->i_ino);
-			return error;
-		}
+	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
+				       end_fsb - start_fsb);
+	filemap_invalidate_unlock(inode->i_mapping);
+	if (error && !xfs_is_shutdown(mp)) {
+		xfs_alert(mp, "%s: unable to clean up ino %lld",
+			__func__, ip->i_ino);
+		return error;
 	}
-
 	return 0;
 }
 



* [PATCH 03/14] xfs: use byte ranges for write cleanup ranges
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
  2022-11-09 18:15 ` [PATCH 01/14] xfs: write page faults in iomap are not buffered writes Darrick J. Wong
  2022-11-09 18:15 ` [PATCH 02/14] xfs: punching delalloc extents on write failure is racy Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-09 18:16 ` [PATCH 04/14] xfs: buffered write failure should not truncate the page cache Darrick J. Wong
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong
  Cc: Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel, david, hch

From: Dave Chinner <dchinner@redhat.com>

xfs_buffered_write_iomap_end() currently converts the byte ranges
passed to it to filesystem blocks to pass them to the bmap code to
punch out delalloc blocks, but then has to convert filesystem
blocks back to byte ranges for page cache truncate.

We're about to make the page cache truncate go away and replace it
with a page cache walk, so having to convert everything to/from/to
filesystem blocks is messy and error-prone. It is much easier to
pass around byte ranges and convert to page indexes and/or
filesystem blocks only where those units are needed.

In preparation for the page cache walk being added, add a helper
that converts byte ranges to filesystem blocks and calls
xfs_bmap_punch_delalloc_range() and convert
xfs_buffered_write_iomap_end() to calculate limits in byte ranges.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c |   40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index a2e45ea1b0cb..7bb55dbc19d3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1120,6 +1120,20 @@ xfs_buffered_write_iomap_begin(
 	return error;
 }
 
+static int
+xfs_buffered_write_delalloc_punch(
+	struct inode		*inode,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	struct xfs_mount	*mp = XFS_M(inode->i_sb);
+	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, start_byte);
+	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end_byte);
+
+	return xfs_bmap_punch_delalloc_range(XFS_I(inode), start_fsb,
+				end_fsb - start_fsb);
+}
+
 static int
 xfs_buffered_write_iomap_end(
 	struct inode		*inode,
@@ -1129,10 +1143,9 @@ xfs_buffered_write_iomap_end(
 	unsigned		flags,
 	struct iomap		*iomap)
 {
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_mount	*mp = ip->i_mount;
-	xfs_fileoff_t		start_fsb;
-	xfs_fileoff_t		end_fsb;
+	struct xfs_mount	*mp = XFS_M(inode->i_sb);
+	loff_t			start_byte;
+	loff_t			end_byte;
 	int			error = 0;
 
 	if (iomap->type != IOMAP_DELALLOC)
@@ -1157,13 +1170,13 @@ xfs_buffered_write_iomap_end(
 	 * the range.
 	 */
 	if (unlikely(!written))
-		start_fsb = XFS_B_TO_FSBT(mp, offset);
+		start_byte = round_down(offset, mp->m_sb.sb_blocksize);
 	else
-		start_fsb = XFS_B_TO_FSB(mp, offset + written);
-	end_fsb = XFS_B_TO_FSB(mp, offset + length);
+		start_byte = round_up(offset + written, mp->m_sb.sb_blocksize);
+	end_byte = round_up(offset + length, mp->m_sb.sb_blocksize);
 
 	/* Nothing to do if we've written the entire delalloc extent */
-	if (start_fsb >= end_fsb)
+	if (start_byte >= end_byte)
 		return 0;
 
 	/*
@@ -1173,15 +1186,12 @@ xfs_buffered_write_iomap_end(
 	 * leave dirty pages with no space reservation in the cache.
 	 */
 	filemap_invalidate_lock(inode->i_mapping);
-	truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
-				 XFS_FSB_TO_B(mp, end_fsb) - 1);
-
-	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-				       end_fsb - start_fsb);
+	truncate_pagecache_range(inode, start_byte, end_byte - 1);
+	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
 	filemap_invalidate_unlock(inode->i_mapping);
 	if (error && !xfs_is_shutdown(mp)) {
-		xfs_alert(mp, "%s: unable to clean up ino %lld",
-			__func__, ip->i_ino);
+		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
+			__func__, XFS_I(inode)->i_ino);
 		return error;
 	}
 	return 0;



* [PATCH 04/14] xfs: buffered write failure should not truncate the page cache
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 03/14] xfs: use byte ranges for write cleanup ranges Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-15  8:32   ` Christoph Hellwig
  2022-11-09 18:16 ` [PATCH 05/14] iomap: write iomap validity checks Darrick J. Wong
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, david, hch

From: Dave Chinner <dchinner@redhat.com>

xfs_buffered_write_iomap_end() currently invalidates the page cache
over the unused range of the delalloc extent it allocated. While the
write allocated the delalloc extent, it does not own it exclusively
as the write does not hold any locks that prevent either writeback
or mmap page faults from changing the state of either the page cache
or the extent state backing this range.

Whilst xfs_bmap_punch_delalloc_range() already handles races in
extent conversion - it will only punch out delalloc extents and it
ignores any other type of extent - the page cache truncate does not
discriminate between data written by this write and some other task.
As a result, truncating the page cache can result in data corruption
if the write races with mmap modifications to the file over the same
range.

generic/346 exercises this workload, and if we randomly fail writes
(as will happen when iomap gets stale iomap detection later in the
patchset), it will randomly corrupt the file data because it removes
data written by mmap() in the same page as the write() that failed.

Hence we do not want to punch out the page cache over the range of
the extent we failed to write to - what we actually need to do is
detect the ranges that have dirty data in cache over them and *not
punch them out*.

To do this, we have to walk the page cache over the range of the
delalloc extent we want to remove. This is made complex by the fact
we have to handle partially up-to-date folios correctly and this can
happen even when the FSB size == PAGE_SIZE because we now support
multi-page folios in the page cache.

Because we are only interested in discovering the edges of data
ranges in the page cache (i.e. hole-data boundaries) we can make use
of mapping_seek_hole_data() to find those transitions in the page
cache. As we hold the invalidate_lock, we know that the boundaries
are not going to change while we walk the range. This interface is
also byte-based and is sub-page block aware, so we can find the data
ranges in the cache based on byte offsets rather than page, folio or
fs block sized chunks. This greatly simplifies the logic of finding
dirty cached ranges in the page cache.

Once we've identified a range that contains cached data, we can then
iterate the range folio by folio. This allows us to determine if the
data is dirty and hence perform the correct delalloc extent punching
operations. The seek interface we use to iterate data ranges will
give us sub-folio start/end granularity, so we may end up looking up
the same folio multiple times as the seek interface iterates across
each discontiguous data region in the folio.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c |  151 +++++++++++++++++++++++++++++++++++++++++++++++++---
 mm/filemap.c       |    1 
 2 files changed, 142 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7bb55dbc19d3..2d48fcc7bd6f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
 				end_fsb - start_fsb);
 }
 
+/*
+ * Scan the data range passed to us for dirty page cache folios. If we find a
+ * dirty folio, punch out the preceding range and update the offset from which
+ * the next punch will start.
+ *
+ * We can punch out clean pages because they either contain data that has been
+ * written back - in which case the delalloc punch over that range is a no-op -
+ * or they were instantiated by read faults, in which case they contain zeroes
+ * and we can remove the delalloc backing range; any new writes to those pages
+ * will do the normal hole filling operation...
+ *
+ * This makes the logic simple: we only need to keep the delalloc extents
+ * over the dirty ranges of the page cache.
+ */
+static int
+xfs_buffered_write_delalloc_scan(
+	struct inode		*inode,
+	loff_t			*punch_start_byte,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	loff_t			offset = start_byte;
+
+	while (offset < end_byte) {
+		struct folio	*folio;
+
+		/* grab locked page */
+		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
+		if (!folio) {
+			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
+			continue;
+		}
+
+		/* if dirty, punch up to offset */
+		if (folio_test_dirty(folio)) {
+			if (offset > *punch_start_byte) {
+				int	error;
+
+				error = xfs_buffered_write_delalloc_punch(inode,
+						*punch_start_byte, offset);
+				if (error) {
+					folio_unlock(folio);
+					folio_put(folio);
+					return error;
+				}
+			}
+
+			/*
+			 * Make sure the next punch start is correctly bound to
+			 * the end of this data range, not the end of the folio.
+			 */
+			*punch_start_byte = min_t(loff_t, end_byte,
+					folio_next_index(folio) << PAGE_SHIFT);
+		}
+
+		/* move offset to start of next folio in range */
+		offset = folio_next_index(folio) << PAGE_SHIFT;
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+	return 0;
+}
+
+/*
+ * Punch out all the delalloc blocks in the range given except for those that
+ * have dirty data still pending in the page cache - those are going to be
+ * written and so must still retain the delalloc backing for writeback.
+ *
+ * As we are scanning the page cache for data, we don't need to reimplement the
+ * wheel - mapping_seek_hole_data() does exactly what we need to identify the
+ * start and end of data ranges correctly even for sub-folio block sizes. This
+ * byte range based iteration is especially convenient because it means we don't
+ * have to care about variable size folios, nor where the start or end of the
+ * data range lies within a folio, if they lie within the same folio or even if
+ * there are multiple discontiguous data ranges within the folio.
+ */
+static int
+xfs_buffered_write_delalloc_release(
+	struct inode		*inode,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	loff_t			punch_start_byte = start_byte;
+	int			error = 0;
+
+	/*
+	 * Lock the mapping to avoid races with page faults re-instantiating
+	 * folios and dirtying them via ->page_mkwrite whilst we walk the
+	 * cache and perform delalloc extent removal. Failing to do this can
+	 * leave dirty pages with no space reservation in the cache.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	while (start_byte < end_byte) {
+		loff_t		data_end;
+
+		start_byte = mapping_seek_hole_data(inode->i_mapping,
+				start_byte, end_byte, SEEK_DATA);
+		/*
+		 * If there is no more data to scan, all that is left is to
+		 * punch out the remaining range.
+		 */
+		if (start_byte == -ENXIO || start_byte == end_byte)
+			break;
+		if (start_byte < 0) {
+			error = start_byte;
+			goto out_unlock;
+		}
+		ASSERT(start_byte >= punch_start_byte);
+		ASSERT(start_byte < end_byte);
+
+		/*
+		 * We find the end of this contiguous cached data range by
+		 * seeking from start_byte to the beginning of the next hole.
+		 */
+		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
+				end_byte, SEEK_HOLE);
+		if (data_end < 0) {
+			error = data_end;
+			goto out_unlock;
+		}
+		ASSERT(data_end > start_byte);
+		ASSERT(data_end <= end_byte);
+
+		error = xfs_buffered_write_delalloc_scan(inode,
+				&punch_start_byte, start_byte, data_end);
+		if (error)
+			goto out_unlock;
+
+		/* The next data search starts at the end of this one. */
+		start_byte = data_end;
+	}
+
+	if (punch_start_byte < end_byte)
+		error = xfs_buffered_write_delalloc_punch(inode,
+				punch_start_byte, end_byte);
+out_unlock:
+	filemap_invalidate_unlock(inode->i_mapping);
+	return error;
+}
+
 static int
 xfs_buffered_write_iomap_end(
 	struct inode		*inode,
@@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end(
 	if (start_byte >= end_byte)
 		return 0;
 
-	/*
-	 * Lock the mapping to avoid races with page faults re-instantiating
-	 * folios and dirtying them via ->page_mkwrite between the page cache
-	 * truncation and the delalloc extent removal. Failing to do this can
-	 * leave dirty pages with no space reservation in the cache.
-	 */
-	filemap_invalidate_lock(inode->i_mapping);
-	truncate_pagecache_range(inode, start_byte, end_byte - 1);
-	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
-	filemap_invalidate_unlock(inode->i_mapping);
+	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
 	if (error && !xfs_is_shutdown(mp)) {
 		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
 			__func__, XFS_I(inode)->i_ino);
diff --git a/mm/filemap.c b/mm/filemap.c
index 08341616ae7a..e95239a418de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2925,6 +2925,7 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
 		return end;
 	return start;
 }
+EXPORT_SYMBOL(mapping_seek_hole_data);
 
 #ifdef CONFIG_MMU
 #define MMAP_LOTSAMISS  (100)



* [PATCH 05/14] iomap: write iomap validity checks
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (3 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 04/14] xfs: buffered write failure should not truncate the page cache Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-09 18:16 ` [PATCH 06/14] xfs: use iomap_valid method to detect stale cached iomaps Darrick J. Wong
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, david, hch

From: Dave Chinner <dchinner@redhat.com>

A recent multithreaded write data corruption has been uncovered in
the iomap write code. The core of the problem is that partial folio
writes can be flushed to disk while a new racing write maps the same
range and fills the rest of the page:

writeback			new write

allocate blocks
  blocks are unwritten
submit IO
.....
				map blocks
				iomap indicates UNWRITTEN range
				loop {
				  lock folio
				  copyin data
.....
IO completes
  runs unwritten extent conv
    blocks are marked written
				  <iomap now stale>
				  get next folio
				}

Now add memory pressure such that memory reclaim evicts the
partially written folio that has already been written to disk.

When the new write finally gets to the last partial page of the new
write, it does not find it in cache, so it instantiates a new page,
sees the iomap is unwritten, and zeros the part of the page that
it does not have data from. This overwrites the data on disk that
was originally written.

The full description of the corruption mechanism can be found here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

To solve this problem, we need to check whether the iomap is still
valid after we lock each folio during the write. We have to do it
after we lock the page so that we don't end up with state changes
occurring while we wait for the folio to be locked.

Hence we need a mechanism to be able to check that the cached iomap
is still valid (similar to what we already do in buffered
writeback), and we need a way for ->begin_write to back out and
tell the high level iomap iterator that we need to remap the
remaining write range.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c |   53 ++++++++++++++++++++++++++++++++++++++----------
 fs/iomap/iter.c        |   19 ++++++++++++++++-
 include/linux/iomap.h  |   17 +++++++++++++++
 3 files changed, 77 insertions(+), 12 deletions(-)


diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 91ee0b308e13..d3c565aa29f8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -584,8 +584,9 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
 	return iomap_read_inline_data(iter, folio);
 }
 
-static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
-		size_t len, struct folio **foliop)
+static int iomap_write_begin(struct iomap_iter *iter,
+		const struct iomap_ops *ops, loff_t pos, size_t len,
+		struct folio **foliop)
 {
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -618,6 +619,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
 		goto out_no_page;
 	}
+
+	/*
+	 * Now we have a locked folio, before we do anything with it we need to
+	 * check that the iomap we have cached is not stale. The inode extent
+	 * mapping can change due to concurrent IO in flight (e.g.
+	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
+	 * reclaimed a previously partially written page at this index after IO
+	 * completion before this write reaches this file offset) and hence we
+	 * could do the wrong thing here (zero a page range incorrectly or fail
+	 * to zero) and corrupt data.
+	 */
+	if (ops->iomap_valid) {
+		bool iomap_valid = ops->iomap_valid(iter->inode, &iter->iomap);
+
+		if (!iomap_valid) {
+			iter->iomap.flags |= IOMAP_F_STALE;
+			status = 0;
+			goto out_unlock;
+		}
+	}
+
 	if (pos + len > folio_pos(folio) + folio_size(folio))
 		len = folio_pos(folio) + folio_size(folio) - pos;
 
@@ -727,7 +749,8 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 	return ret;
 }
 
-static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
+static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
+		const struct iomap_ops *ops)
 {
 	loff_t length = iomap_length(iter);
 	loff_t pos = iter->pos;
@@ -770,9 +793,11 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			break;
 		}
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
+		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
 		if (unlikely(status))
 			break;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		page = folio_file_page(folio, pos >> PAGE_SHIFT);
 		if (mapping_writably_mapped(mapping))
@@ -825,14 +850,15 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 		iter.flags |= IOMAP_NOWAIT;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_write_iter(&iter, i);
+		iter.processed = iomap_write_iter(&iter, i, ops);
 	if (iter.pos == iocb->ki_pos)
 		return ret;
 	return iter.pos - iocb->ki_pos;
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
-static loff_t iomap_unshare_iter(struct iomap_iter *iter)
+static loff_t iomap_unshare_iter(struct iomap_iter *iter,
+		const struct iomap_ops *ops)
 {
 	struct iomap *iomap = &iter->iomap;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -853,9 +879,11 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
 		struct folio *folio;
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
+		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
 		if (unlikely(status))
 			return status;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		status = iomap_write_end(iter, pos, bytes, bytes, folio);
 		if (WARN_ON_ONCE(status == 0))
@@ -886,12 +914,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 	int ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_unshare_iter(&iter);
+		iter.processed = iomap_unshare_iter(&iter, ops);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
+static loff_t iomap_zero_iter(struct iomap_iter *iter,
+		const struct iomap_ops *ops, bool *did_zero)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
@@ -908,9 +937,11 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
+		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
 		if (status)
 			return status;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		offset = offset_in_folio(folio, pos);
 		if (bytes > folio_size(folio) - offset)
@@ -946,7 +977,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	int ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero);
+		iter.processed = iomap_zero_iter(&iter, ops, did_zero);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index a1c7592d2ade..79a0614eaab7 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -7,12 +7,28 @@
 #include <linux/iomap.h>
 #include "trace.h"
 
+/*
+ * Advance to the next range we need to map.
+ *
+ * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
+ * processed - it was aborted because the extent the iomap spanned may have been
+ * changed during the operation. In this case, the iteration behaviour is to
+ * remap the unprocessed range of the iter, and that means we may need to remap
+ * even when we've made no progress (i.e. iter->processed = 0). Hence the
+ * "finished iterating" case needs to distinguish between
+ * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
+ * need to remap the entire remaining range.
+ */
 static inline int iomap_iter_advance(struct iomap_iter *iter)
 {
+	bool stale = iter->iomap.flags & IOMAP_F_STALE;
+
 	/* handle the previous iteration (if any) */
 	if (iter->iomap.length) {
-		if (iter->processed <= 0)
+		if (iter->processed < 0)
 			return iter->processed;
+		if (!iter->processed && !stale)
+			return 0;
 		if (WARN_ON_ONCE(iter->processed > iomap_length(iter)))
 			return -EIO;
 		iter->pos += iter->processed;
@@ -33,6 +49,7 @@ static inline void iomap_iter_done(struct iomap_iter *iter)
 	WARN_ON_ONCE(iter->iomap.offset > iter->pos);
 	WARN_ON_ONCE(iter->iomap.length == 0);
 	WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
+	WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_STALE);
 
 	trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
 	if (iter->srcmap.type != IOMAP_HOLE)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17..308931f0840a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -62,8 +62,13 @@ struct vm_fault;
  *
  * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
  * has changed as the result of this write operation.
+ *
+ * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
+ * range it covers needs to be remapped by the high level before the operation
+ * can proceed.
  */
 #define IOMAP_F_SIZE_CHANGED	0x100
+#define IOMAP_F_STALE		0x200
 
 /*
  * Flags from 0x1000 up are for file system specific usage:
@@ -165,6 +170,18 @@ struct iomap_ops {
 	 */
 	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
 			ssize_t written, unsigned flags, struct iomap *iomap);
+
+	/*
+	 * Check that the cached iomap still maps correctly to the filesystem's
+	 * internal extent map. FS internal extent maps can change while iomap
+	 * is iterating a cached iomap, so this hook allows iomap to detect that
+	 * the iomap needs to be refreshed during a long running write
+	 * operation.
+	 *
+	 * This is called with the folio over the specified file position
+	 * held locked by the iomap code.
+	 */
+	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
 };
 
 /**



* [PATCH 06/14] xfs: use iomap_valid method to detect stale cached iomaps
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (4 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 05/14] iomap: write iomap validity checks Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-09 18:16 ` [PATCH 07/14] xfs: drop write error injection is unfixable, remove it Darrick J. Wong
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, david, hch

From: Dave Chinner <dchinner@redhat.com>

Now that iomap supports a mechanism to validate cached iomaps for
buffered write operations, hook it up to the XFS buffered write ops
so that we can avoid data corruptions that result from stale cached
iomaps. See:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

or the ->iomap_valid() introduction commit for exact details of the
corruption vector.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |    4 +--
 fs/xfs/xfs_aops.c        |    2 +
 fs/xfs/xfs_iomap.c       |   69 +++++++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_iomap.h       |    4 +--
 fs/xfs/xfs_pnfs.c        |    5 ++-
 5 files changed, 61 insertions(+), 23 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 49d0d4ea63fc..db225130618c 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4551,8 +4551,8 @@ xfs_bmapi_convert_delalloc(
 	 * the extent.  Just return the real extent at this offset.
 	 */
 	if (!isnullstartblock(bma.got.br_startblock)) {
-		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
 		*seq = READ_ONCE(ifp->if_seq);
+		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
 		goto out_trans_cancel;
 	}
 
@@ -4599,8 +4599,8 @@ xfs_bmapi_convert_delalloc(
 	XFS_STATS_INC(mp, xs_xstrat_quick);
 
 	ASSERT(!isnullstartblock(bma.got.br_startblock));
-	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
 	*seq = READ_ONCE(ifp->if_seq);
+	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
 
 	if (whichfork == XFS_COW_FORK)
 		xfs_refcount_alloc_cow_extent(tp, bma.blkno, bma.length);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5d1a995b15f8..ca5a9e45a48c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -373,7 +373,7 @@ xfs_map_blocks(
 	    isnullstartblock(imap.br_startblock))
 		goto allocate_blocks;
 
-	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0);
+	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0, XFS_WPC(wpc)->data_seq);
 	trace_xfs_map_blocks_found(ip, offset, count, whichfork, &imap);
 	return 0;
 allocate_blocks:
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2d48fcc7bd6f..5053ffcf10fe 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -54,7 +54,8 @@ xfs_bmbt_to_iomap(
 	struct iomap		*iomap,
 	struct xfs_bmbt_irec	*imap,
 	unsigned int		mapping_flags,
-	u16			iomap_flags)
+	u16			iomap_flags,
+	int			sequence)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
@@ -91,6 +92,9 @@ xfs_bmbt_to_iomap(
 	if (xfs_ipincount(ip) &&
 	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
 		iomap->flags |= IOMAP_F_DIRTY;
+
+	/* The extent tree sequence is needed for iomap validity checking. */
+	*((int *)&iomap->private) = sequence;
 	return 0;
 }
 
@@ -195,7 +199,8 @@ xfs_iomap_write_direct(
 	xfs_fileoff_t		offset_fsb,
 	xfs_fileoff_t		count_fsb,
 	unsigned int		flags,
-	struct xfs_bmbt_irec	*imap)
+	struct xfs_bmbt_irec	*imap,
+	int			*seq)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -285,6 +290,8 @@ xfs_iomap_write_direct(
 		error = xfs_alert_fsblock_zero(ip, imap);
 
 out_unlock:
+	if (seq)
+		*seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 
@@ -743,6 +750,7 @@ xfs_direct_write_iomap_begin(
 	bool			shared = false;
 	u16			iomap_flags = 0;
 	unsigned int		lockmode = XFS_ILOCK_SHARED;
+	int			seq;
 
 	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
 
@@ -811,9 +819,10 @@ xfs_direct_write_iomap_begin(
 			goto out_unlock;
 	}
 
+	seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, lockmode);
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
 
 allocate_blocks:
 	error = -EAGAIN;
@@ -839,24 +848,25 @@ xfs_direct_write_iomap_begin(
 	xfs_iunlock(ip, lockmode);
 
 	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
-			flags, &imap);
+			flags, &imap, &seq);
 	if (error)
 		return error;
 
 	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
-				 iomap_flags | IOMAP_F_NEW);
+				 iomap_flags | IOMAP_F_NEW, seq);
 
 out_found_cow:
+	seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, lockmode);
 	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
 	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
 	if (imap.br_startblock != HOLESTARTBLOCK) {
-		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
+		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
 		if (error)
 			return error;
 	}
-	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
 
 out_unlock:
 	if (lockmode)
@@ -915,6 +925,7 @@ xfs_buffered_write_iomap_begin(
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
 	unsigned int		lockmode = XFS_ILOCK_EXCL;
+	int			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -1094,26 +1105,29 @@ xfs_buffered_write_iomap_begin(
 	 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
 	 * them out if the write happens to fail.
 	 */
+	seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
 
 found_imap:
+	seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 
 found_cow:
+	seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	if (imap.br_startoff <= offset_fsb) {
-		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
+		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
 		if (error)
 			return error;
 		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
-					 IOMAP_F_SHARED);
+					 IOMAP_F_SHARED, seq);
 	}
 
 	xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
-	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
 
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1328,9 +1342,26 @@ xfs_buffered_write_iomap_end(
 	return 0;
 }
 
+/*
+ * Check that the iomap passed to us is still valid for the given offset and
+ * length.
+ */
+static bool
+xfs_buffered_write_iomap_valid(
+	struct inode		*inode,
+	const struct iomap	*iomap)
+{
+	int			seq = *((int *)&iomap->private);
+
+	if (seq != READ_ONCE(XFS_I(inode)->i_df.if_seq))
+		return false;
+	return true;
+}
+
 const struct iomap_ops xfs_buffered_write_iomap_ops = {
 	.iomap_begin		= xfs_buffered_write_iomap_begin,
 	.iomap_end		= xfs_buffered_write_iomap_end,
+	.iomap_valid		= xfs_buffered_write_iomap_valid,
 };
 
 /*
@@ -1359,6 +1390,7 @@ xfs_read_iomap_begin(
 	int			nimaps = 1, error = 0;
 	bool			shared = false;
 	unsigned int		lockmode = XFS_ILOCK_SHARED;
+	int			seq;
 
 	ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));
 
@@ -1372,13 +1404,14 @@ xfs_read_iomap_begin(
 			       &nimaps, 0);
 	if (!error && (flags & IOMAP_REPORT))
 		error = xfs_reflink_trim_around_shared(ip, &imap, &shared);
+	seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, lockmode);
 
 	if (error)
 		return error;
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
-				 shared ? IOMAP_F_SHARED : 0);
+				 shared ? IOMAP_F_SHARED : 0, seq);
 }
 
 const struct iomap_ops xfs_read_iomap_ops = {
@@ -1438,7 +1471,7 @@ xfs_seek_iomap_begin(
 			end_fsb = min(end_fsb, data_fsb);
 		xfs_trim_extent(&cmap, offset_fsb, end_fsb);
 		error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
-					  IOMAP_F_SHARED);
+				IOMAP_F_SHARED, READ_ONCE(ip->i_cowfp->if_seq));
 		/*
 		 * This is a COW extent, so we must probe the page cache
 		 * because there could be dirty page cache being backed
@@ -1460,7 +1493,8 @@ xfs_seek_iomap_begin(
 	imap.br_state = XFS_EXT_NORM;
 done:
 	xfs_trim_extent(&imap, offset_fsb, end_fsb);
-	error = xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0,
+			READ_ONCE(ip->i_df.if_seq));
 out_unlock:
 	xfs_iunlock(ip, lockmode);
 	return error;
@@ -1486,6 +1520,7 @@ xfs_xattr_iomap_begin(
 	struct xfs_bmbt_irec	imap;
 	int			nimaps = 1, error = 0;
 	unsigned		lockmode;
+	int			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -1502,12 +1537,14 @@ xfs_xattr_iomap_begin(
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
 			       &nimaps, XFS_BMAPI_ATTRFORK);
 out_unlock:
+
+	seq = READ_ONCE(ip->i_af.if_seq);
 	xfs_iunlock(ip, lockmode);
 
 	if (error)
 		return error;
 	ASSERT(nimaps);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 }
 
 const struct iomap_ops xfs_xattr_iomap_ops = {
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 0f62ab633040..792fed2a9072 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -13,14 +13,14 @@ struct xfs_bmbt_irec;
 
 int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb,
 		xfs_fileoff_t count_fsb, unsigned int flags,
-		struct xfs_bmbt_irec *imap);
+		struct xfs_bmbt_irec *imap, int *sequence);
 int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t, bool);
 xfs_fileoff_t xfs_iomap_eof_align_last_fsb(struct xfs_inode *ip,
 		xfs_fileoff_t end_fsb);
 
 int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
 		struct xfs_bmbt_irec *imap, unsigned int mapping_flags,
-		u16 iomap_flags);
+		u16 iomap_flags, int sequence);
 
 int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
 		bool *did_zero);
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 37a24f0f7cd4..eea507a80c5c 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -125,6 +125,7 @@ xfs_fs_map_blocks(
 	int			nimaps = 1;
 	uint			lock_flags;
 	int			error = 0;
+	int			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -189,7 +190,7 @@ xfs_fs_map_blocks(
 		xfs_iunlock(ip, lock_flags);
 
 		error = xfs_iomap_write_direct(ip, offset_fsb,
-				end_fsb - offset_fsb, 0, &imap);
+				end_fsb - offset_fsb, 0, &imap, &seq);
 		if (error)
 			goto out_unlock;
 
@@ -209,7 +210,7 @@ xfs_fs_map_blocks(
 	}
 	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 
-	error = xfs_bmbt_to_iomap(ip, iomap, &imap, 0, 0);
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, 0, 0, seq);
 	*device_generation = mp->m_generation;
 	return error;
 out_unlock:



* [PATCH 07/14] xfs: drop write error injection is unfixable, remove it
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (5 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 06/14] xfs: use iomap_valid method to detect stale cached iomaps Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-09 18:16 ` [PATCH 08/14] iomap: pass iter to ->iomap_begin implementations Darrick J. Wong
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, david, hch

From: Dave Chinner <dchinner@redhat.com>

With the changes to scan the page cache for dirty data to avoid data
corruptions from partial write cleanup racing with other page cache
operations, the drop writes error injection no longer works the same
way it used to and causes xfs/196 to fail. This is because xfs/196
writes to the file and populates the page cache before it turns on
the error injection and starts failing -overwrites-.

The result is that the original drop-writes code failed writes only
-after- overwriting the data in the cache, then invalidated the
cached data, then punched out the delalloc extent from under that
data.

On the surface, this looks fine. The problem is that page cache
invalidation *doesn't guarantee that it removes anything from the
page cache* and it doesn't change the dirty state of the folio. When
block size == page size and we do page aligned IO (as xfs/196 does)
everything happens to align perfectly and page cache invalidation
removes the single page folios that span the written data. Hence the
followup delalloc punch pass does not find cached data over that
range and it can punch the extent out.

IOWs, xfs/196 "works" for block size == page size with the new
code. I say "works", because it actually only works for the case
where IO is page aligned, and no data was read from disk before
writes occur: the moment we actually read data first, the readahead
code allocates multipage folios and suddenly the invalidate code
goes back to zeroing subfolio ranges without changing dirty state.

Hence, with multipage folios in play, block size == page size is
functionally identical to block size < page size behaviour, and
drop-writes is manifestly broken w.r.t. this case. Invalidation of
a subfolio range doesn't result in the folio being removed from the
cache, just the range gets zeroed. Hence after we've sequentially
walked over a folio that we've dirtied (via write data) and then
invalidated, we end up with a dirty folio full of zeroed data.

And because the new code skips punching ranges that have dirty
folios covering them, we end up leaving the delalloc range intact
after failing all the writes. Hence failed writes now end up
writing zeroes to disk in the cases where invalidation zeroes folios
rather than removing them from cache.
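
To make the failure mode concrete, here is a hedged sketch of why the
new punch pass must leave ranges alone while a dirty folio still
covers them. It is illustrative only; the series scans the page cache
with mapping_seek_hole_data rather than walking folios like this:

	/*
	 * Report whether any folio in the index range is still dirty.
	 * If so, punching out the delalloc extent underneath it would
	 * tear the blocks out from under data that writeback still
	 * has to write.
	 */
	static bool range_has_dirty_folio(struct address_space *mapping,
			pgoff_t start, pgoff_t end)
	{
		struct folio *folio;

		while (start <= end) {
			folio = filemap_get_folio(mapping, start);
			if (!folio) {
				start++;
				continue;
			}
			if (folio_test_dirty(folio)) {
				folio_put(folio);
				return true;
			}
			start = folio_next_index(folio);
			folio_put(folio);
		}
		return false;
	}

Because invalidation of a subfolio range leaves the folio dirty, a
check like this keeps returning true and the delalloc extent survives
the failed write, which is exactly the behaviour drop-writes can no
longer simulate.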

This is a fundamental change of behaviour that is needed to avoid
the data corruption vectors that exist in the old write fail path,
and it renders the drop-writes injection non-functional and
unworkable as it stands.

As it is, I think the error injection is also now unnecessary, as
partial writes that need delalloc extent removal are going to be a lot more
common with stale iomap detection in place. Hence this patch removes
the drop-writes error injection completely. xfs/196 can remain for
testing kernels that don't have this data corruption fix, but those
that do will report:

xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_errortag.h |   12 +++++-------
 fs/xfs/xfs_error.c           |   27 ++++++++++++++++++++-------
 fs/xfs/xfs_iomap.c           |    9 ---------
 3 files changed, 25 insertions(+), 23 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 5362908164b0..580ccbd5aadc 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -40,13 +40,12 @@
 #define XFS_ERRTAG_REFCOUNT_FINISH_ONE			25
 #define XFS_ERRTAG_BMAP_FINISH_ONE			26
 #define XFS_ERRTAG_AG_RESV_CRITICAL			27
+
 /*
- * DEBUG mode instrumentation to test and/or trigger delayed allocation
- * block killing in the event of failed writes. When enabled, all
- * buffered writes are silenty dropped and handled as if they failed.
- * All delalloc blocks in the range of the write (including pre-existing
- * delalloc blocks!) are tossed as part of the write failure error
- * handling sequence.
+ * Drop-writes support removed because write error handling cannot trash
+ * pre-existing delalloc extents in any useful way anymore. We retain the
+ * definition so that we can reject it as an invalid value in
+ * xfs_errortag_valid().
  */
 #define XFS_ERRTAG_DROP_WRITES				28
 #define XFS_ERRTAG_LOG_BAD_CRC				29
@@ -95,7 +94,6 @@
 #define XFS_RANDOM_REFCOUNT_FINISH_ONE			1
 #define XFS_RANDOM_BMAP_FINISH_ONE			1
 #define XFS_RANDOM_AG_RESV_CRITICAL			4
-#define XFS_RANDOM_DROP_WRITES				1
 #define XFS_RANDOM_LOG_BAD_CRC				1
 #define XFS_RANDOM_LOG_ITEM_PIN				1
 #define XFS_RANDOM_BUF_LRU_REF				2
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index c6b2aabd6f18..dea3c0649d2f 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -46,7 +46,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_REFCOUNT_FINISH_ONE,
 	XFS_RANDOM_BMAP_FINISH_ONE,
 	XFS_RANDOM_AG_RESV_CRITICAL,
-	XFS_RANDOM_DROP_WRITES,
+	0, /* XFS_RANDOM_DROP_WRITES has been removed */
 	XFS_RANDOM_LOG_BAD_CRC,
 	XFS_RANDOM_LOG_ITEM_PIN,
 	XFS_RANDOM_BUF_LRU_REF,
@@ -162,7 +162,6 @@ XFS_ERRORTAG_ATTR_RW(refcount_continue_update,	XFS_ERRTAG_REFCOUNT_CONTINUE_UPDA
 XFS_ERRORTAG_ATTR_RW(refcount_finish_one,	XFS_ERRTAG_REFCOUNT_FINISH_ONE);
 XFS_ERRORTAG_ATTR_RW(bmap_finish_one,	XFS_ERRTAG_BMAP_FINISH_ONE);
 XFS_ERRORTAG_ATTR_RW(ag_resv_critical,	XFS_ERRTAG_AG_RESV_CRITICAL);
-XFS_ERRORTAG_ATTR_RW(drop_writes,	XFS_ERRTAG_DROP_WRITES);
 XFS_ERRORTAG_ATTR_RW(log_bad_crc,	XFS_ERRTAG_LOG_BAD_CRC);
 XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
 XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
@@ -206,7 +205,6 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(refcount_finish_one),
 	XFS_ERRORTAG_ATTR_LIST(bmap_finish_one),
 	XFS_ERRORTAG_ATTR_LIST(ag_resv_critical),
-	XFS_ERRORTAG_ATTR_LIST(drop_writes),
 	XFS_ERRORTAG_ATTR_LIST(log_bad_crc),
 	XFS_ERRORTAG_ATTR_LIST(log_item_pin),
 	XFS_ERRORTAG_ATTR_LIST(buf_lru_ref),
@@ -256,6 +254,19 @@ xfs_errortag_del(
 	kmem_free(mp->m_errortag);
 }
 
+static bool
+xfs_errortag_valid(
+	unsigned int		error_tag)
+{
+	if (error_tag >= XFS_ERRTAG_MAX)
+		return false;
+
+	/* Error out removed injection types */
+	if (error_tag == XFS_ERRTAG_DROP_WRITES)
+		return false;
+	return true;
+}
+
 bool
 xfs_errortag_test(
 	struct xfs_mount	*mp,
@@ -277,7 +288,9 @@ xfs_errortag_test(
 	if (!mp->m_errortag)
 		return false;
 
-	ASSERT(error_tag < XFS_ERRTAG_MAX);
+	if (!xfs_errortag_valid(error_tag))
+		return false;
+
 	randfactor = mp->m_errortag[error_tag];
 	if (!randfactor || prandom_u32_max(randfactor))
 		return false;
@@ -293,7 +306,7 @@ xfs_errortag_get(
 	struct xfs_mount	*mp,
 	unsigned int		error_tag)
 {
-	if (error_tag >= XFS_ERRTAG_MAX)
+	if (!xfs_errortag_valid(error_tag))
 		return -EINVAL;
 
 	return mp->m_errortag[error_tag];
@@ -305,7 +318,7 @@ xfs_errortag_set(
 	unsigned int		error_tag,
 	unsigned int		tag_value)
 {
-	if (error_tag >= XFS_ERRTAG_MAX)
+	if (!xfs_errortag_valid(error_tag))
 		return -EINVAL;
 
 	mp->m_errortag[error_tag] = tag_value;
@@ -319,7 +332,7 @@ xfs_errortag_add(
 {
 	BUILD_BUG_ON(ARRAY_SIZE(xfs_errortag_random_default) != XFS_ERRTAG_MAX);
 
-	if (error_tag >= XFS_ERRTAG_MAX)
+	if (!xfs_errortag_valid(error_tag))
 		return -EINVAL;
 
 	return xfs_errortag_set(mp, error_tag,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5053ffcf10fe..8e7e51c5f56d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1305,15 +1305,6 @@ xfs_buffered_write_iomap_end(
 	if (iomap->type != IOMAP_DELALLOC)
 		return 0;
 
-	/*
-	 * Behave as if the write failed if drop writes is enabled. Set the NEW
-	 * flag to force delalloc cleanup.
-	 */
-	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_DROP_WRITES)) {
-		iomap->flags |= IOMAP_F_NEW;
-		written = 0;
-	}
-
 	/* If we didn't reserve the blocks, we're not allowed to punch them. */
 	if (!(iomap->flags & IOMAP_F_NEW))
 		return 0;



* [PATCH 08/14] iomap: pass iter to ->iomap_begin implementations
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (6 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 07/14] xfs: drop write error injection is unfixable, remove it Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-15  8:36   ` Christoph Hellwig
  2022-11-09 18:16 ` [PATCH 09/14] iomap: pass iter to ->iomap_end implementations Darrick J. Wong
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Clean up the ->iomap_begin call sites by passing a pointer to the iter
structure into the iomap_begin functions.  This will be needed to clean
up the xfs race condition fixes in the next patch, and will hopefully
reduce register pressure as well.
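
As a hedged illustration of the conversion pattern (using a
hypothetical filesystem "foo", with foo_map_range standing in for the
per-filesystem mapping logic), each implementation now unpacks what
it needs from the iterator:

	static int foo_iomap_begin(const struct iomap_iter *iter,
			struct iomap *iomap, struct iomap *srcmap)
	{
		struct inode *inode = iter->inode;
		loff_t pos = iter->pos;
		loff_t length = iter->len;
		unsigned int flags = iter->flags;

		/* same mapping work as before, fewer arguments to thread */
		return foo_map_range(inode, pos, length, flags, iomap, srcmap);
	}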

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/btrfs/inode.c      |   10 ++++++----
 fs/erofs/data.c       |    8 ++++++--
 fs/erofs/zmap.c       |    6 ++++--
 fs/ext2/inode.c       |    8 ++++++--
 fs/ext4/extents.c     |    5 +++--
 fs/ext4/inode.c       |   32 +++++++++++++++++++++-----------
 fs/f2fs/data.c        |    9 ++++++---
 fs/fuse/dax.c         |    9 ++++++---
 fs/gfs2/bmap.c        |   18 ++++++++++++------
 fs/hpfs/file.c        |    8 ++++++--
 fs/iomap/iter.c       |    3 +--
 fs/xfs/xfs_iomap.c    |   48 ++++++++++++++++++++++++++----------------------
 fs/zonefs/super.c     |   24 +++++++++++++++++-------
 include/linux/iomap.h |    5 +++--
 14 files changed, 123 insertions(+), 70 deletions(-)


diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0e516aefbf51..d1030128769b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7517,11 +7517,13 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	return ret;
 }
 
-static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
-		loff_t length, unsigned int flags, struct iomap *iomap,
-		struct iomap *srcmap)
+static int btrfs_dio_iomap_begin(const struct iomap_iter *iter,
+		struct iomap *iomap, struct iomap *srcmap)
 {
-	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
+	struct inode *inode = iter->inode;
+	loff_t start = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em;
 	struct extent_state *cached_state = NULL;
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index fe8ac0e163f7..093ffbefc027 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -249,9 +249,13 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
 	return 0;
 }
 
-static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
+static int erofs_iomap_begin(const struct iomap_iter *iter,
+		struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
 	int ret;
 	struct erofs_map_blocks map;
 	struct erofs_map_dev mdev;
diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index 0bb66927e3d0..434087cbf467 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -767,10 +767,12 @@ int z_erofs_map_blocks_iter(struct inode *inode, struct erofs_map_blocks *map,
 	return err;
 }
 
-static int z_erofs_iomap_begin_report(struct inode *inode, loff_t offset,
-				loff_t length, unsigned int flags,
+static int z_erofs_iomap_begin_report(const struct iomap_iter *iter,
 				struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	loff_t length = iter->len;
 	int ret;
 	struct erofs_map_blocks map = { .m_la = offset };
 
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 918ab2f9e4c0..808a6c5a2db1 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -799,9 +799,13 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
 
 }
 
-static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
+static int ext2_iomap_begin(const struct iomap_iter *iter,
+		struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
 	unsigned int blkbits = inode->i_blkbits;
 	unsigned long first_block = offset >> blkbits;
 	unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f1956288307f..d462188cd2db 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4903,10 +4903,11 @@ static int ext4_iomap_xattr_fiemap(struct inode *inode, struct iomap *iomap)
 	return error;
 }
 
-static int ext4_iomap_xattr_begin(struct inode *inode, loff_t offset,
-				  loff_t length, unsigned flags,
+static int ext4_iomap_xattr_begin(const struct iomap_iter *iter,
 				  struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
 	int error;
 
 	error = ext4_iomap_xattr_fiemap(inode, iomap);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2b5ef1b64249..8d15b83caaca 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3428,10 +3428,12 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	return ret;
 }
 
-
-static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
+static int __ext4_iomap_begin(const struct iomap_iter *iter, unsigned int flags,
+		struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	loff_t length = iter->len;
 	int ret;
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
@@ -3481,18 +3483,23 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	return 0;
 }
 
-static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
-		loff_t length, unsigned flags, struct iomap *iomap,
-		struct iomap *srcmap)
+static int ext4_iomap_begin(const struct iomap_iter *iter,
+		struct iomap *iomap, struct iomap *srcmap)
 {
-	int ret;
+	return __ext4_iomap_begin(iter, iter->flags, iomap, srcmap);
+}
 
+static int ext4_iomap_overwrite_begin(const struct iomap_iter *iter,
+		struct iomap *iomap, struct iomap *srcmap)
+{
 	/*
 	 * Even for writes we don't need to allocate blocks, so just pretend
 	 * we are reading to save overhead of starting a transaction.
 	 */
-	flags &= ~IOMAP_WRITE;
-	ret = ext4_iomap_begin(inode, offset, length, flags, iomap, srcmap);
+	unsigned int flags = iter->flags & ~IOMAP_WRITE;
+	int ret;
+
+	ret = __ext4_iomap_begin(iter, flags, iomap, srcmap);
 	WARN_ON_ONCE(iomap->type != IOMAP_MAPPED);
 	return ret;
 }
@@ -3546,10 +3553,13 @@ static bool ext4_iomap_is_delalloc(struct inode *inode,
 	return true;
 }
 
-static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
-				   loff_t length, unsigned int flags,
+static int ext4_iomap_begin_report(const struct iomap_iter *iter,
 				   struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
 	int ret;
 	bool delalloc = false;
 	struct ext4_map_blocks map;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index a71e818cd67b..ea79fda6aaa9 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4105,10 +4105,13 @@ void f2fs_destroy_bio_entry_cache(void)
 	kmem_cache_destroy(bio_entry_slab);
 }
 
-static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-			    unsigned int flags, struct iomap *iomap,
-			    struct iomap *srcmap)
+static int f2fs_iomap_begin(const struct iomap_iter *iter,
+			    struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
 	struct f2fs_map_blocks map = {};
 	pgoff_t next_pgofs = 0;
 	int err;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index e23e802a8013..65bf4e35bac3 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -558,10 +558,13 @@ static int fuse_upgrade_dax_mapping(struct inode *inode, loff_t pos,
 /* This is just for DAX and the mapping is ephemeral, do not use it for other
  * purposes since there is no block device with a permanent mapping.
  */
-static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
-			    unsigned int flags, struct iomap *iomap,
-			    struct iomap *srcmap)
+static int fuse_iomap_begin(const struct iomap_iter *iter,
+			    struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t pos = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_dax_mapping *dmap;
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 3bdb2c668a71..1e438e0734d4 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -991,12 +991,15 @@ static const struct iomap_page_ops gfs2_iomap_page_ops = {
 	.page_done = gfs2_iomap_page_done,
 };
 
-static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
-				  loff_t length, unsigned flags,
+static int gfs2_iomap_begin_write(const struct iomap_iter *iter,
 				  struct iomap *iomap,
 				  struct metapath *mp)
 {
-	struct gfs2_inode *ip = GFS2_I(inode);
+	struct inode *inode = iter->inode;
+	loff_t pos = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
+	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	bool unstuff;
 	int ret;
@@ -1076,10 +1079,13 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	return ret;
 }
 
-static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
-			    unsigned flags, struct iomap *iomap,
+static int gfs2_iomap_begin(const struct iomap_iter *iter, struct iomap *iomap,
 			    struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t pos = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct metapath mp = { .mp_aheight = 1, };
 	int ret;
@@ -1112,7 +1118,7 @@ static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 		goto out_unlock;
 	}
 
-	ret = gfs2_iomap_begin_write(inode, pos, length, flags, iomap, &mp);
+	ret = gfs2_iomap_begin_write(iter, iomap, &mp);
 
 out_unlock:
 	release_metapath(&mp);
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index f7547a62c81f..43f727d650c6 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -117,9 +117,13 @@ static int hpfs_get_block(struct inode *inode, sector_t iblock, struct buffer_he
 	return r;
 }
 
-static int hpfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
+static int hpfs_iomap_begin(const struct iomap_iter *iter,
+		struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	loff_t length = iter->len;
+	unsigned int flags = iter->flags;
 	struct super_block *sb = inode->i_sb;
 	unsigned int blkbits = inode->i_blkbits;
 	unsigned int n_secs;
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 79a0614eaab7..2e84f8be6d8d 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -88,8 +88,7 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 	if (ret <= 0)
 		return ret;
 
-	ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
-			       &iter->iomap, &iter->srcmap);
+	ret = ops->iomap_begin(iter, &iter->iomap, &iter->srcmap);
 	if (ret < 0)
 		return ret;
 	iomap_iter_done(iter);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8e7e51c5f56d..ca88facfd61e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -734,13 +734,14 @@ imap_spans_range(
 
 static int
 xfs_direct_write_iomap_begin(
-	struct inode		*inode,
-	loff_t			offset,
-	loff_t			length,
-	unsigned		flags,
+	const struct iomap_iter	*iter,
 	struct iomap		*iomap,
 	struct iomap		*srcmap)
 {
+	struct inode		*inode = iter->inode;
+	loff_t			offset = iter->pos;
+	loff_t			length = iter->len;
+	unsigned int		flags = iter->flags;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_bmbt_irec	imap, cmap;
@@ -907,13 +908,14 @@ const struct iomap_ops xfs_dax_write_iomap_ops = {
 
 static int
 xfs_buffered_write_iomap_begin(
-	struct inode		*inode,
-	loff_t			offset,
-	loff_t			count,
-	unsigned		flags,
+	const struct iomap_iter *iter,
 	struct iomap		*iomap,
 	struct iomap		*srcmap)
 {
+	struct inode		*inode = iter->inode;
+	loff_t			offset = iter->pos;
+	loff_t			count = iter->len;
+	unsigned int		flags = iter->flags;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
@@ -932,8 +934,7 @@ xfs_buffered_write_iomap_begin(
 
 	/* we can't use delayed allocations when using extent size hints */
 	if (xfs_get_extsz_hint(ip))
-		return xfs_direct_write_iomap_begin(inode, offset, count,
-				flags, iomap, srcmap);
+		return xfs_direct_write_iomap_begin(iter, iomap, srcmap);
 
 	ASSERT(!XFS_IS_REALTIME_INODE(ip));
 
@@ -1366,13 +1367,14 @@ const struct iomap_ops xfs_page_mkwrite_iomap_ops = {
 
 static int
 xfs_read_iomap_begin(
-	struct inode		*inode,
-	loff_t			offset,
-	loff_t			length,
-	unsigned		flags,
+	const struct iomap_iter	*iter,
 	struct iomap		*iomap,
 	struct iomap		*srcmap)
 {
+	struct inode		*inode = iter->inode;
+	loff_t			offset = iter->pos;
+	loff_t			length = iter->len;
+	unsigned int		flags = iter->flags;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_bmbt_irec	imap;
@@ -1411,13 +1413,14 @@ const struct iomap_ops xfs_read_iomap_ops = {
 
 static int
 xfs_seek_iomap_begin(
-	struct inode		*inode,
-	loff_t			offset,
-	loff_t			length,
-	unsigned		flags,
+	const struct iomap_iter	*iter,
 	struct iomap		*iomap,
 	struct iomap		*srcmap)
 {
+	struct inode		*inode = iter->inode;
+	loff_t			offset = iter->pos;
+	loff_t			length = iter->len;
+	unsigned int		flags = iter->flags;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
@@ -1497,13 +1500,14 @@ const struct iomap_ops xfs_seek_iomap_ops = {
 
 static int
 xfs_xattr_iomap_begin(
-	struct inode		*inode,
-	loff_t			offset,
-	loff_t			length,
-	unsigned		flags,
+	const struct iomap_iter	*iter,
 	struct iomap		*iomap,
 	struct iomap		*srcmap)
 {
+	struct inode		*inode = iter->inode;
+	loff_t			offset = iter->pos;
+	loff_t			length = iter->len;
+	unsigned int		flags = iter->flags;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 860f0b1032c6..9a8e261ece8b 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -109,10 +109,12 @@ static inline void zonefs_i_size_write(struct inode *inode, loff_t isize)
 	}
 }
 
-static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
-				   loff_t length, unsigned int flags,
+static int zonefs_read_iomap_begin(const struct iomap_iter *iter,
 				   struct iomap *iomap, struct iomap *srcmap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	loff_t length = iter->len;
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
 	loff_t isize;
@@ -145,9 +147,9 @@ static const struct iomap_ops zonefs_read_iomap_ops = {
 	.iomap_begin	= zonefs_read_iomap_begin,
 };
 
-static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
-				    loff_t length, unsigned int flags,
-				    struct iomap *iomap, struct iomap *srcmap)
+static int __zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
+		loff_t length, unsigned int flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
@@ -190,6 +192,13 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
 	return 0;
 }
 
+static int zonefs_write_iomap_begin(const struct iomap_iter *iter,
+				    struct iomap *iomap, struct iomap *srcmap)
+{
+	return __zonefs_write_iomap_begin(iter->inode, iter->pos, iter->len,
+			iter->flags, iomap, srcmap);
+}
+
 static const struct iomap_ops zonefs_write_iomap_ops = {
 	.iomap_begin	= zonefs_write_iomap_begin,
 };
@@ -223,8 +232,9 @@ static int zonefs_write_map_blocks(struct iomap_writepage_ctx *wpc,
 	    offset < wpc->iomap.offset + wpc->iomap.length)
 		return 0;
 
-	return zonefs_write_iomap_begin(inode, offset, zi->i_max_size - offset,
-					IOMAP_WRITE, &wpc->iomap, NULL);
+	return __zonefs_write_iomap_begin(inode, offset,
+			zi->i_max_size - offset, IOMAP_WRITE, &wpc->iomap,
+			NULL);
 }
 
 static const struct iomap_writeback_ops zonefs_writeback_ops = {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 308931f0840a..811ea61ba577 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -152,14 +152,15 @@ struct iomap_page_ops {
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
 
+struct iomap_iter;
+
 struct iomap_ops {
 	/*
 	 * Return the existing mapping at pos, or reserve space starting at
 	 * pos for up to length, as long as we can do it as a single mapping.
 	 * The actual length is returned in iomap->length.
 	 */
-	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
-			unsigned flags, struct iomap *iomap,
+	int (*iomap_begin)(const struct iomap_iter *iter, struct iomap *iomap,
 			struct iomap *srcmap);
 
 	/*



* [PATCH 09/14] iomap: pass iter to ->iomap_end implementations
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (7 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 08/14] iomap: pass iter to ->iomap_begin implementations Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-09 18:16 ` [PATCH 10/14] iomap: pass a private pointer to iomap_file_buffered_write Darrick J. Wong
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Clean up the ->iomap_end call sites by passing a pointer to the iter
structure into the iomap_end functions.  This isn't strictly needed,
but it cleans up the callsites neatly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/btrfs/inode.c      |    8 +++++---
 fs/erofs/data.c       |    4 ++--
 fs/ext2/inode.c       |    8 ++++++--
 fs/ext4/inode.c       |    6 +++---
 fs/fuse/dax.c         |    5 ++---
 fs/gfs2/bmap.c        |    7 +++++--
 fs/iomap/iter.c       |    5 ++---
 fs/xfs/xfs_iomap.c    |   28 ++++++++++++++--------------
 include/linux/iomap.h |    4 ++--
 9 files changed, 41 insertions(+), 34 deletions(-)


diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d1030128769b..50afc8a3a5da 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7770,10 +7770,12 @@ static int btrfs_dio_iomap_begin(const struct iomap_iter *iter,
 	return ret;
 }
 
-static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
-		ssize_t written, unsigned int flags, struct iomap *iomap)
+static int btrfs_dio_iomap_end(const struct iomap_iter *iter, u64 length,
+		ssize_t written, struct iomap *iomap)
 {
-	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
+	struct inode *inode = iter->inode;
+	loff_t pos = iter->pos;
+	unsigned int flags = iter->flags;
 	struct btrfs_dio_data *dio_data = iter->private;
 	size_t submitted = dio_data->submitted;
 	const bool write = !!(flags & IOMAP_WRITE);
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 093ffbefc027..f3f42ec39c8d 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -312,8 +312,8 @@ static int erofs_iomap_begin(const struct iomap_iter *iter,
 	return 0;
 }
 
-static int erofs_iomap_end(struct inode *inode, loff_t pos, loff_t length,
-		ssize_t written, unsigned int flags, struct iomap *iomap)
+static int erofs_iomap_end(const struct iomap_iter *iter, u64 length,
+		ssize_t written, struct iomap *iomap)
 {
 	void *ptr = iomap->private;
 
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 808a6c5a2db1..f28c47e519db 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -845,9 +845,13 @@ static int ext2_iomap_begin(const struct iomap_iter *iter,
 }
 
 static int
-ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
-		ssize_t written, unsigned flags, struct iomap *iomap)
+ext2_iomap_end(const struct iomap_iter *iter, u64 length, ssize_t written,
+		struct iomap *iomap)
 {
+	struct inode *inode = iter->inode;
+	loff_t offset = iter->pos;
+	unsigned int flags = iter->flags;
+
 	if (iomap->type == IOMAP_MAPPED &&
 	    written < length &&
 	    (flags & IOMAP_WRITE))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8d15b83caaca..7d1f512e5187 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3504,8 +3504,8 @@ static int ext4_iomap_overwrite_begin(const struct iomap_iter *iter,
 	return ret;
 }
 
-static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
-			  ssize_t written, unsigned flags, struct iomap *iomap)
+static int ext4_iomap_end(const struct iomap_iter *iter, u64 length,
+			  ssize_t written, struct iomap *iomap)
 {
 	/*
 	 * Check to see whether an error occurred while writing out the data to
@@ -3514,7 +3514,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 	 * the I/O. Any blocks that may have been allocated in preparation for
 	 * the direct I/O will be reused during buffered I/O.
 	 */
-	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+	if (iter->flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
 		return -ENOTBLK;
 
 	return 0;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 65bf4e35bac3..7fea437246aa 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -635,9 +635,8 @@ static int fuse_iomap_begin(const struct iomap_iter *iter,
 	return 0;
 }
 
-static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
-			  ssize_t written, unsigned int flags,
-			  struct iomap *iomap)
+static int fuse_iomap_end(const struct iomap_iter *iter, u64 length,
+			  ssize_t written, struct iomap *iomap)
 {
 	struct fuse_dax_mapping *dmap = iomap->private;
 
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 1e438e0734d4..f215c0735fa6 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1126,9 +1126,12 @@ static int gfs2_iomap_begin(const struct iomap_iter *iter, struct iomap *iomap,
 	return ret;
 }
 
-static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
-			  ssize_t written, unsigned flags, struct iomap *iomap)
+static int gfs2_iomap_end(const struct iomap_iter *iter, u64 length,
+			  ssize_t written, struct iomap *iomap)
 {
+	struct inode *inode = iter->inode;
+	loff_t pos = iter->pos;
+	unsigned int flags = iter->flags;
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 2e84f8be6d8d..494e905844cf 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -76,9 +76,8 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 	int ret;
 
 	if (iter->iomap.length && ops->iomap_end) {
-		ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
-				iter->processed > 0 ? iter->processed : 0,
-				iter->flags, &iter->iomap);
+		ret = ops->iomap_end(iter, iomap_length(iter),
+				max_t(s64, iter->processed, 0), &iter->iomap);
 		if (ret < 0 && !iter->processed)
 			return ret;
 	}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ca88facfd61e..668f66ca84e4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -881,20 +881,19 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 
 static int
 xfs_dax_write_iomap_end(
-	struct inode		*inode,
-	loff_t			pos,
-	loff_t			length,
+	const struct iomap_iter	*iter,
+	u64			mapped_length,
 	ssize_t			written,
-	unsigned		flags,
 	struct iomap		*iomap)
 {
-	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_inode	*ip = XFS_I(iter->inode);
+	loff_t			pos = iter->pos;
 
 	if (!xfs_is_cow_inode(ip))
 		return 0;
 
 	if (!written) {
-		xfs_reflink_cancel_cow_range(ip, pos, length, true);
+		xfs_reflink_cancel_cow_range(ip, pos, mapped_length, true);
 		return 0;
 	}
 
@@ -1291,14 +1290,14 @@ xfs_buffered_write_delalloc_release(
 
 static int
 xfs_buffered_write_iomap_end(
-	struct inode		*inode,
-	loff_t			offset,
-	loff_t			length,
+	const struct iomap_iter	*iter,
+	u64			mapped_length,
 	ssize_t			written,
-	unsigned		flags,
 	struct iomap		*iomap)
 {
-	struct xfs_mount	*mp = XFS_M(inode->i_sb);
+	struct xfs_inode	*ip = XFS_I(iter->inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	loff_t			offset = iter->pos;
 	loff_t			start_byte;
 	loff_t			end_byte;
 	int			error = 0;
@@ -1319,16 +1318,17 @@ xfs_buffered_write_iomap_end(
 		start_byte = round_down(offset, mp->m_sb.sb_blocksize);
 	else
 		start_byte = round_up(offset + written, mp->m_sb.sb_blocksize);
-	end_byte = round_up(offset + length, mp->m_sb.sb_blocksize);
+	end_byte = round_up(offset + mapped_length, mp->m_sb.sb_blocksize);
 
 	/* Nothing to do if we've written the entire delalloc extent */
 	if (start_byte >= end_byte)
 		return 0;
 
-	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
+	error = xfs_buffered_write_delalloc_release(VFS_I(ip), start_byte,
+			end_byte);
 	if (error && !xfs_is_shutdown(mp)) {
 		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
-			__func__, XFS_I(inode)->i_ino);
+			__func__, ip->i_ino);
 		return error;
 	}
 	return 0;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 811ea61ba577..7485a5a3af17 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -169,8 +169,8 @@ struct iomap_ops {
 	 * needs to be commited, while the rest needs to be unreserved.
 	 * Written might be zero if no data was written.
 	 */
-	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
-			ssize_t written, unsigned flags, struct iomap *iomap);
+	int (*iomap_end)(const struct iomap_iter *iter, u64 mapped_length,
+			ssize_t written, struct iomap *iomap);
 
 	/*
 	 * Check that the cached iomap still maps correctly to the filesystem's



* [PATCH 10/14] iomap: pass a private pointer to iomap_file_buffered_write
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (8 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 09/14] iomap: pass iter to ->iomap_end implementations Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-15  8:37   ` Christoph Hellwig
  2022-11-09 18:16 ` [PATCH 11/14] xfs: move the seq counters for buffered writes to a private struct Darrick J. Wong
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Allow filesystems to pass a filesystem-private pointer into iomap for
writes into the pagecache.  The pointer is made accessible to
->iomap_begin implementations, which is key to being able to
revalidate mappings after taking folio locks.
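
As a hedged usage sketch (hypothetical filesystem "foo";
foo_sample_seq and foo_fill_iomap are stand-ins for per-filesystem
logic), the write path supplies a context and the begin callback
reads it back through iter->private:

	struct foo_write_ctx {
		unsigned int	data_seq;	/* sampled validity cookie */
	};

	static int foo_iomap_begin(const struct iomap_iter *iter,
			struct iomap *iomap, struct iomap *srcmap)
	{
		struct foo_write_ctx *ctx = iter->private;

		/* sample whatever ->iomap_valid will later compare ... */
		ctx->data_seq = foo_sample_seq(iter->inode);
		/* ... then fill *iomap for iter->pos as usual */
		return foo_fill_iomap(iter, iomap, srcmap);
	}

	static const struct iomap_ops foo_iomap_ops = {
		.iomap_begin	= foo_iomap_begin,
	};

	static ssize_t foo_buffered_write(struct kiocb *iocb,
			struct iov_iter *from)
	{
		struct foo_write_ctx ctx = { };

		return iomap_file_buffered_write(iocb, from, &foo_iomap_ops,
				&ctx);
	}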

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/gfs2/bmap.c         |    3 ++-
 fs/gfs2/file.c         |    2 +-
 fs/iomap/buffered-io.c |   14 +++++++++-----
 fs/xfs/xfs_file.c      |    2 +-
 fs/xfs/xfs_iomap.c     |    4 ++--
 fs/xfs/xfs_reflink.c   |    2 +-
 fs/zonefs/super.c      |    3 ++-
 include/linux/iomap.h  |    8 ++++----
 8 files changed, 22 insertions(+), 16 deletions(-)


diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index f215c0735fa6..da5ccd25080b 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1295,7 +1295,8 @@ static int gfs2_block_zero_range(struct inode *inode, loff_t from,
 				 unsigned int length)
 {
 	BUG_ON(current->journal_info);
-	return iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops);
+	return iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops,
+			NULL);
 }
 
 #define GFS2_JTRUNC_REVOKES 8192
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 60c6fb91fb58..1e7dc0abe119 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1042,7 +1042,7 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
 
 	current->backing_dev_info = inode_to_bdi(inode);
 	pagefault_disable();
-	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops, NULL);
 	pagefault_enable();
 	current->backing_dev_info = NULL;
 	if (ret > 0) {
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d3c565aa29f8..779244960153 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -836,13 +836,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
 
 ssize_t
 iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
-		const struct iomap_ops *ops)
+		const struct iomap_ops *ops, void *private)
 {
 	struct iomap_iter iter = {
 		.inode		= iocb->ki_filp->f_mapping->host,
 		.pos		= iocb->ki_pos,
 		.len		= iov_iter_count(i),
 		.flags		= IOMAP_WRITE,
+		.private	= private,
 	};
 	int ret;
 
@@ -903,13 +904,14 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter,
 
 int
 iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
-		const struct iomap_ops *ops)
+		const struct iomap_ops *ops, void *private)
 {
 	struct iomap_iter iter = {
 		.inode		= inode,
 		.pos		= pos,
 		.len		= len,
 		.flags		= IOMAP_WRITE | IOMAP_UNSHARE,
+		.private	= private,
 	};
 	int ret;
 
@@ -966,13 +968,14 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter,
 
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-		const struct iomap_ops *ops)
+		const struct iomap_ops *ops, void *private)
 {
 	struct iomap_iter iter = {
 		.inode		= inode,
 		.pos		= pos,
 		.len		= len,
 		.flags		= IOMAP_ZERO,
+		.private	= private,
 	};
 	int ret;
 
@@ -984,7 +987,7 @@ EXPORT_SYMBOL_GPL(iomap_zero_range);
 
 int
 iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops)
+		const struct iomap_ops *ops, void *private)
 {
 	unsigned int blocksize = i_blocksize(inode);
 	unsigned int off = pos & (blocksize - 1);
@@ -992,7 +995,8 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 	/* Block boundary? Nothing to do */
 	if (!off)
 		return 0;
-	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops,
+			private);
 }
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 595a5bcf46b9..f3671e22ba16 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -722,7 +722,7 @@ xfs_file_buffered_write(
 
 	trace_xfs_file_buffered_write(iocb, from);
 	ret = iomap_file_buffered_write(iocb, from,
-			&xfs_buffered_write_iomap_ops);
+			&xfs_buffered_write_iomap_ops, NULL);
 	if (likely(ret >= 0))
 		iocb->ki_pos += ret;
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 668f66ca84e4..00a4b60c97e9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1559,7 +1559,7 @@ xfs_zero_range(
 		return dax_zero_range(inode, pos, len, did_zero,
 				      &xfs_direct_write_iomap_ops);
 	return iomap_zero_range(inode, pos, len, did_zero,
-				&xfs_buffered_write_iomap_ops);
+				&xfs_buffered_write_iomap_ops, NULL);
 }
 
 int
@@ -1574,5 +1574,5 @@ xfs_truncate_page(
 		return dax_truncate_page(inode, pos, did_zero,
 					&xfs_direct_write_iomap_ops);
 	return iomap_truncate_page(inode, pos, did_zero,
-				   &xfs_buffered_write_iomap_ops);
+				   &xfs_buffered_write_iomap_ops, NULL);
 }
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 93bdd25680bc..31b7b6e5db45 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1694,7 +1694,7 @@ xfs_reflink_unshare(
 	inode_dio_wait(inode);
 
 	error = iomap_file_unshare(inode, offset, len,
-			&xfs_buffered_write_iomap_ops);
+			&xfs_buffered_write_iomap_ops, NULL);
 	if (error)
 		goto out;
 
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 9a8e261ece8b..f4a9f21545bd 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -983,7 +983,8 @@ static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
 	if (ret <= 0)
 		goto inode_unlock;
 
-	ret = iomap_file_buffered_write(iocb, from, &zonefs_write_iomap_ops);
+	ret = iomap_file_buffered_write(iocb, from, &zonefs_write_iomap_ops,
+			NULL);
 	if (ret > 0)
 		iocb->ki_pos += ret;
 	else if (ret == -EIO)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7485a5a3af17..152353164a5a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -243,18 +243,18 @@ static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i)
 }
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
-		const struct iomap_ops *ops);
+		const struct iomap_ops *ops, void *private);
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
-		const struct iomap_ops *ops);
+		const struct iomap_ops *ops, void *private);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-		bool *did_zero, const struct iomap_ops *ops);
+		bool *did_zero, const struct iomap_ops *ops, void *private);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
+		const struct iomap_ops *ops, void *private);
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
 			const struct iomap_ops *ops);
 int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,



* [PATCH 11/14] xfs: move the seq counters for buffered writes to a private struct
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (9 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 10/14] iomap: pass a private pointer to iomap_file_buffered_write Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-15  8:38   ` Christoph Hellwig
  2022-11-09 18:16 ` [PATCH 12/14] xfs: validate COW fork sequence counters during buffered writes Darrick J. Wong
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

To avoid racing with writeback, XFS needs to revalidate (io)mappings
after grabbing each folio lock.  Right now, XFS stashes the sequence
counter values in the iomap's private field (a void * pointer), which
is not the greatest solution.  First, we don't record which fork was
sampled, which means that the current code, which samples either fork
but always compares against the data fork seqcount, is wrong.  Second,
if another thread touches the cow fork, we (conservatively) want to
revalidate because *something* has changed.

The previous three patches reorganized the iomap callbacks to pass the
iomap_iter to the ->iomap_{begin,end} functions and provided a way to
set the iter->private field for a buffered write.  Now we can update the
buffered write paths in XFS to allocate the necessary memory to sample
both forks' sequence counters and revalidate the data fork during a
write operation.
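
For reference, a hedged sketch of the check this structure enables.
The data fork comparison matches xfs_buffered_write_iomap_valid
below; folding the sampled cow fork seqcount into the comparison is
where the next patch takes this:

	static bool
	xfs_buffered_write_ctx_valid(		/* illustrative name */
		const struct iomap_iter		*iter)
	{
		struct xfs_iomap_buffered_ctx	*ibc = iter->private;
		struct xfs_inode		*ip = XFS_I(iter->inode);

		/* data fork changed underneath us, so remap */
		if (ibc->data_seq != READ_ONCE(ip->i_df.if_seq))
			return false;
		/* cow fork changed, conservatively remap too */
		if (ibc->has_cow_seq &&
		    ibc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
			return false;
		return true;
	}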

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c   |    2 -
 fs/xfs/libxfs/xfs_bmap.c |    4 +-
 fs/xfs/xfs_aops.c        |    2 -
 fs/xfs/xfs_file.c        |    3 +
 fs/xfs/xfs_iomap.c       |  115 ++++++++++++++++++++++++++++------------------
 fs/xfs/xfs_iomap.h       |   10 +++-
 fs/xfs/xfs_pnfs.c        |    5 +-
 fs/xfs/xfs_reflink.c     |    3 +
 include/linux/iomap.h    |    2 -
 9 files changed, 90 insertions(+), 56 deletions(-)


diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 779244960153..130890907615 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -631,7 +631,7 @@ static int iomap_write_begin(struct iomap_iter *iter,
 	 * to zero) and corrupt data.
 	 */
 	if (ops->iomap_valid) {
-		bool iomap_valid = ops->iomap_valid(iter->inode, &iter->iomap);
+		bool iomap_valid = ops->iomap_valid(iter);
 
 		if (!iomap_valid) {
 			iter->iomap.flags |= IOMAP_F_STALE;
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index db225130618c..bdaa55e725a8 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4552,7 +4552,7 @@ xfs_bmapi_convert_delalloc(
 	 */
 	if (!isnullstartblock(bma.got.br_startblock)) {
 		*seq = READ_ONCE(ifp->if_seq);
-		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
+		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
 		goto out_trans_cancel;
 	}
 
@@ -4600,7 +4600,7 @@ xfs_bmapi_convert_delalloc(
 
 	ASSERT(!isnullstartblock(bma.got.br_startblock));
 	*seq = READ_ONCE(ifp->if_seq);
-	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
+	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
 
 	if (whichfork == XFS_COW_FORK)
 		xfs_refcount_alloc_cow_extent(tp, bma.blkno, bma.length);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index ca5a9e45a48c..5d1a995b15f8 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -373,7 +373,7 @@ xfs_map_blocks(
 	    isnullstartblock(imap.br_startblock))
 		goto allocate_blocks;
 
-	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0, XFS_WPC(wpc)->data_seq);
+	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0);
 	trace_xfs_map_blocks_found(ip, offset, count, whichfork, &imap);
 	return 0;
 allocate_blocks:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f3671e22ba16..64a1e8982467 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -701,6 +701,7 @@ xfs_file_buffered_write(
 	struct kiocb		*iocb,
 	struct iov_iter		*from)
 {
+	struct xfs_iomap_buffered_ctx ibc = { };
 	struct inode		*inode = iocb->ki_filp->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret;
@@ -722,7 +723,7 @@ xfs_file_buffered_write(
 
 	trace_xfs_file_buffered_write(iocb, from);
 	ret = iomap_file_buffered_write(iocb, from,
-			&xfs_buffered_write_iomap_ops, NULL);
+			&xfs_buffered_write_iomap_ops, &ibc);
 	if (likely(ret >= 0))
 		iocb->ki_pos += ret;
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 00a4b60c97e9..c3c23524a3d2 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -48,14 +48,30 @@ xfs_alert_fsblock_zero(
 	return -EFSCORRUPTED;
 }
 
+static inline void
+xfs_iomap_sample_seq(
+	const struct iomap_iter		*iter,
+	struct xfs_inode		*ip)
+{
+	struct xfs_iomap_buffered_ctx	*ibc = iter->private;
+
+	if (!ibc)
+		return;
+
+	/* The extent tree sequence is needed for iomap validity checking. */
+	ibc->data_seq = READ_ONCE(ip->i_df.if_seq);
+	ibc->has_cow_seq = xfs_inode_has_cow_data(ip);
+	if (ibc->has_cow_seq)
+		ibc->cow_seq = READ_ONCE(ip->i_cowfp->if_seq);
+}
+
 int
 xfs_bmbt_to_iomap(
 	struct xfs_inode	*ip,
 	struct iomap		*iomap,
 	struct xfs_bmbt_irec	*imap,
 	unsigned int		mapping_flags,
-	u16			iomap_flags,
-	int			sequence)
+	u16			iomap_flags)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
@@ -92,9 +108,6 @@ xfs_bmbt_to_iomap(
 	if (xfs_ipincount(ip) &&
 	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
 		iomap->flags |= IOMAP_F_DIRTY;
-
-	/* The extent tree sequence is needed for iomap validity checking. */
-	*((int *)&iomap->private) = sequence;
 	return 0;
 }
 
@@ -193,14 +206,14 @@ xfs_iomap_eof_align_last_fsb(
 	return end_fsb;
 }
 
-int
-xfs_iomap_write_direct(
+static int
+__xfs_iomap_write_direct(
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		offset_fsb,
 	xfs_fileoff_t		count_fsb,
 	unsigned int		flags,
 	struct xfs_bmbt_irec	*imap,
-	int			*seq)
+	const struct iomap_iter	*iter)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -290,8 +303,8 @@ xfs_iomap_write_direct(
 		error = xfs_alert_fsblock_zero(ip, imap);
 
 out_unlock:
-	if (seq)
-		*seq = READ_ONCE(ip->i_df.if_seq);
+	if (iter)
+		xfs_iomap_sample_seq(iter, ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 
@@ -300,6 +313,18 @@ xfs_iomap_write_direct(
 	goto out_unlock;
 }
 
+int
+xfs_iomap_write_direct(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		offset_fsb,
+	xfs_fileoff_t		count_fsb,
+	unsigned int		flags,
+	struct xfs_bmbt_irec	*imap)
+{
+	return __xfs_iomap_write_direct(ip, offset_fsb, count_fsb, flags, imap,
+			NULL);
+}
+
 STATIC bool
 xfs_quota_need_throttle(
 	struct xfs_inode	*ip,
@@ -751,7 +776,6 @@ xfs_direct_write_iomap_begin(
 	bool			shared = false;
 	u16			iomap_flags = 0;
 	unsigned int		lockmode = XFS_ILOCK_SHARED;
-	int			seq;
 
 	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
 
@@ -820,10 +844,10 @@ xfs_direct_write_iomap_begin(
 			goto out_unlock;
 	}
 
-	seq = READ_ONCE(ip->i_df.if_seq);
+	xfs_iomap_sample_seq(iter, ip);
 	xfs_iunlock(ip, lockmode);
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags);
 
 allocate_blocks:
 	error = -EAGAIN;
@@ -848,26 +872,26 @@ xfs_direct_write_iomap_begin(
 		end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
 	xfs_iunlock(ip, lockmode);
 
-	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
-			flags, &imap, &seq);
+	error = __xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
+			flags, &imap, iter);
 	if (error)
 		return error;
 
 	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
-				 iomap_flags | IOMAP_F_NEW, seq);
+				 iomap_flags | IOMAP_F_NEW);
 
 out_found_cow:
-	seq = READ_ONCE(ip->i_df.if_seq);
+	xfs_iomap_sample_seq(iter, ip);
 	xfs_iunlock(ip, lockmode);
 	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
 	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
 	if (imap.br_startblock != HOLESTARTBLOCK) {
-		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
+		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
 		if (error)
 			return error;
 	}
-	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED);
 
 out_unlock:
 	if (lockmode)
@@ -926,7 +950,6 @@ xfs_buffered_write_iomap_begin(
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
 	unsigned int		lockmode = XFS_ILOCK_EXCL;
-	int			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -1105,29 +1128,29 @@ xfs_buffered_write_iomap_begin(
 	 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
 	 * them out if the write happens to fail.
 	 */
-	seq = READ_ONCE(ip->i_df.if_seq);
+	xfs_iomap_sample_seq(iter, ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW);
 
 found_imap:
-	seq = READ_ONCE(ip->i_df.if_seq);
+	xfs_iomap_sample_seq(iter, ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
 
 found_cow:
-	seq = READ_ONCE(ip->i_df.if_seq);
+	xfs_iomap_sample_seq(iter, ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	if (imap.br_startoff <= offset_fsb) {
-		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
+		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
 		if (error)
 			return error;
 		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
-					 IOMAP_F_SHARED, seq);
+					 IOMAP_F_SHARED);
 	}
 
 	xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
-	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0);
 
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1340,12 +1363,12 @@ xfs_buffered_write_iomap_end(
  */
 static bool
 xfs_buffered_write_iomap_valid(
-	struct inode		*inode,
-	const struct iomap	*iomap)
+	const struct iomap_iter		*iter)
 {
-	int			seq = *((int *)&iomap->private);
+	struct xfs_iomap_buffered_ctx	*ibc = iter->private;
+	struct xfs_inode		*ip = XFS_I(iter->inode);
 
-	if (seq != READ_ONCE(XFS_I(inode)->i_df.if_seq))
+	if (ibc->data_seq != READ_ONCE(ip->i_df.if_seq))
 		return false;
 	return true;
 }
@@ -1383,7 +1406,6 @@ xfs_read_iomap_begin(
 	int			nimaps = 1, error = 0;
 	bool			shared = false;
 	unsigned int		lockmode = XFS_ILOCK_SHARED;
-	int			seq;
 
 	ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));
 
@@ -1397,14 +1419,15 @@ xfs_read_iomap_begin(
 			       &nimaps, 0);
 	if (!error && (flags & IOMAP_REPORT))
 		error = xfs_reflink_trim_around_shared(ip, &imap, &shared);
-	seq = READ_ONCE(ip->i_df.if_seq);
+	if (!error)
+		xfs_iomap_sample_seq(iter, ip);
 	xfs_iunlock(ip, lockmode);
 
 	if (error)
 		return error;
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
-				 shared ? IOMAP_F_SHARED : 0, seq);
+				 shared ? IOMAP_F_SHARED : 0);
 }
 
 const struct iomap_ops xfs_read_iomap_ops = {
@@ -1463,9 +1486,13 @@ xfs_seek_iomap_begin(
 	if (cow_fsb != NULLFILEOFF && cow_fsb <= offset_fsb) {
 		if (data_fsb < cow_fsb + cmap.br_blockcount)
 			end_fsb = min(end_fsb, data_fsb);
+		xfs_iomap_sample_seq(iter, ip);
 		xfs_trim_extent(&cmap, offset_fsb, end_fsb);
 		error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
-				IOMAP_F_SHARED, READ_ONCE(ip->i_cowfp->if_seq));
+				IOMAP_F_SHARED);
+		if (error)
+			goto out_unlock;
+
 		/*
 		 * This is a COW extent, so we must probe the page cache
 		 * because there could be dirty page cache being backed
@@ -1487,8 +1514,8 @@ xfs_seek_iomap_begin(
 	imap.br_state = XFS_EXT_NORM;
 done:
 	xfs_trim_extent(&imap, offset_fsb, end_fsb);
-	error = xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0,
-			READ_ONCE(ip->i_df.if_seq));
+	xfs_iomap_sample_seq(iter, ip);
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
 out_unlock:
 	xfs_iunlock(ip, lockmode);
 	return error;
@@ -1515,7 +1542,6 @@ xfs_xattr_iomap_begin(
 	struct xfs_bmbt_irec	imap;
 	int			nimaps = 1, error = 0;
 	unsigned		lockmode;
-	int			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -1532,14 +1558,13 @@ xfs_xattr_iomap_begin(
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
 			       &nimaps, XFS_BMAPI_ATTRFORK);
 out_unlock:
-
-	seq = READ_ONCE(ip->i_af.if_seq);
+	xfs_iomap_sample_seq(iter, ip);
 	xfs_iunlock(ip, lockmode);
 
 	if (error)
 		return error;
 	ASSERT(nimaps);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
 }
 
 const struct iomap_ops xfs_xattr_iomap_ops = {
@@ -1553,13 +1578,14 @@ xfs_zero_range(
 	loff_t			len,
 	bool			*did_zero)
 {
+	struct xfs_iomap_buffered_ctx ibc = { };
 	struct inode		*inode = VFS_I(ip);
 
 	if (IS_DAX(inode))
 		return dax_zero_range(inode, pos, len, did_zero,
 				      &xfs_direct_write_iomap_ops);
 	return iomap_zero_range(inode, pos, len, did_zero,
-				&xfs_buffered_write_iomap_ops, NULL);
+				&xfs_buffered_write_iomap_ops, &ibc);
 }
 
 int
@@ -1568,11 +1594,12 @@ xfs_truncate_page(
 	loff_t			pos,
 	bool			*did_zero)
 {
+	struct xfs_iomap_buffered_ctx ibc = { };
 	struct inode		*inode = VFS_I(ip);
 
 	if (IS_DAX(inode))
 		return dax_truncate_page(inode, pos, did_zero,
 					&xfs_direct_write_iomap_ops);
 	return iomap_truncate_page(inode, pos, did_zero,
-				   &xfs_buffered_write_iomap_ops, NULL);
+				   &xfs_buffered_write_iomap_ops, &ibc);
 }
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 792fed2a9072..369de43ae568 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -11,16 +11,22 @@
 struct xfs_inode;
 struct xfs_bmbt_irec;
 
+struct xfs_iomap_buffered_ctx {
+	unsigned int	data_seq;
+	unsigned int	cow_seq;
+	bool		has_cow_seq;
+};
+
 int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb,
 		xfs_fileoff_t count_fsb, unsigned int flags,
-		struct xfs_bmbt_irec *imap, int *sequence);
+		struct xfs_bmbt_irec *imap);
 int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t, bool);
 xfs_fileoff_t xfs_iomap_eof_align_last_fsb(struct xfs_inode *ip,
 		xfs_fileoff_t end_fsb);
 
 int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
 		struct xfs_bmbt_irec *imap, unsigned int mapping_flags,
-		u16 iomap_flags, int sequence);
+		u16 iomap_flags);
 
 int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
 		bool *did_zero);
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index eea507a80c5c..37a24f0f7cd4 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -125,7 +125,6 @@ xfs_fs_map_blocks(
 	int			nimaps = 1;
 	uint			lock_flags;
 	int			error = 0;
-	int			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -190,7 +189,7 @@ xfs_fs_map_blocks(
 		xfs_iunlock(ip, lock_flags);
 
 		error = xfs_iomap_write_direct(ip, offset_fsb,
-				end_fsb - offset_fsb, 0, &imap, &seq);
+				end_fsb - offset_fsb, 0, &imap);
 		if (error)
 			goto out_unlock;
 
@@ -210,7 +209,7 @@ xfs_fs_map_blocks(
 	}
 	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 
-	error = xfs_bmbt_to_iomap(ip, iomap, &imap, 0, 0, seq);
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, 0, 0);
 	*device_generation = mp->m_generation;
 	return error;
 out_unlock:
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 31b7b6e5db45..33b4853a31d5 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1683,6 +1683,7 @@ xfs_reflink_unshare(
 	xfs_off_t		offset,
 	xfs_off_t		len)
 {
+	struct xfs_iomap_buffered_ctx ibc = { };
 	struct inode		*inode = VFS_I(ip);
 	int			error;
 
@@ -1694,7 +1695,7 @@ xfs_reflink_unshare(
 	inode_dio_wait(inode);
 
 	error = iomap_file_unshare(inode, offset, len,
-			&xfs_buffered_write_iomap_ops, NULL);
+			&xfs_buffered_write_iomap_ops, &ibc);
 	if (error)
 		goto out;
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 152353164a5a..af8ead9ac362 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -182,7 +182,7 @@ struct iomap_ops {
 	 * This is called with the folio over the specified file position
 	 * held locked by the iomap code.
 	 */
-	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+	bool (*iomap_valid)(const struct iomap_iter *iter);
 };
 
 /**


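The xfs_iomap_sample_seq() helper called throughout the hunks above is
defined earlier in this patch and not quoted here.  Based on the
xfs_iomap_buffered_ctx layout added to xfs_iomap.h, a minimal sketch of
what it presumably does (hypothetical code, including an assumed NULL
guard for ops that pass no context) looks like:

/*
 * Hypothetical sketch only; the real helper is introduced by this patch
 * in a hunk not quoted above.
 */
static inline void
xfs_iomap_sample_seq(
	const struct iomap_iter		*iter,
	struct xfs_inode		*ip)
{
	struct xfs_iomap_buffered_ctx	*ibc = iter->private;

	if (!ibc)
		return;	/* assumed: some ops may not pass a context */

	ibc->data_seq = READ_ONCE(ip->i_df.if_seq);
	ibc->has_cow_seq = xfs_inode_has_cow_data(ip);
	if (ibc->has_cow_seq)
		ibc->cow_seq = READ_ONCE(ip->i_cowfp->if_seq);
}

Note that the xattr path above previously sampled i_af.if_seq, so the
real helper may do more than this sketch shows.  The buffered write and
page fault paths changed earlier in the series presumably stack-allocate
the same context, just as xfs_zero_range() and xfs_truncate_page() do
above.
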
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 12/14] xfs: validate COW fork sequence counters during buffered writes
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (10 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 11/14] xfs: move the seq counters for buffered writes to a private struct Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-09 18:16 ` [PATCH 13/14] xfs: add debug knob to slow down writeback for fun Darrick J. Wong
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

In the previous patch, we converted the iomap revalidation code to
track the data and COW fork sequence counters in an explicit context
object.  The existing validation function only checked the data fork
sequence counter, so now let's make it validate both.

I /think/ this isn't actually necessary here because we're writing to
the page cache, and the page state does not track or care about cow
status.  However, this question came up when Dave and I were chatting
about this patchset on IRC, so here it is for formal consideration.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c |    3 +++
 1 file changed, 3 insertions(+)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c3c23524a3d2..5e746df2c63f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1370,6 +1370,9 @@ xfs_buffered_write_iomap_valid(
 
 	if (ibc->data_seq != READ_ONCE(ip->i_df.if_seq))
 		return false;
+	if (ibc->has_cow_seq &&
+	    ibc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
+		return false;
 	return true;
 }
 

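Reconstructed from the hunk above together with the previous patch's
version of the function, the complete validator after this patch reads:

static bool
xfs_buffered_write_iomap_valid(
	const struct iomap_iter		*iter)
{
	struct xfs_iomap_buffered_ctx	*ibc = iter->private;
	struct xfs_inode		*ip = XFS_I(iter->inode);

	if (ibc->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;
	if (ibc->has_cow_seq &&
	    ibc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
		return false;
	return true;
}

The has_cow_seq flag keeps the COW check from dereferencing i_cowfp on
inodes that never had a COW fork.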

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 13/14] xfs: add debug knob to slow down writeback for fun
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (11 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 12/14] xfs: validate COW fork sequence counters during buffered writes Darrick J. Wong
@ 2022-11-09 18:16 ` Darrick J. Wong
  2022-11-09 18:17 ` [PATCH 14/14] xfs: add debug knob to slow down write " Darrick J. Wong
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:16 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Add a new error injection knob so that we can arbitrarily slow down
writeback to test for race conditions and aberrant reclaim behavior
when writeback is slow to clean pages.  This will enable
functional testing for the ifork sequence counters introduced in commit
745b3f76d1c8 ("xfs: maintain a sequence count for inode fork
manipulations").

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_errortag.h |    4 +++-
 fs/xfs/xfs_aops.c            |   12 ++++++++++--
 fs/xfs/xfs_error.c           |   13 +++++++++++++
 fs/xfs/xfs_error.h           |   22 ++++++++++++++++++++++
 4 files changed, 48 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 580ccbd5aadc..7c723a36df02 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -61,7 +61,8 @@
 #define XFS_ERRTAG_LARP					39
 #define XFS_ERRTAG_DA_LEAF_SPLIT			40
 #define XFS_ERRTAG_ATTR_LEAF_TO_NODE			41
-#define XFS_ERRTAG_MAX					42
+#define XFS_ERRTAG_WB_DELAY_MS				44
+#define XFS_ERRTAG_MAX					45
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -107,5 +108,6 @@
 #define XFS_RANDOM_LARP					1
 #define XFS_RANDOM_DA_LEAF_SPLIT			1
 #define XFS_RANDOM_ATTR_LEAF_TO_NODE			1
+#define XFS_RANDOM_WB_DELAY_MS				3000
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5d1a995b15f8..95200aa77676 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -17,6 +17,8 @@
 #include "xfs_bmap.h"
 #include "xfs_bmap_util.h"
 #include "xfs_reflink.h"
+#include "xfs_errortag.h"
+#include "xfs_error.h"
 
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
@@ -218,11 +220,15 @@ xfs_imap_valid(
 	 * checked (and found nothing at this offset) could have added
 	 * overlapping blocks.
 	 */
-	if (XFS_WPC(wpc)->data_seq != READ_ONCE(ip->i_df.if_seq))
+	if (XFS_WPC(wpc)->data_seq != READ_ONCE(ip->i_df.if_seq)) {
+		XFS_ERRORTAG_REPORT(ip->i_mount, XFS_ERRTAG_WB_DELAY_MS);
 		return false;
+	}
 	if (xfs_inode_has_cow_data(ip) &&
-	    XFS_WPC(wpc)->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
+	    XFS_WPC(wpc)->cow_seq != READ_ONCE(ip->i_cowfp->if_seq)) {
+		XFS_ERRORTAG_REPORT(ip->i_mount, XFS_ERRTAG_WB_DELAY_MS);
 		return false;
+	}
 	return true;
 }
 
@@ -286,6 +292,8 @@ xfs_map_blocks(
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
+	XFS_ERRORTAG_DELAY(mp, XFS_ERRTAG_WB_DELAY_MS);
+
 	/*
 	 * COW fork blocks can overlap data fork blocks even if the blocks
 	 * aren't shared.  COW I/O always takes precedent, so we must always
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index dea3c0649d2f..b949cfe7dd18 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -60,6 +60,9 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_LARP,
 	XFS_RANDOM_DA_LEAF_SPLIT,
 	XFS_RANDOM_ATTR_LEAF_TO_NODE,
+	0, /* 42 */
+	0, /* 43 */
+	XFS_RANDOM_WB_DELAY_MS,
 };
 
 struct xfs_errortag_attr {
@@ -175,6 +178,7 @@ XFS_ERRORTAG_ATTR_RW(ag_resv_fail, XFS_ERRTAG_AG_RESV_FAIL);
 XFS_ERRORTAG_ATTR_RW(larp,		XFS_ERRTAG_LARP);
 XFS_ERRORTAG_ATTR_RW(da_leaf_split,	XFS_ERRTAG_DA_LEAF_SPLIT);
 XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node,	XFS_ERRTAG_ATTR_LEAF_TO_NODE);
+XFS_ERRORTAG_ATTR_RW(wb_delay_ms,	XFS_ERRTAG_WB_DELAY_MS);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -218,6 +222,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(larp),
 	XFS_ERRORTAG_ATTR_LIST(da_leaf_split),
 	XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
+	XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_errortag);
@@ -267,6 +272,14 @@ xfs_errortag_valid(
 	return true;
 }
 
+bool
+xfs_errortag_enabled(
+	struct xfs_mount	*mp,
+	unsigned int		tag)
+{
+	return mp->m_errortag && mp->m_errortag[tag] != 0;
+}
+
 bool
 xfs_errortag_test(
 	struct xfs_mount	*mp,
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 5191e9145e55..15baa5428e1a 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -45,6 +45,26 @@ extern bool xfs_errortag_test(struct xfs_mount *mp, const char *expression,
 		const char *file, int line, unsigned int error_tag);
 #define XFS_TEST_ERROR(expr, mp, tag)		\
 	((expr) || xfs_errortag_test((mp), #expr, __FILE__, __LINE__, (tag)))
+bool xfs_errortag_enabled(struct xfs_mount *mp, unsigned int tag);
+#define XFS_ERRORTAG_REPORT(mp, tag)		\
+	do { \
+		if (!xfs_errortag_enabled((mp), (tag))) \
+			break; \
+		xfs_warn((mp), \
+"Injected errortag (%s) set to %u at file %s, line %d, on filesystem \"%s\"", \
+				#tag, (mp)->m_errortag[(tag)], \
+				__FILE__, __LINE__, (mp)->m_super->s_id); \
+	} while (0)
+#define XFS_ERRORTAG_DELAY(mp, tag)		\
+	do { \
+		if (!xfs_errortag_enabled((mp), (tag))) \
+			break; \
+		xfs_warn_ratelimited((mp), \
+"Injecting %ums delay at file %s, line %d, on filesystem \"%s\"", \
+				(mp)->m_errortag[(tag)], __FILE__, __LINE__, \
+				(mp)->m_super->s_id); \
+		mdelay((mp)->m_errortag[(tag)]); \
+	} while (0)
 
 extern int xfs_errortag_get(struct xfs_mount *mp, unsigned int error_tag);
 extern int xfs_errortag_set(struct xfs_mount *mp, unsigned int error_tag,
@@ -55,6 +75,8 @@ extern int xfs_errortag_clearall(struct xfs_mount *mp);
 #define xfs_errortag_init(mp)			(0)
 #define xfs_errortag_del(mp)
 #define XFS_TEST_ERROR(expr, mp, tag)		(expr)
+#define XFS_ERRORTAG_REPORT(mp, tag)		((void)0)
+#define XFS_ERRORTAG_DELAY(mp, tag)		((void)0)
 #define xfs_errortag_set(mp, tag, val)		(ENOSYS)
 #define xfs_errortag_add(mp, tag)		(ENOSYS)
 #define xfs_errortag_clearall(mp)		(ENOSYS)


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 14/14] xfs: add debug knob to slow down write for fun
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (12 preceding siblings ...)
  2022-11-09 18:16 ` [PATCH 13/14] xfs: add debug knob to slow down writeback for fun Darrick J. Wong
@ 2022-11-09 18:17 ` Darrick J. Wong
  2022-11-09 18:22 ` [PATCH 15/14] fstest: regression test for writeback corruption bug Darrick J. Wong
  2022-11-09 18:23 ` [PATCH 16/14] fstest: regression test for writes racing with reclaim writeback Darrick J. Wong
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:17 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Add a new error injection knob so that we can arbitrarily slow down
pagecache writes to test for race conditions and aberrant reclaim
behavior when writeback is slow to clean pages.  This will enable
functional testing for the ifork sequence counters introduced in
commit XXXXXXXXXXXX, which fixes writes racing with reclaim
writeback.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_errortag.h |    4 +++-
 fs/xfs/xfs_error.c           |    3 +++
 fs/xfs/xfs_iomap.c           |   12 ++++++++++--
 3 files changed, 16 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 7c723a36df02..7e04523f44f2 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -62,7 +62,8 @@
 #define XFS_ERRTAG_DA_LEAF_SPLIT			40
 #define XFS_ERRTAG_ATTR_LEAF_TO_NODE			41
 #define XFS_ERRTAG_WB_DELAY_MS				44
-#define XFS_ERRTAG_MAX					45
+#define XFS_ERRTAG_WRITE_DELAY_MS			45
+#define XFS_ERRTAG_MAX					46
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -109,5 +110,6 @@
 #define XFS_RANDOM_DA_LEAF_SPLIT			1
 #define XFS_RANDOM_ATTR_LEAF_TO_NODE			1
 #define XFS_RANDOM_WB_DELAY_MS				3000
+#define XFS_RANDOM_WRITE_DELAY_MS			3000
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index b949cfe7dd18..ee00f5ab11a7 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -63,6 +63,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	0, /* 42 */
 	0, /* 43 */
 	XFS_RANDOM_WB_DELAY_MS,
+	XFS_RANDOM_WRITE_DELAY_MS,
 };
 
 struct xfs_errortag_attr {
@@ -179,6 +180,7 @@ XFS_ERRORTAG_ATTR_RW(larp,		XFS_ERRTAG_LARP);
 XFS_ERRORTAG_ATTR_RW(da_leaf_split,	XFS_ERRTAG_DA_LEAF_SPLIT);
 XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node,	XFS_ERRTAG_ATTR_LEAF_TO_NODE);
 XFS_ERRORTAG_ATTR_RW(wb_delay_ms,	XFS_ERRTAG_WB_DELAY_MS);
+XFS_ERRORTAG_ATTR_RW(write_delay_ms,	XFS_ERRTAG_WRITE_DELAY_MS);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -223,6 +225,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(da_leaf_split),
 	XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
 	XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
+	XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_errortag);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5e746df2c63f..048e7d9a739b 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -27,6 +27,8 @@
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
 
 #define XFS_ALLOC_ALIGN(mp, off) \
 	(((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
@@ -1368,11 +1370,17 @@ xfs_buffered_write_iomap_valid(
 	struct xfs_iomap_buffered_ctx	*ibc = iter->private;
 	struct xfs_inode		*ip = XFS_I(iter->inode);
 
-	if (ibc->data_seq != READ_ONCE(ip->i_df.if_seq))
+	if (ibc->data_seq != READ_ONCE(ip->i_df.if_seq)) {
+		XFS_ERRORTAG_REPORT(ip->i_mount, XFS_ERRTAG_WRITE_DELAY_MS);
 		return false;
+	}
 	if (ibc->has_cow_seq &&
-	    ibc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
+	    ibc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq)) {
+		XFS_ERRORTAG_REPORT(ip->i_mount, XFS_ERRTAG_WRITE_DELAY_MS);
 		return false;
+	}
+
+	XFS_ERRORTAG_DELAY(ip->i_mount, XFS_ERRTAG_WRITE_DELAY_MS);
 	return true;
 }
 


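Either knob can also be armed by hand through sysfs; a minimal C sketch,
assuming the standard /sys/fs/xfs/<device>/errortag/ layout that the
tests later in this series resolve via _find_xfs_mountdev_errortag_knob
(the device name in the usage example is a placeholder):

#include <stdio.h>

/* Write a millisecond value into an XFS errortag knob via sysfs. */
static int set_errortag(const char *dev, const char *tag, unsigned int ms)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/xfs/%s/errortag/%s", dev, tag);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%u\n", ms) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

/* Example: set_errortag("sda5", "write_delay_ms", 500); */

Writing 0 disables the delay again, as the tests below do once the
revalidation has been observed.
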
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 15/14] fstest: regression test for writeback corruption bug
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (13 preceding siblings ...)
  2022-11-09 18:17 ` [PATCH 14/14] xfs: add debug knob to slow down write " Darrick J. Wong
@ 2022-11-09 18:22 ` Darrick J. Wong
  2022-11-09 18:23 ` [PATCH 16/14] fstest: regression test for writes racing with reclaim writeback Darrick J. Wong
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:22 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

This is a regression test for a data corruption bug that existed in XFS'
copy on write code between 4.9 and 4.19.  The root cause is a
concurrency bug wherein we would drop ILOCK_SHARED after querying the
CoW fork in xfs_map_cow and retake it before querying the data fork in
xfs_map_blocks.  See the test description for a lot more details.

Cc: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/924     |  197 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/924.out |    2 +
 2 files changed, 199 insertions(+)
 create mode 100755 tests/xfs/924
 create mode 100644 tests/xfs/924.out

diff --git a/tests/xfs/924 b/tests/xfs/924
new file mode 100755
index 0000000000..486afefedd
--- /dev/null
+++ b/tests/xfs/924
@@ -0,0 +1,197 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test 924
+#
+# This is a regression test for a data corruption bug that existed in XFS' copy
+# on write code between 4.9 and 4.19.  The root cause is a concurrency bug
+# wherein we would drop ILOCK_SHARED after querying the CoW fork in xfs_map_cow
+# and retake it before querying the data fork in xfs_map_blocks.  If a second
+# thread changes the CoW fork mappings between the two calls, it's possible for
+# xfs_map_blocks to return a zero-block mapping, which results in writeback
+# being elided for that block.  Elided writeback of dirty data results in
+# silent loss of writes.
+#
+# Worse yet, kernels from that era still used buffer heads, which means that an
+# elided writeback leaves the page clean but the bufferheads dirty.  Due to a
+# naïve optimization in mark_buffer_dirty, the SetPageDirty call is elided if
+# the bufferhead is dirty, which means that a subsequent rewrite of the data
+# block will never result in the page being marked dirty, and all subsequent
+# writes are lost.
+#
+# It turns out that Christoph Hellwig unwittingly fixed the race in commit
+# 5c665e5b5af6 ("xfs: remove xfs_map_cow"), and no testcase was ever written.
+# Four years later, we hit it on a production 4.14 kernel.  This testcase
+# relies on a debugging knob that introduces artificial delays into writeback.
+#
+# Before the race, the file blocks 0-1 are not shared and blocks 2-5 are
+# shared.  There are no extents in CoW fork.
+#
+# Two threads race like this:
+#
+# Thread 1 (writeback block 0)     | Thread 2  (write to block 2)
+# ---------------------------------|--------------------------------
+#                                  |
+# 1. Check if block 0 in CoW fork  |
+#    from xfs_map_cow.             |
+#                                  |
+# 2. Block 0 not found in CoW      |
+#    fork; the block is considered |
+#    not shared.                   |
+#                                  |
+# 3. xfs_map_blocks looks up data  |
+#    fork to get a map covering    |
+#    block 0.                      |
+#                                  |
+# 4. It gets a data fork mapping   |
+#    for block 0 with length 2.    |
+#                                  |
+#                                  | 1. A buffered write to block 2 sees
+#                                  |    that it is a shared block and no
+#                                  |    extent covers block 2 in CoW fork.
+#                                  |
+#                                  |    It creates a new CoW fork mapping.
+#                                  |    Due to the cowextsize, the new
+#                                  |    extent starts at block 0 with
+#                                  |    length 128.
+#                                  |
+#                                  |
+# 5. It looks up CoW fork again to |
+#    trim the map (0, 2) to a      |
+#    shared block boundary.        |
+#                                  |
+# 5a. It finds (0, 128) in CoW fork|
+# 5b. It trims the data fork map   |
+#     from (0, 2) to (0, 0) (!!!)  |
+#                                  |
+# 6. The xfs_imap_valid call after |
+#    the xfs_map_blocks call checks|
+#    if the mapping (0, 0) covers  |
+#    block 0.  The result is "NO". |
+#                                  |
+# 7. Since block 0 has no physical |
+#    block mapped, it's not added  |
+#    to the ioend.  This is the    |
+#    first problem.                |
+#                                  |
+# 8. xfs_add_to_ioend usually      |
+#    clears the bufferhead dirty   |
+#    flag. Because this is skipped,|
+#    we leave the page clean with  |
+#    the associated buffer head(s) |
+#    dirty (the second problem).   |
+#    Now the dirty state is        |
+#    inconsistent.                 |
+#
+# On newer kernels, this is also a functionality test for the ifork sequence
+# counter because the writeback completions will change the data fork and force
+# revalidations of the wb mapping.
+#
+. ./common/preamble
+_begin_fstest auto quick clone
+
+# Import common functions.
+. ./common/reflink
+. ./common/inject
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs xfs
+_fixed_by_kernel_commit 5c665e5b5af6 "xfs: remove xfs_map_cow"
+_require_error_injection
+_require_scratch_reflink
+_require_cp_reflink
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount >> $seqres.full
+
+knob="$(_find_xfs_mountdev_errortag_knob $SCRATCH_DEV "wb_delay_ms")"
+test -w "$knob" || _notrun "Kernel does not have wb_delay_ms error injector"
+
+blksz=65536
+_require_congruent_file_oplen $SCRATCH_MNT $blksz
+
+# Make sure we have sufficient extent size to create speculative CoW
+# preallocations.
+$XFS_IO_PROG -c 'cowextsize 1m' $SCRATCH_MNT
+
+# Write out a file with the first two blocks unshared and the rest shared.
+_pwrite_byte 0x59 0 $((160 * blksz)) $SCRATCH_MNT/file >> $seqres.full
+_pwrite_byte 0x59 0 $((160 * blksz)) $SCRATCH_MNT/file.compare >> $seqres.full
+sync
+
+_cp_reflink $SCRATCH_MNT/file $SCRATCH_MNT/file.reflink
+
+_pwrite_byte 0x58 0 $((2 * blksz)) $SCRATCH_MNT/file >> $seqres.full
+_pwrite_byte 0x58 0 $((2 * blksz)) $SCRATCH_MNT/file.compare >> $seqres.full
+sync
+
+# Avoid creation of large folios on newer kernels by cycling the mount and
+# immediately writing to the page cache.
+_scratch_cycle_mount
+
+# Write the same data to file.compare as we're about to do to file.  Do this
+# before slowing down writeback to avoid unnecessary delay.
+_pwrite_byte 0x57 0 $((2 * blksz)) $SCRATCH_MNT/file.compare >> $seqres.full
+_pwrite_byte 0x56 $((2 * blksz)) $((2 * blksz)) $SCRATCH_MNT/file.compare >> $seqres.full
+sync
+
+# Introduce a half-second wait to each writeback block mapping call.  This
+# gives us a chance to race speculative cow prealloc with writeback.
+wb_delay=500
+echo $wb_delay > $knob
+curval="$(cat $knob)"
+test "$curval" -eq $wb_delay || echo "expected wb_delay_ms == $wb_delay"
+
+# Start thread 1 + writeback above
+$XFS_IO_PROG -c "pwrite -S 0x57 0 $((2 * blksz))" \
+	-c 'bmap -celpv' -c 'bmap -elpv' \
+	-c 'fsync' $SCRATCH_MNT/file >> $seqres.full &
+sleep 1
+
+# Start a sentry to look for evidence of the XFS_ERRORTAG_REPORT logging.  If
+# we see that, we know we've forced writeback to revalidate a mapping.  The
+# test has been successful, so turn off the delay.
+sentryfile=$TEST_DIR/$seq.sentry
+wait_for_errortag() {
+	while [ -e "$sentryfile" ]; do
+		if _check_dmesg_for XFS_ERRTAG_WB_DELAY_MS; then
+			echo 0 > $knob
+			break;
+		fi
+		sleep 1
+	done
+}
+touch $sentryfile
+wait_for_errortag &
+
+# Start thread 2 to create the cowextsize reservation
+$XFS_IO_PROG -c "pwrite -S 0x56 $((2 * blksz)) $((2 * blksz))" \
+	-c 'bmap -celpv' -c 'bmap -elpv' \
+	-c 'fsync' $SCRATCH_MNT/file >> $seqres.full
+rm -f $sentryfile
+
+_check_dmesg_for XFS_ERRTAG_WB_DELAY_MS
+saw_delay=$?
+
+# Flush everything to disk.  If the bug manifests, then after the cycle,
+# file should have stale 0x58 in block 0 because we silently dropped a write.
+_scratch_cycle_mount
+
+if ! cmp -s $SCRATCH_MNT/file $SCRATCH_MNT/file.compare; then
+	echo file and file.compare do not match
+	$XFS_IO_PROG -c 'bmap -celpv' -c 'bmap -elpv' $SCRATCH_MNT/file >> $seqres.full
+	echo file.compare
+	od -tx1 -Ad -c $SCRATCH_MNT/file.compare
+	echo file
+	od -tx1 -Ad -c $SCRATCH_MNT/file
+elif [ $saw_delay -ne 0 ]; then
+	# The files matched, but nothing got logged about the revalidation?
+	echo "Expected to hear about XFS_ERRTAG_WB_DELAY_MS?"
+fi
+
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/924.out b/tests/xfs/924.out
new file mode 100644
index 0000000000..c6655da35a
--- /dev/null
+++ b/tests/xfs/924.out
@@ -0,0 +1,2 @@
+QA output created by 924
+Silence is golden

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 16/14] fstest: regression test for writes racing with reclaim writeback
  2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
                   ` (14 preceding siblings ...)
  2022-11-09 18:22 ` [PATCH 15/14] fstest: regression test for writeback corruption bug Darrick J. Wong
@ 2022-11-09 18:23 ` Darrick J. Wong
  15 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2022-11-09 18:23 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner; +Cc: linux-xfs, linux-fsdevel, david, hch

From: Darrick J. Wong <djwong@kernel.org>

This test uses the new write delay debug knobs to set up a slow-moving
write racing with writeback of an unwritten block at the end of the
file range targeted by the slow write.  The test succeeds if the file
does not end up corrupt and if the kernel log contains a log message
about the revalidation occurring.

NOTE: I'm not convinced that madvise actually causes the page to be
removed from the pagecache, which means that this is effectively a
functional test for the invalidation, not a corruption reproducer.
On the other hand, the functional test can be done quickly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/925     |  110 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/925.out |    2 +
 2 files changed, 112 insertions(+)
 create mode 100755 tests/xfs/925
 create mode 100644 tests/xfs/925.out

diff --git a/tests/xfs/925 b/tests/xfs/925
new file mode 100755
index 0000000000..53699c0167
--- /dev/null
+++ b/tests/xfs/925
@@ -0,0 +1,110 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test 925
+#
+# This is a regression test for a data corruption bug that existed in iomap's
+# buffered write routines.
+#
+. ./common/preamble
+_begin_fstest auto quick rw
+
+# Import common functions.
+. ./common/inject
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs xfs
+#_fixed_by_kernel_commit 5c665e5b5af6 "xfs: remove xfs_map_cow"
+_require_error_injection
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount >> $seqres.full
+
+knob="$(_find_xfs_mountdev_errortag_knob $SCRATCH_DEV "write_delay_ms")"
+test -w "$knob" || _notrun "Kernel does not have write_delay_ms error injector"
+
+blocks=10
+blksz=$(get_page_size)
+filesz=$((blocks * blksz))
+dirty_offset=$(( filesz - 1 ))
+write_len=$(( ( (blocks - 1) * blksz) + 1 ))
+
+# Create a large file with a large unwritten range.
+$XFS_IO_PROG -f -c "falloc 0 $filesz" $SCRATCH_MNT/file >> $seqres.full
+
+# Write the same data to file.compare as we're about to do to file.  Do this
+# before slowing down writeback to avoid unnecessary delay.
+_pwrite_byte 0x58 $dirty_offset 1 $SCRATCH_MNT/file.compare >> $seqres.full
+_pwrite_byte 0x57 0 $write_len $SCRATCH_MNT/file.compare >> $seqres.full
+
+# Reinitialize the page cache for this file.
+_scratch_cycle_mount
+
+# Dirty the last page in the range and immediately set the write delay so that
+# any subsequent writes have to wait.
+$XFS_IO_PROG -c "pwrite -S 0x58 $dirty_offset 1" $SCRATCH_MNT/file >> $seqres.full
+write_delay=500
+echo $write_delay > $knob
+curval="$(cat $knob)"
+test "$curval" -eq $write_delay || echo "expected write_delay_ms == $write_delay"
+
+# Start a sentry to look for evidence of the XFS_ERRORTAG_REPORT logging.  If
+# we see that, we know we've forced writeback to revalidate a mapping.  The
+# test has been successful, so turn off the delay.
+sentryfile=$TEST_DIR/$seq.sentry
+wait_for_errortag() {
+	while [ -e "$sentryfile" ]; do
+		if _check_dmesg_for XFS_ERRTAG_WRITE_DELAY_MS; then
+			echo 0 > $knob
+			break;
+		fi
+		sleep 1
+	done
+}
+touch $sentryfile
+wait_for_errortag &
+
+# Start thread 1 + writeback above
+($XFS_IO_PROG -c "pwrite -S 0x57 -b $write_len 0 $write_len" \
+	-c 'bmap -celpv' -c 'bmap -elpv' \
+	$SCRATCH_MNT/file >> $seqres.full; rm -f $sentryfile) &
+sleep 1
+
+# Start thread 2 to simulate reclaim writeback via sync_file_range and
+# madvise to drop the page cache.  XXX does that actually work?
+#	-c "fadvise -d $dirty_offset 1" \
+dirty_pageoff=$((filesz - blksz))
+$XFS_IO_PROG -c "sync_range -a -w $dirty_pageoff $blksz" \
+	-c "mmap -r 0 $filesz" \
+	-c "madvise -d 0 $filesz" \
+	-c 'bmap -celpv' -c 'bmap -elpv' \
+	$SCRATCH_MNT/file >> $seqres.full
+wait
+rm -f $sentryfile
+
+_check_dmesg_for XFS_ERRTAG_WRITE_DELAY_MS
+saw_delay=$?
+
+# Flush everything to disk.  If the bug manifests, then after the cycle the
+# last block will contain zeroes (losing the 0x58 byte) from a stale mapping.
+_scratch_cycle_mount
+
+if ! cmp -s $SCRATCH_MNT/file $SCRATCH_MNT/file.compare; then
+	echo file and file.compare do not match
+	$XFS_IO_PROG -c 'bmap -celpv' -c 'bmap -elpv' $SCRATCH_MNT/file >> $seqres.full
+	echo file.compare
+	od -tx1 -Ad -c $SCRATCH_MNT/file.compare
+	echo file
+	od -tx1 -Ad -c $SCRATCH_MNT/file
+elif [ $saw_delay -ne 0 ]; then
+	# The files matched, but nothing got logged about the revalidation?
+	echo "Expected to hear about XFS_ERRTAG_WRITE_DELAY_MS?"
+fi
+
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/925.out b/tests/xfs/925.out
new file mode 100644
index 0000000000..95088ce8a5
--- /dev/null
+++ b/tests/xfs/925.out
@@ -0,0 +1,2 @@
+QA output created by 925
+Silence is golden

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 04/14] xfs: buffered write failure should not truncate the page cache
  2022-11-09 18:16 ` [PATCH 04/14] xfs: buffered write failure should not truncate the page cache Darrick J. Wong
@ 2022-11-15  8:32   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, david, hch

On Wed, Nov 09, 2022 at 10:16:07AM -0800, Darrick J. Wong wrote:
>  mm/filemap.c       |    1 

These new exports are another really good indicator that the
punch out delalloc code really should live in iomap and not in
the file system.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/14] iomap: pass iter to ->iomap_begin implementations
  2022-11-09 18:16 ` [PATCH 08/14] iomap: pass iter to ->iomap_begin implementations Darrick J. Wong
@ 2022-11-15  8:36   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, david, hch

On Wed, Nov 09, 2022 at 10:16:29AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Clean up the ->iomap_begin call sites by passing a pointer to the iter
> structure into the iomap_begin functions.  This will be needed to clean
> up the xfs race condition fixes in the next patch, and will hopefully
> reduce register pressure as well.

So if we go back to the iomap_iter idea from willy, it did something
similar, but replaced ->iomap_begin and ->iomap_end with a single
iter callback that basically implemented all of iomap_iter().

Depending on your viewpoint, this patch gets us closer to that
or is pointless churn :)

Technically the changes look fine to me.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/14] iomap: pass a private pointer to iomap_file_buffered_write
  2022-11-09 18:16 ` [PATCH 10/14] iomap: pass a private pointer to iomap_file_buffered_write Darrick J. Wong
@ 2022-11-15  8:37   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, david, hch

On Wed, Nov 09, 2022 at 10:16:40AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Allow filesystems to pass a filesystem-private pointer into iomap for
> writes into the pagecache.  This will now be accessible from
> ->iomap_begin implementations, which is key to being able to revalidate
> mappings after taking folio locks.

Looks good (and this is similar to what we already do for other
users of iomap):

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 11/14] xfs: move the seq counters for buffered writes to a private struct
  2022-11-09 18:16 ` [PATCH 11/14] xfs: move the seq counters for buffered writes to a private struct Darrick J. Wong
@ 2022-11-15  8:38   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, david, hch

This looks reasonable, but needs to be folded into the patch
introducing the sequence validation.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-11-15  8:38 UTC | newest]

Thread overview: 21+ messages
2022-11-09 18:15 [PATCHSET RFCRAP v2 00/14] xfs, iomap: fix data corruption due to stale cached iomaps Darrick J. Wong
2022-11-09 18:15 ` [PATCH 01/14] xfs: write page faults in iomap are not buffered writes Darrick J. Wong
2022-11-09 18:15 ` [PATCH 02/14] xfs: punching delalloc extents on write failure is racy Darrick J. Wong
2022-11-09 18:16 ` [PATCH 03/14] xfs: use byte ranges for write cleanup ranges Darrick J. Wong
2022-11-09 18:16 ` [PATCH 04/14] xfs: buffered write failure should not truncate the page cache Darrick J. Wong
2022-11-15  8:32   ` Christoph Hellwig
2022-11-09 18:16 ` [PATCH 05/14] iomap: write iomap validity checks Darrick J. Wong
2022-11-09 18:16 ` [PATCH 06/14] xfs: use iomap_valid method to detect stale cached iomaps Darrick J. Wong
2022-11-09 18:16 ` [PATCH 07/14] xfs: drop write error injection is unfixable, remove it Darrick J. Wong
2022-11-09 18:16 ` [PATCH 08/14] iomap: pass iter to ->iomap_begin implementations Darrick J. Wong
2022-11-15  8:36   ` Christoph Hellwig
2022-11-09 18:16 ` [PATCH 09/14] iomap: pass iter to ->iomap_end implementations Darrick J. Wong
2022-11-09 18:16 ` [PATCH 10/14] iomap: pass a private pointer to iomap_file_buffered_write Darrick J. Wong
2022-11-15  8:37   ` Christoph Hellwig
2022-11-09 18:16 ` [PATCH 11/14] xfs: move the seq counters for buffered writes to a private struct Darrick J. Wong
2022-11-15  8:38   ` Christoph Hellwig
2022-11-09 18:16 ` [PATCH 12/14] xfs: validate COW fork sequence counters during buffered writes Darrick J. Wong
2022-11-09 18:16 ` [PATCH 13/14] xfs: add debug knob to slow down writeback for fun Darrick J. Wong
2022-11-09 18:17 ` [PATCH 14/14] xfs: add debug knob to slow down write " Darrick J. Wong
2022-11-09 18:22 ` [PATCH 15/14] fstest: regression test for writeback corruption bug Darrick J. Wong
2022-11-09 18:23 ` [PATCH 16/14] fstest: regression test for writes racing with reclaim writeback Darrick J. Wong
