linux-mm.kvack.org archive mirror
* [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps
@ 2022-11-15  1:30 Dave Chinner
  2022-11-15  1:30 ` [PATCH 1/9] mm: export mapping_seek_hole_data() Dave Chinner
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

[cc linux-mm for exporting mapping_seek_hole_data()]

Recently a customer workload encountered a data corruption in a
specific multi-threaded write operation. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.

The result of this was random partial blocks containing zeroes
instead of the correct data.  The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.

The short story is that an iomap can span multiple folios, and so
under low memory conditions writeback can be cleaning folios that
the write() overlaps. Whilst the overlapping data is cached in
memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk.  This is how we get random zero ranges in the file
instead of the correct data.

The gory details of the race condition can be found here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

Fixing the problem has two aspects. The first aspect of the problem
is ensuring that iomap can detect a stale cached iomap during a
write in a race-free manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.

In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.

If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations are relatively simple....
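
Condensed, the check added to iomap_write_begin() later in this
series sits between the folio lookup/lock and any folio
initialisation, roughly like this (a sketch only - error handling
elided, names as used in the iomap patch):

	/* folio has just been looked up and locked, nothing done to it yet */
	if (page_ops && page_ops->iomap_valid &&
	    !page_ops->iomap_valid(iter->inode, &iter->iomap)) {
		iter->iomap.flags |= IOMAP_F_STALE;	/* tell the iterator to remap */
		folio_unlock(folio);
		folio_put(folio);
		return 0;				/* ends the write short */
	}
	/* only now is it safe to zero, initialise or copy into the folio */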

However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.

This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure handling in the filesystem needs to account for the
fact that the
write() in progress doesn't necessarily own the data in the page
cache over the range of the delalloc extent it just allocated.

As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch out the whole delalloc extent.
We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.

That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.

Eventually, I realised that if the XFS code tracked the processed
write regions by byte ranges instead of filesystem blocks or page
cache indexes, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte-based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
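
In code, the scan loop boils down to something like the sketch below
(simplified from the XFS patch later in this series; error handling
and the per-folio dirty check are elided, and "mapping" is the
inode's address_space):

	loff_t	start = start_byte;

	while (start < end_byte) {
		loff_t	data_end;

		/* find the start of the next cached data region */
		start = mapping_seek_hole_data(mapping, start, end_byte,
				SEEK_DATA);
		if (start == -ENXIO || start >= end_byte)
			break;		/* no more data in the range */

		/* find where that data region ends */
		data_end = mapping_seek_hole_data(mapping, start, end_byte,
				SEEK_HOLE);

		/*
		 * [start, data_end) is cached data - walk its folios and
		 * punch delalloc blocks around the dirty ones.
		 */
		start = data_end;
	}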

Of course, now xfs/196 fails. This is an error injection test that is
supposed to exercise the delalloc extent recovery code that the above
fixes just completely reworked. The error injection assumes that it
can just truncate the page cache over the write and then punch out
the delalloc extent completely. This is fundamentally broken, and has
only been working by chance - the chance being that writes are page
aligned and page aligned writes don't install large folios in the
page cache.

IOWs, with sub-folio block sizes, and not knowing what size folios
are in the cache, we can't actually guarantee that we can remove the
cached dirty folios from the cache via truncation, and hence the new
code will not remove the delalloc extents under those dirty folios.
As a result, the error injection results in zeroes being written to
disk rather than the delalloc extents being removed from memory. I
can't make this error injection work the way it was intended, so I
removed it. The code that it is supposed to exercise is now exercised
every time we detect a stale iomap, so we have much better coverage
of the failed write error handling than the error injection provides
us with, anyway....

So, this passes fstests on 1kb and 4kb block sizes and the data
corruption reproducer does not detect data corruption, so this set
of fixes is /finally/ something I'd consider ready for merge.
Comments and testing welcome!

-Dave.

Version 2:
- export mapping_seek_hole_data()
- fix missing initialisation of the iomap sequence in xfs_fs_map_blocks() in the
  pnfs code.
- move ->iomap_valid callback to the struct iomap_page_ops so that it is
  returned with the iomap rather than having the iomap_ops plumbed through the
  entire stack.
- added a u64 validity_cookie to the struct iomap for carrying iomap
  verification information along with the iomap itself.
- added IOMAP_F_XATTR for XFS to be able to tell the difference between iomaps
  that map attribute extents instead of file data extents.
- converted the IOMAP_F_* flags to use the (1U << NN) definition pattern.
- added patch to convert xfs_bmap_punch_delalloc_range() to take a byte range.


Version 1:
- https://lore.kernel.org/linux-xfs/20221101003412.3842572-1-david@fromorbit.com/
- complete rework of iomap stale detection
- complete rework of XFS partial delalloc write error handling.

Original RFC:
- https://lore.kernel.org/linux-xfs/20220921082959.1411675-1-david@fromorbit.com/




* [PATCH 1/9] mm: export mapping_seek_hole_data()
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  2022-11-15  8:40   ` Christoph Hellwig
  2022-11-15  1:30 ` [PATCH 2/9] xfs: write page faults in iomap are not buffered writes Dave Chinner
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

XFS needs this for finding cached dirty data regions when cleaning
up short writes, so it needs to be exported as XFS can be built as a
module.
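
For reference, the new user in XFS (added later in this series) ends
up calling it along the lines of the following, where pos and end are
byte offsets into the file:

	/* find the start of the next cached data region in [pos, end) */
	loff_t	data_start = mapping_seek_hole_data(inode->i_mapping,
					pos, end, SEEK_DATA);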

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/filemap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 08341616ae7a..07d255c41c43 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2925,6 +2925,7 @@ loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
 		return end;
 	return start;
 }
+EXPORT_SYMBOL_GPL(mapping_seek_hole_data);
 
 #ifdef CONFIG_MMU
 #define MMAP_LOTSAMISS  (100)
-- 
2.37.2




* [PATCH 2/9] xfs: write page faults in iomap are not buffered writes
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
  2022-11-15  1:30 ` [PATCH 1/9] mm: export mapping_seek_hole_data() Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  2022-11-15  1:30 ` [PATCH 3/9] xfs: punching delalloc extents on write failure is racy Dave Chinner
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
we mark the iomap as IOMAP_F_NEW so that the write context
understands that it allocated the delalloc region.

If we then fail that buffered write, xfs_buffered_write_iomap_end()
checks for the IOMAP_F_NEW flag and if it is set, it punches out
the unused delalloc region that was allocated for the write.

The assumption this code makes is that all buffered write operations
that can allocate space are run under an exclusive lock (i_rwsem).
This is an invalid assumption: page faults in mmap()d regions call
through this same function pair to map the file range being faulted
and this runs only holding the inode->i_mapping->invalidate_lock in
shared mode.

IOWs, we can have races between page faults and write() calls that
fail the nested page cache write operation that result in data loss.
That is, the failing iomap_end call will punch out the data that
the other racing iomap iteration brought into the page cache. This
can be reproduced with generic/34[46] if we arbitrarily fail page
cache copy-in operations from write() syscalls.

Code analysis tells us that the iomap_page_mkwrite() function holds
the already instantiated and uptodate folio locked across the iomap
mapping iterations. Hence the folio cannot be removed from memory
whilst we are mapping the range it covers, and as such we do not
care if the mapping changes state underneath the iomap iteration
loop:

1. if the folio is not already dirty, there are no writeback races
   possible.
2. if we allocated the mapping (delalloc or unwritten), the folio
   cannot already be dirty. See #1.
3. If the folio is already dirty, it must be up to date. As we hold
   it locked, it cannot be reclaimed from memory. Hence we always
   have valid data in the page cache while iterating the mapping.
4. Valid data in the page cache can exist when the underlying
   mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
   change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
   change the data in the page - it only affects actions if we are
   initialising a new page. Hence #3 applies and we don't care
   about these extent map transitions racing with
   iomap_page_mkwrite().
5. iomap_page_mkwrite() checks for page invalidation races
   (truncate, hole punch, etc) after it locks the folio. We also
   hold the mapping->invalidate_lock here, and hence the mapping
   cannot change due to extent removal operations while we are
   iterating the folio.

As such, filesystems that don't use bufferheads will never fail
the iomap_folio_mkwrite_iter() operation on the current mapping,
regardless of whether the iomap should be considered stale.

Further, the range we are asked to iterate is limited to the range
inside EOF that the folio spans. Hence, for XFS, we will only map
the exact range we are asked for, and we will only do speculative
preallocation with delalloc if we are mapping a hole at the EOF
page. The iterator will consume the entire range of the folio that
is within EOF, and anything beyond the EOF block cannot be accessed.
We never need to truncate this post-EOF speculative prealloc away in
the context of the iomap_page_mkwrite() iterator because if it
remains unused we'll remove it when the last reference to the inode
goes away.

Hence we don't actually need an .iomap_end() cleanup/error handling
path at all for iomap_page_mkwrite() for XFS. This means we can
separate the page fault processing from the complexity of the
.iomap_end() processing in the buffered write path. This also means
that the buffered write path will be able to take the
mapping->invalidate_lock as necessary.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c  | 2 +-
 fs/xfs/xfs_iomap.c | 9 +++++++++
 fs/xfs/xfs_iomap.h | 1 +
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e462d39c840e..595a5bcf46b9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1325,7 +1325,7 @@ __xfs_filemap_fault(
 		if (write_fault) {
 			xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 			ret = iomap_page_mkwrite(vmf,
-					&xfs_buffered_write_iomap_ops);
+					&xfs_page_mkwrite_iomap_ops);
 			xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 		} else {
 			ret = filemap_fault(vmf);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 07da03976ec1..5cea069a38b4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1187,6 +1187,15 @@ const struct iomap_ops xfs_buffered_write_iomap_ops = {
 	.iomap_end		= xfs_buffered_write_iomap_end,
 };
 
+/*
+ * iomap_page_mkwrite() will never fail in a way that requires delalloc extents
+ * that it allocated to be revoked. Hence we do not need an .iomap_end method
+ * for this operation.
+ */
+const struct iomap_ops xfs_page_mkwrite_iomap_ops = {
+	.iomap_begin		= xfs_buffered_write_iomap_begin,
+};
+
 static int
 xfs_read_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index c782e8c0479c..0f62ab633040 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -47,6 +47,7 @@ xfs_aligned_fsb_count(
 }
 
 extern const struct iomap_ops xfs_buffered_write_iomap_ops;
+extern const struct iomap_ops xfs_page_mkwrite_iomap_ops;
 extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
-- 
2.37.2




* [PATCH 3/9] xfs: punching delalloc extents on write failure is racy
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
  2022-11-15  1:30 ` [PATCH 1/9] mm: export mapping_seek_hole_data() Dave Chinner
  2022-11-15  1:30 ` [PATCH 2/9] xfs: write page faults in iomap are not buffered writes Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  2022-11-15  8:41   ` Christoph Hellwig
  2022-11-15 23:53   ` Darrick J. Wong
  2022-11-15  1:30 ` [PATCH 4/9] xfs: use byte ranges for write cleanup ranges Dave Chinner
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

xfs_buffered_write_iomap_end() has a comment about the safety of
punching delalloc extents based on holding the IOLOCK_EXCL. This
comment is wrong, and punching delalloc extents is not race free.

When we punch out a delalloc extent after a write failure in
xfs_buffered_write_iomap_end(), we punch out the page cache with
truncate_pagecache_range() before we punch out the delalloc extents.
At this point, we only hold the IOLOCK_EXCL, so there is nothing
stopping mmap() write faults racing with this cleanup operation,
reinstantiating a folio over the range we are about to punch and
hence requiring the delalloc extent to be kept.

If this race condition is hit, we can end up with a dirty page in
the page cache that has no delalloc extent or space reservation
backing it. This leads to bad things happening at writeback time.
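
The interleaving looks roughly like this:

write() task				mmap() page fault
------------				-----------------
xfs_buffered_write_iomap_end()
  truncate_pagecache_range()
					->page_mkwrite()
					  re-instantiates folio over the range
					  dirties the folio
  xfs_bmap_punch_delalloc_range()

<dirty folio left in the page cache with no delalloc extent or
 space reservation backing it>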

To avoid this race condition, we need the page cache truncation to
be atomic w.r.t. the extent manipulation. We can do this by holding
the mapping->invalidate_lock exclusively across this operation -
this will prevent new pages from being inserted into the page cache
whilst we are removing the pages and the backing extent and space
reservation.

Taking the mapping->invalidate_lock exclusively in the buffered
write IO path is safe - it naturally nests inside the IOLOCK (see
truncate and fallocate paths). iomap_zero_range() can be called from
under the mapping->invalidate_lock (from the truncate path via
either xfs_zero_eof() or xfs_truncate_page()), but iomap_zero_iter()
will not instantiate new delalloc pages (because it skips holes) and
hence will not ever need to punch out delalloc extents on failure.

Fix the locking issue, and clean up the code logic a little to avoid
unnecessary work if we didn't allocate the delalloc extent or wrote
the entire region we allocated.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5cea069a38b4..a2e45ea1b0cb 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1147,6 +1147,10 @@ xfs_buffered_write_iomap_end(
 		written = 0;
 	}
 
+	/* If we didn't reserve the blocks, we're not allowed to punch them. */
+	if (!(iomap->flags & IOMAP_F_NEW))
+		return 0;
+
 	/*
 	 * start_fsb refers to the first unused block after a short write. If
 	 * nothing was written, round offset down to point at the first block in
@@ -1158,27 +1162,28 @@ xfs_buffered_write_iomap_end(
 		start_fsb = XFS_B_TO_FSB(mp, offset + written);
 	end_fsb = XFS_B_TO_FSB(mp, offset + length);
 
+	/* Nothing to do if we've written the entire delalloc extent */
+	if (start_fsb >= end_fsb)
+		return 0;
+
 	/*
-	 * Trim delalloc blocks if they were allocated by this write and we
-	 * didn't manage to write the whole range.
-	 *
-	 * We don't need to care about racing delalloc as we hold i_mutex
-	 * across the reserve/allocate/unreserve calls. If there are delalloc
-	 * blocks in the range, they are ours.
+	 * Lock the mapping to avoid races with page faults re-instantiating
+	 * folios and dirtying them via ->page_mkwrite between the page cache
+	 * truncation and the delalloc extent removal. Failing to do this can
+	 * leave dirty pages with no space reservation in the cache.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && start_fsb < end_fsb) {
-		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
-					 XFS_FSB_TO_B(mp, end_fsb) - 1);
-
-		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-					       end_fsb - start_fsb);
-		if (error && !xfs_is_shutdown(mp)) {
-			xfs_alert(mp, "%s: unable to clean up ino %lld",
-				__func__, ip->i_ino);
-			return error;
-		}
+	filemap_invalidate_lock(inode->i_mapping);
+	truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
+				 XFS_FSB_TO_B(mp, end_fsb) - 1);
+
+	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
+				       end_fsb - start_fsb);
+	filemap_invalidate_unlock(inode->i_mapping);
+	if (error && !xfs_is_shutdown(mp)) {
+		xfs_alert(mp, "%s: unable to clean up ino %lld",
+			__func__, ip->i_ino);
+		return error;
 	}
-
 	return 0;
 }
 
-- 
2.37.2




* [PATCH 4/9] xfs: use byte ranges for write cleanup ranges
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
                   ` (2 preceding siblings ...)
  2022-11-15  1:30 ` [PATCH 3/9] xfs: punching delalloc extents on write failure is racy Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  2022-11-15  8:42   ` Christoph Hellwig
  2022-11-15 23:57   ` Darrick J. Wong
  2022-11-15  1:30 ` [PATCH 5/9] xfs: buffered write failure should not truncate the page cache Dave Chinner
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

xfs_buffered_write_iomap_end() currently converts the byte ranges
passed to it to filesystem blocks to pass them to the bmap code to
punch out delalloc blocks, but then has to convert filesystem
blocks back to byte ranges for page cache truncate.

We're about to make the page cache truncate go away and replace it
with a page cache walk, so having to convert everything to/from/to
filesystem blocks is messy and error-prone. It is much easier to
pass around byte ranges and convert to page indexes and/or
filesystem blocks only where those units are needed.

In preparation for the page cache walk being added, add a helper
that converts byte ranges to filesystem blocks and calls
xfs_bmap_punch_delalloc_range() and convert
xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
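
For example, with a 1kiB block size, a write at offset 2100 of length
4000 that only managed to copy in 300 bytes ends up with:

	start_byte = round_up(2100 + 300, 1024)  = 3072
	end_byte   = round_up(2100 + 4000, 1024) = 6144

so the cleanup operates on bytes [3072, 6144) and leaves the block
that received the copied data untouched. If nothing at all had been
written, start_byte would instead be round_down(2100, 1024) = 2048.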

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index a2e45ea1b0cb..7bb55dbc19d3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1120,6 +1120,20 @@ xfs_buffered_write_iomap_begin(
 	return error;
 }
 
+static int
+xfs_buffered_write_delalloc_punch(
+	struct inode		*inode,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	struct xfs_mount	*mp = XFS_M(inode->i_sb);
+	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, start_byte);
+	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end_byte);
+
+	return xfs_bmap_punch_delalloc_range(XFS_I(inode), start_fsb,
+				end_fsb - start_fsb);
+}
+
 static int
 xfs_buffered_write_iomap_end(
 	struct inode		*inode,
@@ -1129,10 +1143,9 @@ xfs_buffered_write_iomap_end(
 	unsigned		flags,
 	struct iomap		*iomap)
 {
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_mount	*mp = ip->i_mount;
-	xfs_fileoff_t		start_fsb;
-	xfs_fileoff_t		end_fsb;
+	struct xfs_mount	*mp = XFS_M(inode->i_sb);
+	loff_t			start_byte;
+	loff_t			end_byte;
 	int			error = 0;
 
 	if (iomap->type != IOMAP_DELALLOC)
@@ -1157,13 +1170,13 @@ xfs_buffered_write_iomap_end(
 	 * the range.
 	 */
 	if (unlikely(!written))
-		start_fsb = XFS_B_TO_FSBT(mp, offset);
+		start_byte = round_down(offset, mp->m_sb.sb_blocksize);
 	else
-		start_fsb = XFS_B_TO_FSB(mp, offset + written);
-	end_fsb = XFS_B_TO_FSB(mp, offset + length);
+		start_byte = round_up(offset + written, mp->m_sb.sb_blocksize);
+	end_byte = round_up(offset + length, mp->m_sb.sb_blocksize);
 
 	/* Nothing to do if we've written the entire delalloc extent */
-	if (start_fsb >= end_fsb)
+	if (start_byte >= end_byte)
 		return 0;
 
 	/*
@@ -1173,15 +1186,12 @@ xfs_buffered_write_iomap_end(
 	 * leave dirty pages with no space reservation in the cache.
 	 */
 	filemap_invalidate_lock(inode->i_mapping);
-	truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
-				 XFS_FSB_TO_B(mp, end_fsb) - 1);
-
-	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-				       end_fsb - start_fsb);
+	truncate_pagecache_range(inode, start_byte, end_byte - 1);
+	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
 	filemap_invalidate_unlock(inode->i_mapping);
 	if (error && !xfs_is_shutdown(mp)) {
-		xfs_alert(mp, "%s: unable to clean up ino %lld",
-			__func__, ip->i_ino);
+		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
+			__func__, XFS_I(inode)->i_ino);
 		return error;
 	}
 	return 0;
-- 
2.37.2




* [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
                   ` (3 preceding siblings ...)
  2022-11-15  1:30 ` [PATCH 4/9] xfs: use byte ranges for write cleanup ranges Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  2022-11-15  8:43   ` Christoph Hellwig
                     ` (2 more replies)
  2022-11-15  1:30 ` [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range Dave Chinner
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

xfs_buffered_write_iomap_end() currently invalidates the page cache
over the unused range of the delalloc extent it allocated. While the
write allocated the delalloc extent, it does not own it exclusively
as the write does not hold any locks that prevent either writeback
or mmap page faults from changing the state of either the page cache
or the extent state backing this range.

Whilst xfs_bmap_punch_delalloc_range() already handles races in
extent conversion - it will only punch out delalloc extents and it
ignores any other type of extent - the page cache truncate does not
discriminate between data written by this write or some other task.
As a result, truncating the page cache can result in data corruption
if the write races with mmap modifications to the file over the same
range.

generic/346 exercises this workload, and if we randomly fail writes
(as will happen when iomap gets stale iomap detection later in the
patchset), it will randomly corrupt the file data because it removes
data written by mmap() in the same page as the write() that failed.

Hence we do not want to punch out the page cache over the range of
the extent we failed to write to - what we actually need to do is
detect the ranges that have dirty data in cache over them and *not
punch them out*.

To do this, we have to walk the page cache over the range of the
delalloc extent we want to remove. This is made complex by the fact
we have to handle partially up-to-date folios correctly and this can
happen even when the FSB size == PAGE_SIZE because we now support
multi-page folios in the page cache.

Because we are only interested in discovering the edges of data
ranges in the page cache (i.e. hole-data boundaries) we can make use
of mapping_seek_hole_data() to find those transitions in the page
cache. As we hold the invalidate_lock, we know that the boundaries
are not going to change while we walk the range. This interface is
also byte-based and is sub-page block aware, so we can find the data
ranges in the cache based on byte offsets rather than page, folio or
fs block sized chunks. This greatly simplifies the logic of finding
dirty cached ranges in the page cache.

Once we've identified a range that contains cached data, we can then
iterate the range folio by folio. This allows us to determine if the
data is dirty and hence perform the correct delalloc extent punching
operations. The seek interface we use to iterate data ranges will
give us sub-folio start/end granularity, so we may end up looking up
the same folio multiple times as the seek interface iterates across
each discontiguous data region in the folio.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 141 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7bb55dbc19d3..2d48fcc7bd6f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
 				end_fsb - start_fsb);
 }
 
+/*
+ * Scan the data range passed to us for dirty page cache folios. If we find a
+ * dirty folio, punch out the preceding range and update the offset from which
+ * the next punch will start from.
+ *
+ * We can punch out clean pages because they either contain data that has been
+ * written back - in which case the delalloc punch over that range is a no-op -
+ * or they were read faulted in, in which case they contain zeroes and we can
+ * remove the delalloc backing range and any new writes to those pages will do
+ * the normal hole filling operation...
+ *
+ * This makes the logic simple: we only need to keep the delalloc extents
+ * over the dirty ranges of the page cache.
+ */
+static int
+xfs_buffered_write_delalloc_scan(
+	struct inode		*inode,
+	loff_t			*punch_start_byte,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	loff_t			offset = start_byte;
+
+	while (offset < end_byte) {
+		struct folio	*folio;
+
+		/* grab locked page */
+		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
+		if (!folio) {
+			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
+			continue;
+		}
+
+		/* if dirty, punch up to offset */
+		if (folio_test_dirty(folio)) {
+			if (offset > *punch_start_byte) {
+				int	error;
+
+				error = xfs_buffered_write_delalloc_punch(inode,
+						*punch_start_byte, offset);
+				if (error) {
+					folio_unlock(folio);
+					folio_put(folio);
+					return error;
+				}
+			}
+
+			/*
+			 * Make sure the next punch start is correctly bound to
+			 * the end of this data range, not the end of the folio.
+			 */
+			*punch_start_byte = min_t(loff_t, end_byte,
+					folio_next_index(folio) << PAGE_SHIFT);
+		}
+
+		/* move offset to start of next folio in range */
+		offset = folio_next_index(folio) << PAGE_SHIFT;
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+	return 0;
+}
+
+/*
+ * Punch out all the delalloc blocks in the range given except for those that
+ * have dirty data still pending in the page cache - those are going to be
+ * written and so must still retain the delalloc backing for writeback.
+ *
+ * As we are scanning the page cache for data, we don't need to reimplement the
+ * wheel - mapping_seek_hole_data() does exactly what we need to identify the
+ * start and end of data ranges correctly even for sub-folio block sizes. This
+ * byte range based iteration is especially convenient because it means we don't
+ * have to care about variable size folios, nor where the start or end of the
+ * data range lies within a folio, if they lie within the same folio or even if
+ * there are multiple discontiguous data ranges within the folio.
+ */
+static int
+xfs_buffered_write_delalloc_release(
+	struct inode		*inode,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	loff_t			punch_start_byte = start_byte;
+	int			error = 0;
+
+	/*
+	 * Lock the mapping to avoid races with page faults re-instantiating
+	 * folios and dirtying them via ->page_mkwrite whilst we walk the
+	 * cache and perform delalloc extent removal. Failing to do this can
+	 * leave dirty pages with no space reservation in the cache.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	while (start_byte < end_byte) {
+		loff_t		data_end;
+
+		start_byte = mapping_seek_hole_data(inode->i_mapping,
+				start_byte, end_byte, SEEK_DATA);
+		/*
+		 * If there is no more data to scan, all that is left is to
+		 * punch out the remaining range.
+		 */
+		if (start_byte == -ENXIO || start_byte == end_byte)
+			break;
+		if (start_byte < 0) {
+			error = start_byte;
+			goto out_unlock;
+		}
+		ASSERT(start_byte >= punch_start_byte);
+		ASSERT(start_byte < end_byte);
+
+		/*
+		 * We find the end of this contiguous cached data range by
+		 * seeking from start_byte to the beginning of the next hole.
+		 */
+		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
+				end_byte, SEEK_HOLE);
+		if (data_end < 0) {
+			error = data_end;
+			goto out_unlock;
+		}
+		ASSERT(data_end > start_byte);
+		ASSERT(data_end <= end_byte);
+
+		error = xfs_buffered_write_delalloc_scan(inode,
+				&punch_start_byte, start_byte, data_end);
+		if (error)
+			goto out_unlock;
+
+		/* The next data search starts at the end of this one. */
+		start_byte = data_end;
+	}
+
+	if (punch_start_byte < end_byte)
+		error = xfs_buffered_write_delalloc_punch(inode,
+				punch_start_byte, end_byte);
+out_unlock:
+	filemap_invalidate_unlock(inode->i_mapping);
+	return error;
+}
+
 static int
 xfs_buffered_write_iomap_end(
 	struct inode		*inode,
@@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end(
 	if (start_byte >= end_byte)
 		return 0;
 
-	/*
-	 * Lock the mapping to avoid races with page faults re-instantiating
-	 * folios and dirtying them via ->page_mkwrite between the page cache
-	 * truncation and the delalloc extent removal. Failing to do this can
-	 * leave dirty pages with no space reservation in the cache.
-	 */
-	filemap_invalidate_lock(inode->i_mapping);
-	truncate_pagecache_range(inode, start_byte, end_byte - 1);
-	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
-	filemap_invalidate_unlock(inode->i_mapping);
+	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
 	if (error && !xfs_is_shutdown(mp)) {
 		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
 			__func__, XFS_I(inode)->i_ino);
-- 
2.37.2




* [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
                   ` (4 preceding siblings ...)
  2022-11-15  1:30 ` [PATCH 5/9] xfs: buffered write failure should not truncate the page cache Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  2022-11-15  8:44   ` Christoph Hellwig
  2022-11-15  1:30 ` [PATCH 7/9] iomap: write iomap validity checks Dave Chinner
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

All the callers of xfs_bmap_punch_delalloc_range() jump through
hoops to convert a byte range to filesystem blocks before calling
xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
xfs_bmap_punch_delalloc_range() and have it do the conversion to
filesystem blocks internally.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c      | 16 ++++++----------
 fs/xfs/xfs_bmap_util.c | 10 ++++++----
 fs/xfs/xfs_bmap_util.h |  2 +-
 fs/xfs/xfs_iomap.c     | 21 ++++-----------------
 4 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5d1a995b15f8..6aadc5815068 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -114,9 +114,8 @@ xfs_end_ioend(
 	if (unlikely(error)) {
 		if (ioend->io_flags & IOMAP_F_SHARED) {
 			xfs_reflink_cancel_cow_range(ip, offset, size, true);
-			xfs_bmap_punch_delalloc_range(ip,
-						      XFS_B_TO_FSBT(mp, offset),
-						      XFS_B_TO_FSB(mp, size));
+			xfs_bmap_punch_delalloc_range(ip, offset,
+					offset + size);
 		}
 		goto done;
 	}
@@ -455,12 +454,8 @@ xfs_discard_folio(
 	struct folio		*folio,
 	loff_t			pos)
 {
-	struct inode		*inode = folio->mapping->host;
-	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_inode	*ip = XFS_I(folio->mapping->host);
 	struct xfs_mount	*mp = ip->i_mount;
-	size_t			offset = offset_in_folio(folio, pos);
-	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, pos);
-	xfs_fileoff_t		pageoff_fsb = XFS_B_TO_FSBT(mp, offset);
 	int			error;
 
 	if (xfs_is_shutdown(mp))
@@ -470,8 +465,9 @@ xfs_discard_folio(
 		"page discard on page "PTR_FMT", inode 0x%llx, pos %llu.",
 			folio, ip->i_ino, pos);
 
-	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			i_blocks_per_folio(inode, folio) - pageoff_fsb);
+	error = xfs_bmap_punch_delalloc_range(ip, pos,
+			round_up(pos, folio_size(folio)));
+
 	if (error && !xfs_is_shutdown(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 }
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 04d0c2bff67c..867645b74d88 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -590,11 +590,13 @@ xfs_getbmap(
 int
 xfs_bmap_punch_delalloc_range(
 	struct xfs_inode	*ip,
-	xfs_fileoff_t		start_fsb,
-	xfs_fileoff_t		length)
+	xfs_off_t		start_byte,
+	xfs_off_t		end_byte)
 {
+	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = &ip->i_df;
-	xfs_fileoff_t		end_fsb = start_fsb + length;
+	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, start_byte);
+	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end_byte);
 	struct xfs_bmbt_irec	got, del;
 	struct xfs_iext_cursor	icur;
 	int			error = 0;
@@ -607,7 +609,7 @@ xfs_bmap_punch_delalloc_range(
 
 	while (got.br_startoff + got.br_blockcount > start_fsb) {
 		del = got;
-		xfs_trim_extent(&del, start_fsb, length);
+		xfs_trim_extent(&del, start_fsb, end_fsb - start_fsb);
 
 		/*
 		 * A delete can push the cursor forward. Step back to the
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 24b37d211f1d..6888078f5c31 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -31,7 +31,7 @@ xfs_bmap_rtalloc(struct xfs_bmalloca *ap)
 #endif /* CONFIG_XFS_RT */
 
 int	xfs_bmap_punch_delalloc_range(struct xfs_inode *ip,
-		xfs_fileoff_t start_fsb, xfs_fileoff_t length);
+		xfs_off_t start_byte, xfs_off_t end_byte);
 
 struct kgetbmap {
 	__s64		bmv_offset;	/* file offset of segment in blocks */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2d48fcc7bd6f..04da22943e7c 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1120,20 +1120,6 @@ xfs_buffered_write_iomap_begin(
 	return error;
 }
 
-static int
-xfs_buffered_write_delalloc_punch(
-	struct inode		*inode,
-	loff_t			start_byte,
-	loff_t			end_byte)
-{
-	struct xfs_mount	*mp = XFS_M(inode->i_sb);
-	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, start_byte);
-	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end_byte);
-
-	return xfs_bmap_punch_delalloc_range(XFS_I(inode), start_fsb,
-				end_fsb - start_fsb);
-}
-
 /*
  * Scan the data range passed to us for dirty page cache folios. If we find a
  * dirty folio, punch out the preceding range and update the offset from which
@@ -1172,8 +1158,9 @@ xfs_buffered_write_delalloc_scan(
 			if (offset > *punch_start_byte) {
 				int	error;
 
-				error = xfs_buffered_write_delalloc_punch(inode,
-						*punch_start_byte, offset);
+				error = xfs_bmap_punch_delalloc_range(
+						XFS_I(inode), *punch_start_byte,
+						offset);
 				if (error) {
 					folio_unlock(folio);
 					folio_put(folio);
@@ -1267,7 +1254,7 @@ xfs_buffered_write_delalloc_release(
 	}
 
 	if (punch_start_byte < end_byte)
-		error = xfs_buffered_write_delalloc_punch(inode,
+		error = xfs_bmap_punch_delalloc_range(XFS_I(inode),
 				punch_start_byte, end_byte);
 out_unlock:
 	filemap_invalidate_unlock(inode->i_mapping);
-- 
2.37.2




* [PATCH 7/9] iomap: write iomap validity checks
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
                   ` (5 preceding siblings ...)
  2022-11-15  1:30 ` [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  2022-11-15  8:45   ` Christoph Hellwig
  2022-11-15  1:30 ` [PATCH 8/9] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
  2022-11-15  1:30 ` [PATCH 9/9] xfs: drop write error injection is unfixable, remove it Dave Chinner
  8 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

A recent multithreaded write data corruption has been uncovered in
the iomap write code. The core of the problem is that partial folio
writes can be flushed to disk while a new racing write can map the
same range and fill the rest of the page:

writeback			new write

allocate blocks
  blocks are unwritten
submit IO
.....
				map blocks
				iomap indicates UNWRITTEN range
				loop {
				  lock folio
				  copyin data
.....
IO completes
  runs unwritten extent conv
    blocks are marked written
				  <iomap now stale>
				  get next folio
				}

Now add memory pressure such that memory reclaim evicts the
partially written folio that has already been written to disk.

When the new write finally gets to the last partial page of the new
write, it does not find it in cache, so it instantiates a new page,
sees the iomap is unwritten, and zeros the part of the page that
it does not have data from. This overwrites the data on disk that
was originally written.

The full description of the corruption mechanism can be found here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

To solve this problem, we need to check whether the iomap is still
valid after we lock each folio during the write. We have to do it
after we lock the page so that we don't end up with state changes
occurring while we wait for the folio to be locked.

Hence we need a mechanism to be able to check that the cached iomap
is still valid (similar to what we already do in buffered
writeback), and we need a way for ->begin_write to back out and
tell the high level iomap iterator that we need to remap the
remaining write range.

The iomap needs to grow some storage for the validity cookie that
the filesystem provides to travel with the iomap. XFS, in
particular, also needs to know some more information about what the
iomap maps (attribute extents rather than file data extents) to for
the validity cookie to cover all the types of iomaps we might need
to validate.
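
A filesystem implementation of the hook can be as simple as comparing
the cookie against its current extent map generation. As an
illustrative sketch only (the "myfs" names are placeholders; the real
XFS implementation follows in the next patch):

	static bool myfs_iomap_valid(struct inode *inode,
			const struct iomap *iomap)
	{
		/* cookie was sampled by ->iomap_begin under the extent map lock */
		return iomap->validity_cookie ==
				READ_ONCE(MYFS_I(inode)->i_extent_seq);
	}

	static const struct iomap_page_ops myfs_iomap_page_ops = {
		.iomap_valid	= myfs_iomap_valid,
	};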

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c | 29 +++++++++++++++++++++++++++-
 fs/iomap/iter.c        | 19 ++++++++++++++++++-
 include/linux/iomap.h  | 43 ++++++++++++++++++++++++++++++++++--------
 3 files changed, 81 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 91ee0b308e13..8354b0fdaa94 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -584,7 +584,7 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
 	return iomap_read_inline_data(iter, folio);
 }
 
-static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
+static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio **foliop)
 {
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
@@ -618,6 +618,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
 		goto out_no_page;
 	}
+
+	/*
+	 * Now we have a locked folio, before we do anything with it we need to
+	 * check that the iomap we have cached is not stale. The inode extent
+	 * mapping can change due to concurrent IO in flight (e.g.
+	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
+	 * reclaimed a previously partially written page at this index after IO
+	 * completion before this write reaches this file offset) and hence we
+	 * could do the wrong thing here (zero a page range incorrectly or fail
+	 * to zero) and corrupt data.
+	 */
+	if (page_ops && page_ops->iomap_valid) {
+		bool iomap_valid = page_ops->iomap_valid(iter->inode,
+							&iter->iomap);
+		if (!iomap_valid) {
+			iter->iomap.flags |= IOMAP_F_STALE;
+			status = 0;
+			goto out_unlock;
+		}
+	}
+
 	if (pos + len > folio_pos(folio) + folio_size(folio))
 		len = folio_pos(folio) + folio_size(folio) - pos;
 
@@ -773,6 +794,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (unlikely(status))
 			break;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		page = folio_file_page(folio, pos >> PAGE_SHIFT);
 		if (mapping_writably_mapped(mapping))
@@ -856,6 +879,8 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (unlikely(status))
 			return status;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		status = iomap_write_end(iter, pos, bytes, bytes, folio);
 		if (WARN_ON_ONCE(status == 0))
@@ -911,6 +936,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (status)
 			return status;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		offset = offset_in_folio(folio, pos);
 		if (bytes > folio_size(folio) - offset)
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index a1c7592d2ade..79a0614eaab7 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -7,12 +7,28 @@
 #include <linux/iomap.h>
 #include "trace.h"
 
+/*
+ * Advance to the next range we need to map.
+ *
+ * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
+ * processed - it was aborted because the extent the iomap spanned may have been
+ * changed during the operation. In this case, the iteration behaviour is to
+ * remap the unprocessed range of the iter, and that means we may need to remap
+ * even when we've made no progress (i.e. iter->processed = 0). Hence the
+ * "finished iterating" case needs to distinguish between
+ * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
+ * need to remap the entire remaining range.
+ */
 static inline int iomap_iter_advance(struct iomap_iter *iter)
 {
+	bool stale = iter->iomap.flags & IOMAP_F_STALE;
+
 	/* handle the previous iteration (if any) */
 	if (iter->iomap.length) {
-		if (iter->processed <= 0)
+		if (iter->processed < 0)
 			return iter->processed;
+		if (!iter->processed && !stale)
+			return 0;
 		if (WARN_ON_ONCE(iter->processed > iomap_length(iter)))
 			return -EIO;
 		iter->pos += iter->processed;
@@ -33,6 +49,7 @@ static inline void iomap_iter_done(struct iomap_iter *iter)
 	WARN_ON_ONCE(iter->iomap.offset > iter->pos);
 	WARN_ON_ONCE(iter->iomap.length == 0);
 	WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
+	WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_STALE);
 
 	trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
 	if (iter->srcmap.type != IOMAP_HOLE)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17..f166d80b68bf 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -49,26 +49,35 @@ struct vm_fault;
  *
  * IOMAP_F_BUFFER_HEAD indicates that the file system requires the use of
  * buffer heads for this mapping.
+ *
+ * IOMAP_F_XATTR indicates that the iomap is for an extended attribute extent
+ * rather than a file data extent.
  */
-#define IOMAP_F_NEW		0x01
-#define IOMAP_F_DIRTY		0x02
-#define IOMAP_F_SHARED		0x04
-#define IOMAP_F_MERGED		0x08
-#define IOMAP_F_BUFFER_HEAD	0x10
-#define IOMAP_F_ZONE_APPEND	0x20
+#define IOMAP_F_NEW		(1U << 0)
+#define IOMAP_F_DIRTY		(1U << 1)
+#define IOMAP_F_SHARED		(1U << 2)
+#define IOMAP_F_MERGED		(1U << 3)
+#define IOMAP_F_BUFFER_HEAD	(1U << 4)
+#define IOMAP_F_ZONE_APPEND	(1U << 5)
+#define IOMAP_F_XATTR		(1U << 6)
 
 /*
  * Flags set by the core iomap code during operations:
  *
  * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
  * has changed as the result of this write operation.
+ *
+ * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
+ * range it covers needs to be remapped by the high level before the operation
+ * can proceed.
  */
-#define IOMAP_F_SIZE_CHANGED	0x100
+#define IOMAP_F_SIZE_CHANGED	(1U << 8)
+#define IOMAP_F_STALE		(1U << 9)
 
 /*
  * Flags from 0x1000 up are for file system specific usage:
  */
-#define IOMAP_F_PRIVATE		0x1000
+#define IOMAP_F_PRIVATE		(1U << 12)
 
 
 /*
@@ -89,6 +98,7 @@ struct iomap {
 	void			*inline_data;
 	void			*private; /* filesystem private */
 	const struct iomap_page_ops *page_ops;
+	u64			validity_cookie; /* used with .iomap_valid() */
 };
 
 static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
@@ -128,6 +138,23 @@ struct iomap_page_ops {
 	int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
 	void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
 			struct page *page);
+
+	/*
+	 * Check that the cached iomap still maps correctly to the filesystem's
+	 * internal extent map. FS internal extent maps can change while iomap
+	 * is iterating a cached iomap, so this hook allows iomap to detect that
+	 * the iomap needs to be refreshed during a long running write
+	 * operation.
+	 *
+	 * The filesystem can store internal state (e.g. a sequence number) in
+	 * iomap->validity_cookie when the iomap is first mapped to be able to
+	 * detect changes between mapping time and whenever .iomap_valid() is
+	 * called.
+	 *
+	 * This is called with the folio over the specified file position held
+	 * locked by the iomap code.
+	 */
+	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
 };
 
 /*
-- 
2.37.2




* [PATCH 8/9] xfs: use iomap_valid method to detect stale cached iomaps
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corruption due to stale cached iomaps Dave Chinner
                   ` (6 preceding siblings ...)
  2022-11-15  1:30 ` [PATCH 7/9] iomap: write iomap validity checks Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  2022-11-15  8:49   ` Christoph Hellwig
  2022-11-15  1:30 ` [PATCH 9/9] xfs: drop write error injection is unfixable, remove it Dave Chinner
  8 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Now that iomap supports a mechanism to validate cached iomaps for
buffered write operations, hook it up to the XFS buffered write ops
so that we can avoid data corruptions that result from stale cached
iomaps. See:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

or the ->iomap_valid() introduction commit for exact details of the
corruption vector.

The validity cookie we store in the iomap is based on the type of
iomap we return. It is expected that the iomap->flags we set in
xfs_bmbt_to_iomap() are not perturbed by the iomap core and are
returned to us in the iomap passed via the .iomap_valid() callback.
This ensures that the validity cookie is always checking the correct
inode fork sequence numbers to detect potential changes that affect
the extent cached by the iomap.
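
That is, the cookie layout implemented below is:

	bits 63..32: COW fork if_seq (only for IOMAP_F_SHARED mappings
		     with a COW fork present)
	bits 31..0 : data fork if_seq
	IOMAP_F_XATTR mappings carry the attr fork if_seq on its own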

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   6 ++-
 fs/xfs/xfs_aops.c        |   2 +-
 fs/xfs/xfs_iomap.c       | 105 +++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_iomap.h       |   5 +-
 fs/xfs/xfs_pnfs.c        |   6 ++-
 5 files changed, 97 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 49d0d4ea63fc..56b9b7db38bb 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4551,7 +4551,8 @@ xfs_bmapi_convert_delalloc(
 	 * the extent.  Just return the real extent at this offset.
 	 */
 	if (!isnullstartblock(bma.got.br_startblock)) {
-		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
+		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags,
+				xfs_iomap_inode_sequence(ip, flags));
 		*seq = READ_ONCE(ifp->if_seq);
 		goto out_trans_cancel;
 	}
@@ -4599,7 +4600,8 @@ xfs_bmapi_convert_delalloc(
 	XFS_STATS_INC(mp, xs_xstrat_quick);
 
 	ASSERT(!isnullstartblock(bma.got.br_startblock));
-	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
+	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags,
+				xfs_iomap_inode_sequence(ip, flags));
 	*seq = READ_ONCE(ifp->if_seq);
 
 	if (whichfork == XFS_COW_FORK)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6aadc5815068..a22d90af40c8 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -372,7 +372,7 @@ xfs_map_blocks(
 	    isnullstartblock(imap.br_startblock))
 		goto allocate_blocks;
 
-	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0);
+	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0, XFS_WPC(wpc)->data_seq);
 	trace_xfs_map_blocks_found(ip, offset, count, whichfork, &imap);
 	return 0;
 allocate_blocks:
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 04da22943e7c..fa578c913323 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -48,13 +48,54 @@ xfs_alert_fsblock_zero(
 	return -EFSCORRUPTED;
 }
 
+u64
+xfs_iomap_inode_sequence(
+	struct xfs_inode	*ip,
+	u16			iomap_flags)
+{
+	u64			cookie = 0;
+
+	if (iomap_flags & IOMAP_F_XATTR)
+		return READ_ONCE(ip->i_af.if_seq);
+	if ((iomap_flags & IOMAP_F_SHARED) && ip->i_cowfp)
+		cookie = (u64)READ_ONCE(ip->i_cowfp->if_seq) << 32;
+	return cookie | READ_ONCE(ip->i_df.if_seq);
+}
+
+/*
+ * Check that the iomap passed to us is still valid for the given offset and
+ * length.
+ */
+static bool
+xfs_iomap_valid(
+	struct inode		*inode,
+	const struct iomap	*iomap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	u64			cookie = 0;
+
+	if (iomap->flags & IOMAP_F_XATTR) {
+		cookie = READ_ONCE(ip->i_af.if_seq);
+	} else {
+		if ((iomap->flags & IOMAP_F_SHARED) && ip->i_cowfp)
+			cookie = (u64)READ_ONCE(ip->i_cowfp->if_seq) << 32;
+		cookie |= READ_ONCE(ip->i_df.if_seq);
+	}
+	return cookie == iomap->validity_cookie;
+}
+
+const struct iomap_page_ops xfs_iomap_page_ops = {
+	.iomap_valid		= xfs_iomap_valid,
+};
+
 int
 xfs_bmbt_to_iomap(
 	struct xfs_inode	*ip,
 	struct iomap		*iomap,
 	struct xfs_bmbt_irec	*imap,
 	unsigned int		mapping_flags,
-	u16			iomap_flags)
+	u16			iomap_flags,
+	u64			sequence_cookie)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
@@ -91,6 +132,9 @@ xfs_bmbt_to_iomap(
 	if (xfs_ipincount(ip) &&
 	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
 		iomap->flags |= IOMAP_F_DIRTY;
+
+	iomap->validity_cookie = sequence_cookie;
+	iomap->page_ops = &xfs_iomap_page_ops;
 	return 0;
 }
 
@@ -195,7 +239,8 @@ xfs_iomap_write_direct(
 	xfs_fileoff_t		offset_fsb,
 	xfs_fileoff_t		count_fsb,
 	unsigned int		flags,
-	struct xfs_bmbt_irec	*imap)
+	struct xfs_bmbt_irec	*imap,
+	u64			*seq)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -285,6 +330,8 @@ xfs_iomap_write_direct(
 		error = xfs_alert_fsblock_zero(ip, imap);
 
 out_unlock:
+	if (seq)
+		*seq = xfs_iomap_inode_sequence(ip, 0);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 
@@ -743,6 +790,7 @@ xfs_direct_write_iomap_begin(
 	bool			shared = false;
 	u16			iomap_flags = 0;
 	unsigned int		lockmode = XFS_ILOCK_SHARED;
+	u64			seq;
 
 	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
 
@@ -811,9 +859,10 @@ xfs_direct_write_iomap_begin(
 			goto out_unlock;
 	}
 
+	seq = xfs_iomap_inode_sequence(ip, iomap_flags);
 	xfs_iunlock(ip, lockmode);
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
 
 allocate_blocks:
 	error = -EAGAIN;
@@ -839,24 +888,26 @@ xfs_direct_write_iomap_begin(
 	xfs_iunlock(ip, lockmode);
 
 	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
-			flags, &imap);
+			flags, &imap, &seq);
 	if (error)
 		return error;
 
 	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
-				 iomap_flags | IOMAP_F_NEW);
+				 iomap_flags | IOMAP_F_NEW, seq);
 
 out_found_cow:
-	xfs_iunlock(ip, lockmode);
 	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
 	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
 	if (imap.br_startblock != HOLESTARTBLOCK) {
-		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
+		seq = xfs_iomap_inode_sequence(ip, 0);
+		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
 		if (error)
-			return error;
+			goto out_unlock;
 	}
-	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED);
+	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
+	xfs_iunlock(ip, lockmode);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
 
 out_unlock:
 	if (lockmode)
@@ -915,6 +966,7 @@ xfs_buffered_write_iomap_begin(
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
 	unsigned int		lockmode = XFS_ILOCK_EXCL;
+	u64			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -1094,26 +1146,31 @@ xfs_buffered_write_iomap_begin(
 	 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
 	 * them out if the write happens to fail.
 	 */
+	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
 
 found_imap:
+	seq = xfs_iomap_inode_sequence(ip, 0);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 
 found_cow:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	seq = xfs_iomap_inode_sequence(ip, 0);
 	if (imap.br_startoff <= offset_fsb) {
-		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
+		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
 		if (error)
-			return error;
+			goto out_unlock;
+		seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
-					 IOMAP_F_SHARED);
+					 IOMAP_F_SHARED, seq);
 	}
 
 	xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
-	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
 
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1346,6 +1403,7 @@ xfs_read_iomap_begin(
 	int			nimaps = 1, error = 0;
 	bool			shared = false;
 	unsigned int		lockmode = XFS_ILOCK_SHARED;
+	u64			seq;
 
 	ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));
 
@@ -1359,13 +1417,14 @@ xfs_read_iomap_begin(
 			       &nimaps, 0);
 	if (!error && (flags & IOMAP_REPORT))
 		error = xfs_reflink_trim_around_shared(ip, &imap, &shared);
+	seq = xfs_iomap_inode_sequence(ip, shared ? IOMAP_F_SHARED : 0);
 	xfs_iunlock(ip, lockmode);
 
 	if (error)
 		return error;
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
-				 shared ? IOMAP_F_SHARED : 0);
+				 shared ? IOMAP_F_SHARED : 0, seq);
 }
 
 const struct iomap_ops xfs_read_iomap_ops = {
@@ -1390,6 +1449,7 @@ xfs_seek_iomap_begin(
 	struct xfs_bmbt_irec	imap, cmap;
 	int			error = 0;
 	unsigned		lockmode;
+	u64			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -1424,8 +1484,9 @@ xfs_seek_iomap_begin(
 		if (data_fsb < cow_fsb + cmap.br_blockcount)
 			end_fsb = min(end_fsb, data_fsb);
 		xfs_trim_extent(&cmap, offset_fsb, end_fsb);
+		seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
 		error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
-					  IOMAP_F_SHARED);
+				IOMAP_F_SHARED, seq);
 		/*
 		 * This is a COW extent, so we must probe the page cache
 		 * because there could be dirty page cache being backed
@@ -1446,8 +1507,9 @@ xfs_seek_iomap_begin(
 	imap.br_startblock = HOLESTARTBLOCK;
 	imap.br_state = XFS_EXT_NORM;
 done:
+	seq = xfs_iomap_inode_sequence(ip, 0);
 	xfs_trim_extent(&imap, offset_fsb, end_fsb);
-	error = xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 out_unlock:
 	xfs_iunlock(ip, lockmode);
 	return error;
@@ -1473,6 +1535,7 @@ xfs_xattr_iomap_begin(
 	struct xfs_bmbt_irec	imap;
 	int			nimaps = 1, error = 0;
 	unsigned		lockmode;
+	int			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -1489,12 +1552,14 @@ xfs_xattr_iomap_begin(
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
 			       &nimaps, XFS_BMAPI_ATTRFORK);
 out_unlock:
+
+	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_XATTR);
 	xfs_iunlock(ip, lockmode);
 
 	if (error)
 		return error;
 	ASSERT(nimaps);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_XATTR, seq);
 }
 
 const struct iomap_ops xfs_xattr_iomap_ops = {
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 0f62ab633040..4da13440bae9 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -13,14 +13,15 @@ struct xfs_bmbt_irec;
 
 int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb,
 		xfs_fileoff_t count_fsb, unsigned int flags,
-		struct xfs_bmbt_irec *imap);
+		struct xfs_bmbt_irec *imap, u64 *sequence);
 int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t, bool);
 xfs_fileoff_t xfs_iomap_eof_align_last_fsb(struct xfs_inode *ip,
 		xfs_fileoff_t end_fsb);
 
+u64 xfs_iomap_inode_sequence(struct xfs_inode *ip, u16 iomap_flags);
 int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
 		struct xfs_bmbt_irec *imap, unsigned int mapping_flags,
-		u16 iomap_flags);
+		u16 iomap_flags, u64 sequence_cookie);
 
 int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
 		bool *did_zero);
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 37a24f0f7cd4..38d23f0e703a 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -125,6 +125,7 @@ xfs_fs_map_blocks(
 	int			nimaps = 1;
 	uint			lock_flags;
 	int			error = 0;
+	u64			seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -176,6 +177,7 @@ xfs_fs_map_blocks(
 	lock_flags = xfs_ilock_data_map_shared(ip);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
 				&imap, &nimaps, bmapi_flags);
+	seq = xfs_iomap_inode_sequence(ip, 0);
 
 	ASSERT(!nimaps || imap.br_startblock != DELAYSTARTBLOCK);
 
@@ -189,7 +191,7 @@ xfs_fs_map_blocks(
 		xfs_iunlock(ip, lock_flags);
 
 		error = xfs_iomap_write_direct(ip, offset_fsb,
-				end_fsb - offset_fsb, 0, &imap);
+				end_fsb - offset_fsb, 0, &imap, &seq);
 		if (error)
 			goto out_unlock;
 
@@ -209,7 +211,7 @@ xfs_fs_map_blocks(
 	}
 	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 
-	error = xfs_bmbt_to_iomap(ip, iomap, &imap, 0, 0);
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, 0, 0, seq);
 	*device_generation = mp->m_generation;
 	return error;
 out_unlock:
-- 
2.37.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 9/9] xfs: drop write error injection is unfixable, remove it
  2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corrupton due to stale cached iomaps Dave Chinner
                   ` (7 preceding siblings ...)
  2022-11-15  1:30 ` [PATCH 8/9] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
@ 2022-11-15  1:30 ` Dave Chinner
  8 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-15  1:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

With the changes to scan the page cache for dirty data to avoid data
corruptions from partial write cleanup racing with other page cache
operations, the drop writes error injection no longer works the same
way it used to and causes xfs/196 to fail. This is because xfs/196
writes to the file and populates the page cache before it turns on
the error injection and starts failing -overwrites-.

The result is that the original drop-writes code failed writes only
-after- overwriting the data in the cache, then invalidating the
cached data, then punching out the delalloc extent from under that
data.

On the surface, this looks fine. The problem is that page cache
invalidation *doesn't guarantee that it removes anything from the
page cache* and it doesn't change the dirty state of the folio. When
block size == page size and we do page aligned IO (as xfs/196 does)
everything happens to align perfectly and page cache invalidation
removes the single page folios that span the written data. Hence the
followup delalloc punch pass does not find cached data over that
range and it can punch the extent out.

IOWs, xfs/196 "works" for block size == page size with the new
code. I say "works", because it actually only works for the case
where IO is page aligned, and no data was read from disk before
writes occur. Because the moment we actually read data first, the
readahead code allocates multipage folios and suddenly the
invalidate code goes back to zeroing subfolio ranges without
changing dirty state.

Hence, with multipage folios in play, block size == page size is
functionally identical to block size < page size behaviour, and
drop-writes is manifestly broken w.r.t. this case. Invalidation of
a subfolio range doesn't result in the folio being removed from the
cache, just the range gets zeroed. Hence after we've sequentially
walked over a folio that we've dirtied (via write data) and then
invalidated, we end up with a dirty folio full of zeroed data.

And because the new code skips punching ranges that have dirty
folios covering them, we end up leaving the delalloc range intact
after failing all the writes. Hence failed writes now end up
writing zeroes to disk in the cases where invalidation zeroes folios
rather than removing them from cache.

This is a fundamental change of behaviour that is needed to avoid
the data corruption vectors that exist in the old write fail path,
and it renders the drop-writes injection non-functional and
unworkable as it stands.

As it is, I think the error injection is also now unnecessary, as
partial writes that need delalloc punching are going to be a lot more
common with stale iomap detection in place. Hence this patch removes
the drop-writes error injection completely. xfs/196 can remain for
testing kernels that don't have this data corruption fix, but those
that do will report:

xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_errortag.h | 12 +++++-------
 fs/xfs/xfs_error.c           | 27 ++++++++++++++++++++-------
 fs/xfs/xfs_iomap.c           |  9 ---------
 3 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 5362908164b0..580ccbd5aadc 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -40,13 +40,12 @@
 #define XFS_ERRTAG_REFCOUNT_FINISH_ONE			25
 #define XFS_ERRTAG_BMAP_FINISH_ONE			26
 #define XFS_ERRTAG_AG_RESV_CRITICAL			27
+
 /*
- * DEBUG mode instrumentation to test and/or trigger delayed allocation
- * block killing in the event of failed writes. When enabled, all
- * buffered writes are silenty dropped and handled as if they failed.
- * All delalloc blocks in the range of the write (including pre-existing
- * delalloc blocks!) are tossed as part of the write failure error
- * handling sequence.
+ * Drop-writes support removed because write error handling cannot trash
+ * pre-existing delalloc extents in any useful way anymore. We retain the
+ * definition so that we can reject it as an invalid value in
+ * xfs_errortag_valid().
  */
 #define XFS_ERRTAG_DROP_WRITES				28
 #define XFS_ERRTAG_LOG_BAD_CRC				29
@@ -95,7 +94,6 @@
 #define XFS_RANDOM_REFCOUNT_FINISH_ONE			1
 #define XFS_RANDOM_BMAP_FINISH_ONE			1
 #define XFS_RANDOM_AG_RESV_CRITICAL			4
-#define XFS_RANDOM_DROP_WRITES				1
 #define XFS_RANDOM_LOG_BAD_CRC				1
 #define XFS_RANDOM_LOG_ITEM_PIN				1
 #define XFS_RANDOM_BUF_LRU_REF				2
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index c6b2aabd6f18..dea3c0649d2f 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -46,7 +46,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_REFCOUNT_FINISH_ONE,
 	XFS_RANDOM_BMAP_FINISH_ONE,
 	XFS_RANDOM_AG_RESV_CRITICAL,
-	XFS_RANDOM_DROP_WRITES,
+	0, /* XFS_RANDOM_DROP_WRITES has been removed */
 	XFS_RANDOM_LOG_BAD_CRC,
 	XFS_RANDOM_LOG_ITEM_PIN,
 	XFS_RANDOM_BUF_LRU_REF,
@@ -162,7 +162,6 @@ XFS_ERRORTAG_ATTR_RW(refcount_continue_update,	XFS_ERRTAG_REFCOUNT_CONTINUE_UPDA
 XFS_ERRORTAG_ATTR_RW(refcount_finish_one,	XFS_ERRTAG_REFCOUNT_FINISH_ONE);
 XFS_ERRORTAG_ATTR_RW(bmap_finish_one,	XFS_ERRTAG_BMAP_FINISH_ONE);
 XFS_ERRORTAG_ATTR_RW(ag_resv_critical,	XFS_ERRTAG_AG_RESV_CRITICAL);
-XFS_ERRORTAG_ATTR_RW(drop_writes,	XFS_ERRTAG_DROP_WRITES);
 XFS_ERRORTAG_ATTR_RW(log_bad_crc,	XFS_ERRTAG_LOG_BAD_CRC);
 XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
 XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
@@ -206,7 +205,6 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(refcount_finish_one),
 	XFS_ERRORTAG_ATTR_LIST(bmap_finish_one),
 	XFS_ERRORTAG_ATTR_LIST(ag_resv_critical),
-	XFS_ERRORTAG_ATTR_LIST(drop_writes),
 	XFS_ERRORTAG_ATTR_LIST(log_bad_crc),
 	XFS_ERRORTAG_ATTR_LIST(log_item_pin),
 	XFS_ERRORTAG_ATTR_LIST(buf_lru_ref),
@@ -256,6 +254,19 @@ xfs_errortag_del(
 	kmem_free(mp->m_errortag);
 }
 
+static bool
+xfs_errortag_valid(
+	unsigned int		error_tag)
+{
+	if (error_tag >= XFS_ERRTAG_MAX)
+		return false;
+
+	/* Error out removed injection types */
+	if (error_tag == XFS_ERRTAG_DROP_WRITES)
+		return false;
+	return true;
+}
+
 bool
 xfs_errortag_test(
 	struct xfs_mount	*mp,
@@ -277,7 +288,9 @@ xfs_errortag_test(
 	if (!mp->m_errortag)
 		return false;
 
-	ASSERT(error_tag < XFS_ERRTAG_MAX);
+	if (!xfs_errortag_valid(error_tag))
+		return false;
+
 	randfactor = mp->m_errortag[error_tag];
 	if (!randfactor || prandom_u32_max(randfactor))
 		return false;
@@ -293,7 +306,7 @@ xfs_errortag_get(
 	struct xfs_mount	*mp,
 	unsigned int		error_tag)
 {
-	if (error_tag >= XFS_ERRTAG_MAX)
+	if (!xfs_errortag_valid(error_tag))
 		return -EINVAL;
 
 	return mp->m_errortag[error_tag];
@@ -305,7 +318,7 @@ xfs_errortag_set(
 	unsigned int		error_tag,
 	unsigned int		tag_value)
 {
-	if (error_tag >= XFS_ERRTAG_MAX)
+	if (!xfs_errortag_valid(error_tag))
 		return -EINVAL;
 
 	mp->m_errortag[error_tag] = tag_value;
@@ -319,7 +332,7 @@ xfs_errortag_add(
 {
 	BUILD_BUG_ON(ARRAY_SIZE(xfs_errortag_random_default) != XFS_ERRTAG_MAX);
 
-	if (error_tag >= XFS_ERRTAG_MAX)
+	if (!xfs_errortag_valid(error_tag))
 		return -EINVAL;
 
 	return xfs_errortag_set(mp, error_tag,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index fa578c913323..bf1661410272 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1335,15 +1335,6 @@ xfs_buffered_write_iomap_end(
 	if (iomap->type != IOMAP_DELALLOC)
 		return 0;
 
-	/*
-	 * Behave as if the write failed if drop writes is enabled. Set the NEW
-	 * flag to force delalloc cleanup.
-	 */
-	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_DROP_WRITES)) {
-		iomap->flags |= IOMAP_F_NEW;
-		written = 0;
-	}
-
 	/* If we didn't reserve the blocks, we're not allowed to punch them. */
 	if (!(iomap->flags & IOMAP_F_NEW))
 		return 0;
-- 
2.37.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/9] mm: export mapping_seek_hole_data()
  2022-11-15  1:30 ` [PATCH 1/9] mm: export mapping_seek_hole_data() Dave Chinner
@ 2022-11-15  8:40   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:35PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> XFS needs this for finding cached dirty data regions when cleaning
> up short writes, so it needs to be exported as XFS can be built as a
> module.

NAK.  Even if you use these helpers to make your life a little easier,
the main delalloc punch logic belongs into iomap and not XFS, and does
not need the exports.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] xfs: punching delalloc extents on write failure is racy
  2022-11-15  1:30 ` [PATCH 3/9] xfs: punching delalloc extents on write failure is racy Dave Chinner
@ 2022-11-15  8:41   ` Christoph Hellwig
  2022-11-15 23:53   ` Darrick J. Wong
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_buffered_write_iomap_end() has a comment about the safety of
> punching delalloc extents based holding the IOLOCK_EXCL. This
> comment is wrong, and punching delalloc extents is not race free.
> 
> When we punch out a delalloc extent after a write failure in
> xfs_buffered_write_iomap_end(), we punch out the page cache with
> truncate_pagecache_range() before we punch out the delalloc extents.
> At this point, we only hold the IOLOCK_EXCL, so there is nothing
> stopping mmap() write faults racing with this cleanup operation,
> reinstantiating a folio over the range we are about to punch and
> hence requiring the delalloc extent to be kept.
> 
> If this race condition is hit, we can end up with a dirty page in
> the page cache that has no delalloc extent or space reservation
> backing it. This leads to bad things happening at writeback time.
> 
> To avoid this race condition, we need the page cache truncation to
> be atomic w.r.t. the extent manipulation. We can do this by holding
> the mapping->invalidate_lock exclusively across this operation -
> this will prevent new pages from being inserted into the page cache
> whilst we are removing the pages and the backing extent and space
> reservation.
> 
> Taking the mapping->invalidate_lock exclusively in the buffered
> write IO path is safe - it naturally nests inside the IOLOCK (see
> truncate and fallocate paths). iomap_zero_range() can be called from
> under the mapping->invalidate_lock (from the truncate path via
> either xfs_zero_eof() or xfs_truncate_page(), but iomap_zero_iter()
> will not instantiate new delalloc pages (because it skips holes) and
> hence will not ever need to punch out delalloc extents on failure.
> 
> Fix the locking issue, and clean up the code logic a little to avoid
> unnecessary work if we didn't allocate the delalloc extent or wrote
> the entire region we allocated.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

> +	filemap_invalidate_lock(inode->i_mapping);
> +	truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
> +				 XFS_FSB_TO_B(mp, end_fsb) - 1);

No need to use VFS_I here, the inode is passed as a funtion argument.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/9] xfs: use byte ranges for write cleanup ranges
  2022-11-15  1:30 ` [PATCH 4/9] xfs: use byte ranges for write cleanup ranges Dave Chinner
@ 2022-11-15  8:42   ` Christoph Hellwig
  2022-11-15 23:57   ` Darrick J. Wong
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:38PM +1100, Dave Chinner wrote:
> +static int
> +xfs_buffered_write_delalloc_punch(
> +	struct inode		*inode,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	struct xfs_mount	*mp = XFS_M(inode->i_sb);
> +	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, start_byte);
> +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end_byte);
> +
> +	return xfs_bmap_punch_delalloc_range(XFS_I(inode), start_fsb,
> +				end_fsb - start_fsb);
> +}

What is the reason to not switch xfs_bmap_punch_delalloc_range to
just take the byte range as discussed last round?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-15  1:30 ` [PATCH 5/9] xfs: buffered write failure should not truncate the page cache Dave Chinner
@ 2022-11-15  8:43   ` Christoph Hellwig
  2022-11-16  0:48   ` Darrick J. Wong
  2022-11-16 13:57   ` Brian Foster
  2 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

> +static int
> +xfs_buffered_write_delalloc_scan(

As pointed out last time, this is generic iomap (or in fact filemap.c,
but maybe start small) logic, so move it there and just pass
xfs_buffered_write_delalloc_punch as a callback.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range
  2022-11-15  1:30 ` [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range Dave Chinner
@ 2022-11-15  8:44   ` Christoph Hellwig
  2022-11-15 23:48     ` Darrick J. Wong
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:40PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> All the callers of xfs_bmap_punch_delalloc_range() jump through
> hoops to convert a byte range to filesystem blocks before calling
> xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
> xfs_bmap_punch_delalloc_range() and have it do the conversion to
> filesystem blocks internally.

Ok, so we do this here.   Why can't we just do this earlier and
avoid the strange inbetween stage with a wrapper?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] iomap: write iomap validity checks
  2022-11-15  1:30 ` [PATCH 7/9] iomap: write iomap validity checks Dave Chinner
@ 2022-11-15  8:45   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

I'm still not a fan of the indirect call, but we can revisit
this later if needed:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/9] xfs: use iomap_valid method to detect stale cached iomaps
  2022-11-15  1:30 ` [PATCH 8/9] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
@ 2022-11-15  8:49   ` Christoph Hellwig
  2022-11-15 23:26     ` Darrick J. Wong
  2022-11-16  0:10     ` Dave Chinner
  0 siblings, 2 replies; 32+ messages in thread
From: Christoph Hellwig @ 2022-11-15  8:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:42PM +1100, Dave Chinner wrote:
> +/*
> + * Check that the iomap passed to us is still valid for the given offset and
> + * length.
> + */
> +static bool
> +xfs_iomap_valid(
> +	struct inode		*inode,
> +	const struct iomap	*iomap)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	u64			cookie = 0;
> +
> +	if (iomap->flags & IOMAP_F_XATTR) {
> +		cookie = READ_ONCE(ip->i_af.if_seq);
> +	} else {
> +		if ((iomap->flags & IOMAP_F_SHARED) && ip->i_cowfp)
> +			cookie = (u64)READ_ONCE(ip->i_cowfp->if_seq) << 32;
> +		cookie |= READ_ONCE(ip->i_df.if_seq);
> +	}
> +	return cookie == iomap->validity_cookie;

How can this be called with IOMAP_F_XATTR set?

Also the code seems to duplicate xfs_iomap_inode_sequence, so
we just call that:

	return cookie == xfs_iomap_inode_sequence(XFS_I(inode), iomap->flags);
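
Expanded into a minimal sketch (assuming xfs_iomap_inode_sequence()
remains visible in xfs_iomap.c, as the patch above arranges), that
simplification would be:

/*
 * Sketch only: reuse the sequence cookie encoding helper rather than
 * open coding the per-fork cookie layout in the validation check.
 */
static bool
xfs_iomap_valid(
	struct inode		*inode,
	const struct iomap	*iomap)
{
	return iomap->validity_cookie ==
			xfs_iomap_inode_sequence(XFS_I(inode), iomap->flags);
}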



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/9] xfs: use iomap_valid method to detect stale cached iomaps
  2022-11-15  8:49   ` Christoph Hellwig
@ 2022-11-15 23:26     ` Darrick J. Wong
  2022-11-16  0:10     ` Dave Chinner
  1 sibling, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2022-11-15 23:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:49:49AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 15, 2022 at 12:30:42PM +1100, Dave Chinner wrote:
> > +/*
> > + * Check that the iomap passed to us is still valid for the given offset and
> > + * length.
> > + */
> > +static bool
> > +xfs_iomap_valid(
> > +	struct inode		*inode,
> > +	const struct iomap	*iomap)
> > +{
> > +	struct xfs_inode	*ip = XFS_I(inode);
> > +	u64			cookie = 0;
> > +
> > +	if (iomap->flags & IOMAP_F_XATTR) {
> > +		cookie = READ_ONCE(ip->i_af.if_seq);
> > +	} else {
> > +		if ((iomap->flags & IOMAP_F_SHARED) && ip->i_cowfp)
> > +			cookie = (u64)READ_ONCE(ip->i_cowfp->if_seq) << 32;
> > +		cookie |= READ_ONCE(ip->i_df.if_seq);
> > +	}
> > +	return cookie == iomap->validity_cookie;
> 
> How can this be called with IOMAP_F_XATTR set?

xfs_xattr_iomap_begin for FIEMAP, if I'm not mistaken.

> Also the code seems to duplicate xfs_iomap_inode_sequence, so
> we just call that:
> 
> 	return cookie == xfs_iomap_inode_sequence(XFS_I(inode), iomap->flags);

I was thinking that too.

--D


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range
  2022-11-15  8:44   ` Christoph Hellwig
@ 2022-11-15 23:48     ` Darrick J. Wong
  2022-11-16  0:57       ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2022-11-15 23:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:44:46AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 15, 2022 at 12:30:40PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > All the callers of xfs_bmap_punch_delalloc_range() jump through
> > hoops to convert a byte range to filesystem blocks before calling
> > xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
> > xfs_bmap_punch_delalloc_range() and have it do the conversion to
> > filesystem blocks internally.
> 
> Ok, so we do this here.   Why can't we just do this earlier and
> avoid the strange inbetween stage with a wrapper?

Probably to avoid rewriting the previous patch with this units change,
is my guess.  Dave, do you want to merge with this patch 4?  I'm
satisfied enough with patches 4 and 6 that I'd rather get this out to
for-next for further testing than play more patch golf.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] xfs: punching delalloc extents on write failure is racy
  2022-11-15  1:30 ` [PATCH 3/9] xfs: punching delalloc extents on write failure is racy Dave Chinner
  2022-11-15  8:41   ` Christoph Hellwig
@ 2022-11-15 23:53   ` Darrick J. Wong
  1 sibling, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2022-11-15 23:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_buffered_write_iomap_end() has a comment about the safety of
> punching delalloc extents based holding the IOLOCK_EXCL. This
> comment is wrong, and punching delalloc extents is not race free.
> 
> When we punch out a delalloc extent after a write failure in
> xfs_buffered_write_iomap_end(), we punch out the page cache with
> truncate_pagecache_range() before we punch out the delalloc extents.
> At this point, we only hold the IOLOCK_EXCL, so there is nothing
> stopping mmap() write faults racing with this cleanup operation,
> reinstantiating a folio over the range we are about to punch and
> hence requiring the delalloc extent to be kept.
> 
> If this race condition is hit, we can end up with a dirty page in
> the page cache that has no delalloc extent or space reservation
> backing it. This leads to bad things happening at writeback time.
> 
> To avoid this race condition, we need the page cache truncation to
> be atomic w.r.t. the extent manipulation. We can do this by holding
> the mapping->invalidate_lock exclusively across this operation -
> this will prevent new pages from being inserted into the page cache
> whilst we are removing the pages and the backing extent and space
> reservation.
> 
> Taking the mapping->invalidate_lock exclusively in the buffered
> write IO path is safe - it naturally nests inside the IOLOCK (see
> truncate and fallocate paths). iomap_zero_range() can be called from
> under the mapping->invalidate_lock (from the truncate path via
> either xfs_zero_eof() or xfs_truncate_page(), but iomap_zero_iter()
> will not instantiate new delalloc pages (because it skips holes) and
> hence will not ever need to punch out delalloc extents on failure.
> 
> Fix the locking issue, and clean up the code logic a little to avoid
> unnecessary work if we didn't allocate the delalloc extent or wrote
> the entire region we allocated.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Didn't I already tag this...?

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++------------------
>  1 file changed, 23 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 5cea069a38b4..a2e45ea1b0cb 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1147,6 +1147,10 @@ xfs_buffered_write_iomap_end(
>  		written = 0;
>  	}
>  
> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> +	if (!(iomap->flags & IOMAP_F_NEW))
> +		return 0;
> +
>  	/*
>  	 * start_fsb refers to the first unused block after a short write. If
>  	 * nothing was written, round offset down to point at the first block in
> @@ -1158,27 +1162,28 @@ xfs_buffered_write_iomap_end(
>  		start_fsb = XFS_B_TO_FSB(mp, offset + written);
>  	end_fsb = XFS_B_TO_FSB(mp, offset + length);
>  
> +	/* Nothing to do if we've written the entire delalloc extent */
> +	if (start_fsb >= end_fsb)
> +		return 0;
> +
>  	/*
> -	 * Trim delalloc blocks if they were allocated by this write and we
> -	 * didn't manage to write the whole range.
> -	 *
> -	 * We don't need to care about racing delalloc as we hold i_mutex
> -	 * across the reserve/allocate/unreserve calls. If there are delalloc
> -	 * blocks in the range, they are ours.
> +	 * Lock the mapping to avoid races with page faults re-instantiating
> +	 * folios and dirtying them via ->page_mkwrite between the page cache
> +	 * truncation and the delalloc extent removal. Failing to do this can
> +	 * leave dirty pages with no space reservation in the cache.
>  	 */
> -	if ((iomap->flags & IOMAP_F_NEW) && start_fsb < end_fsb) {
> -		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
> -					 XFS_FSB_TO_B(mp, end_fsb) - 1);
> -
> -		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> -					       end_fsb - start_fsb);
> -		if (error && !xfs_is_shutdown(mp)) {
> -			xfs_alert(mp, "%s: unable to clean up ino %lld",
> -				__func__, ip->i_ino);
> -			return error;
> -		}
> +	filemap_invalidate_lock(inode->i_mapping);
> +	truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
> +				 XFS_FSB_TO_B(mp, end_fsb) - 1);
> +
> +	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> +				       end_fsb - start_fsb);
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	if (error && !xfs_is_shutdown(mp)) {
> +		xfs_alert(mp, "%s: unable to clean up ino %lld",
> +			__func__, ip->i_ino);
> +		return error;
>  	}
> -
>  	return 0;
>  }
>  
> -- 
> 2.37.2
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/9] xfs: use byte ranges for write cleanup ranges
  2022-11-15  1:30 ` [PATCH 4/9] xfs: use byte ranges for write cleanup ranges Dave Chinner
  2022-11-15  8:42   ` Christoph Hellwig
@ 2022-11-15 23:57   ` Darrick J. Wong
  1 sibling, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2022-11-15 23:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:38PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_buffered_write_iomap_end() currently converts the byte ranges
> passed to it to filesystem blocks to pass them to the bmap code to
> punch out delalloc blocks, but then has to convert filesytem
> blocks back to byte ranges for page cache truncate.
> 
> We're about to make the page cache truncate go away and replace it
> with a page cache walk, so having to convert everything to/from/to
> filesystem blocks is messy and error-prone. It is much easier to
> pass around byte ranges and convert to page indexes and/or
> filesystem blocks only where those units are needed.
> 
> In preparation for the page cache walk being added, add a helper
> that converts byte ranges to filesystem blocks and calls
> xfs_bmap_punch_delalloc_range() and convert
> xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_iomap.c | 40 +++++++++++++++++++++++++---------------
>  1 file changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index a2e45ea1b0cb..7bb55dbc19d3 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1120,6 +1120,20 @@ xfs_buffered_write_iomap_begin(
>  	return error;
>  }
>  
> +static int
> +xfs_buffered_write_delalloc_punch(
> +	struct inode		*inode,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	struct xfs_mount	*mp = XFS_M(inode->i_sb);
> +	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, start_byte);
> +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end_byte);
> +
> +	return xfs_bmap_punch_delalloc_range(XFS_I(inode), start_fsb,
> +				end_fsb - start_fsb);
> +}
> +
>  static int
>  xfs_buffered_write_iomap_end(
>  	struct inode		*inode,
> @@ -1129,10 +1143,9 @@ xfs_buffered_write_iomap_end(
>  	unsigned		flags,
>  	struct iomap		*iomap)
>  {
> -	struct xfs_inode	*ip = XFS_I(inode);
> -	struct xfs_mount	*mp = ip->i_mount;
> -	xfs_fileoff_t		start_fsb;
> -	xfs_fileoff_t		end_fsb;
> +	struct xfs_mount	*mp = XFS_M(inode->i_sb);
> +	loff_t			start_byte;
> +	loff_t			end_byte;
>  	int			error = 0;
>  
>  	if (iomap->type != IOMAP_DELALLOC)
> @@ -1157,13 +1170,13 @@ xfs_buffered_write_iomap_end(
>  	 * the range.
>  	 */
>  	if (unlikely(!written))
> -		start_fsb = XFS_B_TO_FSBT(mp, offset);
> +		start_byte = round_down(offset, mp->m_sb.sb_blocksize);
>  	else
> -		start_fsb = XFS_B_TO_FSB(mp, offset + written);
> -	end_fsb = XFS_B_TO_FSB(mp, offset + length);
> +		start_byte = round_up(offset + written, mp->m_sb.sb_blocksize);
> +	end_byte = round_up(offset + length, mp->m_sb.sb_blocksize);
>  
>  	/* Nothing to do if we've written the entire delalloc extent */
> -	if (start_fsb >= end_fsb)
> +	if (start_byte >= end_byte)
>  		return 0;
>  
>  	/*
> @@ -1173,15 +1186,12 @@ xfs_buffered_write_iomap_end(
>  	 * leave dirty pages with no space reservation in the cache.
>  	 */
>  	filemap_invalidate_lock(inode->i_mapping);
> -	truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
> -				 XFS_FSB_TO_B(mp, end_fsb) - 1);
> -
> -	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> -				       end_fsb - start_fsb);
> +	truncate_pagecache_range(inode, start_byte, end_byte - 1);
> +	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
>  	filemap_invalidate_unlock(inode->i_mapping);
>  	if (error && !xfs_is_shutdown(mp)) {
> -		xfs_alert(mp, "%s: unable to clean up ino %lld",
> -			__func__, ip->i_ino);
> +		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
> +			__func__, XFS_I(inode)->i_ino);
>  		return error;
>  	}
>  	return 0;
> -- 
> 2.37.2
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/9] xfs: use iomap_valid method to detect stale cached iomaps
  2022-11-15  8:49   ` Christoph Hellwig
  2022-11-15 23:26     ` Darrick J. Wong
@ 2022-11-16  0:10     ` Dave Chinner
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-16  0:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:49:49AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 15, 2022 at 12:30:42PM +1100, Dave Chinner wrote:
> > +/*
> > + * Check that the iomap passed to us is still valid for the given offset and
> > + * length.
> > + */
> > +static bool
> > +xfs_iomap_valid(
> > +	struct inode		*inode,
> > +	const struct iomap	*iomap)
> > +{
> > +	struct xfs_inode	*ip = XFS_I(inode);
> > +	u64			cookie = 0;
> > +
> > +	if (iomap->flags & IOMAP_F_XATTR) {
> > +		cookie = READ_ONCE(ip->i_af.if_seq);
> > +	} else {
> > +		if ((iomap->flags & IOMAP_F_SHARED) && ip->i_cowfp)
> > +			cookie = (u64)READ_ONCE(ip->i_cowfp->if_seq) << 32;
> > +		cookie |= READ_ONCE(ip->i_df.if_seq);
> > +	}
> > +	return cookie == iomap->validity_cookie;
> 
> How can this be called with IOMAP_F_XATTR set?

It can't be. *Yet*.

However: ->iomap_begin can be called to map an attribute fork, and
->iomap_valid *could* be called on the attr iomap that is returned.
At which point, the iomap needs to be self-describing and the
validation cookie needs to hold the attr fork sequence number for
everything to work.

Failing to handle/validate attr iomaps correctly would clearly be an
XFS implementation bug and making it work correctly is trivial and
costs nothing. Hence making it work is a no-brainer - leaving it as
a landmine for someone else to trip over in future is a poor outcome
for everyone.

> Also the code seems to duplicate xfs_iomap_inode_sequence, so
> we just call that:
> 
> 	return cookie == xfs_iomap_inode_sequence(XFS_I(inode), iomap->flags);

Yup, will fix.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-15  1:30 ` [PATCH 5/9] xfs: buffered write failure should not truncate the page cache Dave Chinner
  2022-11-15  8:43   ` Christoph Hellwig
@ 2022-11-16  0:48   ` Darrick J. Wong
  2022-11-17  1:06     ` Dave Chinner
  2022-11-16 13:57   ` Brian Foster
  2 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2022-11-16  0:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_buffered_write_iomap_end() currently invalidates the page cache
> over the unused range of the delalloc extent it allocated. While the
> write allocated the delalloc extent, it does not own it exclusively
> as the write does not hold any locks that prevent either writeback
> or mmap page faults from changing the state of either the page cache
> or the extent state backing this range.
> 
> Whilst xfs_bmap_punch_delalloc_range() already handles races in
> extent conversion - it will only punch out delalloc extents and it
> ignores any other type of extent - the page cache truncate does not
> discriminate between data written by this write or some other task.
> As a result, truncating the page cache can result in data corruption
> if the write races with mmap modifications to the file over the same
> range.
> 
> generic/346 exercises this workload, and if we randomly fail writes
> (as will happen when iomap gets stale iomap detection later in the
> patchset), it will randomly corrupt the file data because it removes
> data written by mmap() in the same page as the write() that failed.
> 
> Hence we do not want to punch out the page cache over the range of
> the extent we failed to write to - what we actually need to do is
> detect the ranges that have dirty data in cache over them and *not
> punch them out*.
> 
> TO do this, we have to walk the page cache over the range of the
> delalloc extent we want to remove. This is made complex by the fact
> we have to handle partially up-to-date folios correctly and this can
> happen even when the FSB size == PAGE_SIZE because we now support
> multi-page folios in the page cache.
> 
> Because we are only interested in discovering the edges of data
> ranges in the page cache (i.e. hole-data boundaries) we can make use
> of mapping_seek_hole_data() to find those transitions in the page
> cache. As we hold the invalidate_lock, we know that the boundaries
> are not going to change while we walk the range. This interface is
> also byte-based and is sub-page block aware, so we can find the data
> ranges in the cache based on byte offsets rather than page, folio or
> fs block sized chunks. This greatly simplifies the logic of finding
> dirty cached ranges in the page cache.
> 
> Once we've identified a range that contains cached data, we can then
> iterate the range folio by folio. This allows us to determine if the
> data is dirty and hence perform the correct delalloc extent punching
> operations. The seek interface we use to iterate data ranges will
> give us sub-folio start/end granularity, so we may end up looking up
> the same folio multiple times as the seek interface iterates across
> each discontiguous data region in the folio.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 141 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7bb55dbc19d3..2d48fcc7bd6f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
>  				end_fsb - start_fsb);
>  }
>  
> +/*
> + * Scan the data range passed to us for dirty page cache folios. If we find a
> + * dirty folio, punch out the preceeding range and update the offset from which
> + * the next punch will start from.
> + *
> + * We can punch out clean pages because they either contain data that has been
> + * written back - in which case the delalloc punch over that range is a no-op -
> + * or they have been read faults in which case they contain zeroes and we can
> + * remove the delalloc backing range and any new writes to those pages will do
> + * the normal hole filling operation...
> + *
> + * This makes the logic simple: we only need to keep the delalloc extents
> + * over the dirty ranges of the page cache.
> + */
> +static int
> +xfs_buffered_write_delalloc_scan(
> +	struct inode		*inode,
> +	loff_t			*punch_start_byte,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	loff_t			offset = start_byte;
> +
> +	while (offset < end_byte) {
> +		struct folio	*folio;
> +
> +		/* grab locked page */
> +		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
> +		if (!folio) {
> +			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
> +			continue;
> +		}
> +
> +		/* if dirty, punch up to offset */
> +		if (folio_test_dirty(folio)) {
> +			if (offset > *punch_start_byte) {
> +				int	error;
> +
> +				error = xfs_buffered_write_delalloc_punch(inode,
> +						*punch_start_byte, offset);

This sounds an awful lot like what iomap_writeback_ops.discard_folio()
does, albeit without the xfs_alert screaming everywhere.

Moving along... so we punch out delalloc reservations for any part of
the page cache that is clean.  "punch_start_byte" is the start pos of
the last range of clean cache, and we want to punch up to the start of
this dirty folio...

> +				if (error) {
> +					folio_unlock(folio);
> +					folio_put(folio);
> +					return error;
> +				}
> +			}
> +
> +			/*
> +			 * Make sure the next punch start is correctly bound to
> +			 * the end of this data range, not the end of the folio.
> +			 */
> +			*punch_start_byte = min_t(loff_t, end_byte,
> +					folio_next_index(folio) << PAGE_SHIFT);

...and then start a new clean range after this folio (or at the end_byte
to signal that we're done)...

> +		}
> +
> +		/* move offset to start of next folio in range */
> +		offset = folio_next_index(folio) << PAGE_SHIFT;
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Punch out all the delalloc blocks in the range given except for those that
> + * have dirty data still pending in the page cache - those are going to be
> + * written and so must still retain the delalloc backing for writeback.
> + *
> + * As we are scanning the page cache for data, we don't need to reimplement the
> + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> + * start and end of data ranges correctly even for sub-folio block sizes. This
> + * byte range based iteration is especially convenient because it means we don't
> + * have to care about variable size folios, nor where the start or end of the
> + * data range lies within a folio, if they lie within the same folio or even if
> + * there are multiple discontiguous data ranges within the folio.
> + */
> +static int
> +xfs_buffered_write_delalloc_release(
> +	struct inode		*inode,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	loff_t			punch_start_byte = start_byte;
> +	int			error = 0;
> +
> +	/*
> +	 * Lock the mapping to avoid races with page faults re-instantiating
> +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> +	 * cache and perform delalloc extent removal. Failing to do this can
> +	 * leave dirty pages with no space reservation in the cache.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +	while (start_byte < end_byte) {
> +		loff_t		data_end;
> +
> +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> +				start_byte, end_byte, SEEK_DATA);
> +		/*
> +		 * If there is no more data to scan, all that is left is to
> +		 * punch out the remaining range.
> +		 */
> +		if (start_byte == -ENXIO || start_byte == end_byte)
> +			break;
> +		if (start_byte < 0) {
> +			error = start_byte;
> +			goto out_unlock;
> +		}
> +		ASSERT(start_byte >= punch_start_byte);
> +		ASSERT(start_byte < end_byte);
> +
> +		/*
> +		 * We find the end of this contiguous cached data range by
> +		 * seeking from start_byte to the beginning of the next hole.
> +		 */
> +		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> +				end_byte, SEEK_HOLE);
> +		if (data_end < 0) {
> +			error = data_end;
> +			goto out_unlock;
> +		}
> +		ASSERT(data_end > start_byte);
> +		ASSERT(data_end <= end_byte);
> +
> +		error = xfs_buffered_write_delalloc_scan(inode,
> +				&punch_start_byte, start_byte, data_end);

...and we use SEEK_HOLE/SEEK_DATA to find the ranges of pagecache where
there's even going to be folios mapped.  But in structuring the code
like this, xfs now has to know details about folio state again, and the
point of iomap/buffered-io.c is to delegate handling of the pagecache
away from XFS, at least for filesystems that want to manage buffered IO
the same way XFS does.

IOWs, I agree with Christoph that these two functions that compute the
ranges that need delalloc-punching really ought to be in the iomap code.
TBH I wonder if this isn't better suited for being called by
iomap_write_iter after a short write on an IOMAP_F_NEW iomap, with a
function pointer in iomap page_ops for iomap to tell xfs to drop the
delalloc reservations.
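
As a rough illustration of that idea - the hook name and placement
below are assumptions, not anything posted in this series - the
page_ops addition might look like:

/*
 * Illustrative sketch only: a hook iomap could call after a short
 * write on an IOMAP_F_NEW mapping so the filesystem can drop the
 * unused delalloc reservation. Not a posted API.
 */
struct iomap_page_ops {
	/* existing ->page_prepare/->page_done/->iomap_valid hooks... */
	int (*punch_delalloc)(struct inode *inode, loff_t start_byte,
			      loff_t end_byte);
};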

--D

> +		if (error)
> +			goto out_unlock;
> +
> +		/* The next data search starts at the end of this one. */
> +		start_byte = data_end;
> +	}
> +
> +	if (punch_start_byte < end_byte)
> +		error = xfs_buffered_write_delalloc_punch(inode,
> +				punch_start_byte, end_byte);
> +out_unlock:
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return error;
> +}
> +
>  static int
>  xfs_buffered_write_iomap_end(
>  	struct inode		*inode,
> @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end(
>  	if (start_byte >= end_byte)
>  		return 0;
>  
> -	/*
> -	 * Lock the mapping to avoid races with page faults re-instantiating
> -	 * folios and dirtying them via ->page_mkwrite between the page cache
> -	 * truncation and the delalloc extent removal. Failing to do this can
> -	 * leave dirty pages with no space reservation in the cache.
> -	 */
> -	filemap_invalidate_lock(inode->i_mapping);
> -	truncate_pagecache_range(inode, start_byte, end_byte - 1);
> -	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
> -	filemap_invalidate_unlock(inode->i_mapping);
> +	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
>  	if (error && !xfs_is_shutdown(mp)) {
>  		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
>  			__func__, XFS_I(inode)->i_ino);
> -- 
> 2.37.2
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range
  2022-11-15 23:48     ` Darrick J. Wong
@ 2022-11-16  0:57       ` Dave Chinner
  2022-11-16  5:46         ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2022-11-16  0:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 03:48:33PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 15, 2022 at 12:44:46AM -0800, Christoph Hellwig wrote:
> > On Tue, Nov 15, 2022 at 12:30:40PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > All the callers of xfs_bmap_punch_delalloc_range() jump through
> > > hoops to convert a byte range to filesystem blocks before calling
> > > xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
> > > xfs_bmap_punch_delalloc_range() and have it do the conversion to
> > > filesystem blocks internally.
> > 
> > Ok, so we do this here.   Why can't we just do this earlier and
> > avoid the strange inbetween stage with a wrapper?
> 
> Probably to avoid rewriting the previous patch with this units change,
> is my guess.  Dave, do you want to merge with this patch 4?  I'm
> satisfied enough with patches 4 and 6 that I'd rather get this out to
> for-next for further testing than play more patch golf.

The fact that Christoph NAK'd exporting mapping_seek_hole_data()
this time around is just Plain Fucking Annoying. He didn't say
anything in the last thread about it after I explained why I used
it, so I had assumed it was all OK.

I've already been playing patch golf all morning now to rearrange
all the deck chairs to avoid exporting mapping_seek_hole_data().
Instead we now have an exported iomap function that wraps
mapping_seek_hole_data, and the wrapper function I added in patch 4
is now the callback function that is passed 3 layers deep into the
iomap code.

Hence the xfs_buffered_write_delalloc_punch() function needs to
remain - we can't elide it entirely like this patch does - because now
we need a callback function that we can provide to the iomap code so
we avoid coupling internal XFS implementation functions to external
VFS APIs.
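
In code terms, the shape being described is roughly the following
sketch; the iomap helper name and signature are assumptions based on
this discussion, not the reworked patches themselves:

/*
 * Sketch only: an exported iomap helper owns the
 * mapping_seek_hole_data() walk and the dirty folio filtering, and
 * XFS passes the byte-range punch wrapper from patch 4 as the
 * callback.
 */
int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
		loff_t pos, loff_t length, ssize_t written,
		int (*punch)(struct inode *inode, loff_t start_byte,
			     loff_t end_byte));

/* called from xfs_buffered_write_iomap_end() on a short write: */
	error = iomap_file_buffered_write_punch_delalloc(inode, offset,
			length, written, xfs_buffered_write_delalloc_punch);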

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range
  2022-11-16  0:57       ` Dave Chinner
@ 2022-11-16  5:46         ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2022-11-16  5:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm

On Wed, Nov 16, 2022 at 11:57:23AM +1100, Dave Chinner wrote:
> The fact that Christoph NAK'd exporting mapping_seek_hole_data()
> this time around is just Plain Fucking Annoying.

Thank you so much, I love you too.

> He didn't say
> anything in the last thread about it after I explained why I used
> it, so I had assumed it was all OK.

Quote from the last thread:

"So, the whole scan for delalloc logic seems pretty generic, I think
it can and should be lifted to iomap, with
xfs_buffered_write_delalloc_punch provided as a callback."

> I've already been playing patch golf all morning now to rearrange
> all the deck chairs to avoid exporting mapping_seek_hole_data().
> Instead we now have an exported iomap function that wraps
> mapping_seek_hole_data, and the wrapper function I added in patch 4
> is now the callback function that is passed 3 layers deep into the
> iomap code.
> 
> Hence the xfs_buffered_write_delalloc_punch() function needs to
> remain - we can't elide it entire like this patch does - because now
> we need a callback function that we can provide to the iomap code so
> we avoid coupling internal XFS implementation functions to external
> VFS APIs.

Which was roughly was the intention all along.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-15  1:30 ` [PATCH 5/9] xfs: buffered write failure should not truncate the page cache Dave Chinner
  2022-11-15  8:43   ` Christoph Hellwig
  2022-11-16  0:48   ` Darrick J. Wong
@ 2022-11-16 13:57   ` Brian Foster
  2022-11-17  0:41     ` Dave Chinner
  2 siblings, 1 reply; 32+ messages in thread
From: Brian Foster @ 2022-11-16 13:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 141 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7bb55dbc19d3..2d48fcc7bd6f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
>  				end_fsb - start_fsb);
>  }
>  
...
> +/*
> + * Punch out all the delalloc blocks in the range given except for those that
> + * have dirty data still pending in the page cache - those are going to be
> + * written and so must still retain the delalloc backing for writeback.
> + *
> + * As we are scanning the page cache for data, we don't need to reimplement the
> + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> + * start and end of data ranges correctly even for sub-folio block sizes. This
> + * byte range based iteration is especially convenient because it means we don't
> + * have to care about variable size folios, nor where the start or end of the
> + * data range lies within a folio, if they lie within the same folio or even if
> + * there are multiple discontiguous data ranges within the folio.
> + */
> +static int
> +xfs_buffered_write_delalloc_release(
> +	struct inode		*inode,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	loff_t			punch_start_byte = start_byte;
> +	int			error = 0;
> +
> +	/*
> +	 * Lock the mapping to avoid races with page faults re-instantiating
> +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> +	 * cache and perform delalloc extent removal. Failing to do this can
> +	 * leave dirty pages with no space reservation in the cache.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +	while (start_byte < end_byte) {
> +		loff_t		data_end;
> +
> +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> +				start_byte, end_byte, SEEK_DATA);

FWIW, the fact that mapping seek data is based on uptodate status means
that seek behavior can change based on prior reads. For example, see how
seek hole/data presents reads of unwritten ranges as data [1]. The same
thing isn't observable for holes because iomap doesn't check the mapping
in that case, but underlying iop state is the same and that is what this
code is looking at.

The filtering being done here means we essentially only care about dirty
pages backed by delalloc blocks. That means if you get here with a dirty
page and the portion of the page affected by this failed write is
uptodate, this won't punch an underlying delalloc block even though
nothing else may have written to it in the meantime. That sort of state
can be created by a prior read of the range on a sub-page block size fs,
or perhaps a racing async readahead (via read fault of a lower
offset..?), etc.

I suspect this is not a serious error because the page is dirty and
writeback will thus convert the block. The only exception to that I can
see is if the block is beyond EOF (consider a mapped read to a page that
straddles EOF, followed by a post-eof write that fails), writeback won't
actually map the block directly. It may convert if contiguous with
delalloc blocks inside EOF (and sufficiently sized physical extents
exist), or even if not, should still otherwise be cleaned up by the
various other means we already have to manage post-eof blocks.

So IOW there's a tradeoff being made here for possible spurious
allocation and I/O and a subtle dependency on writeback that should
probably be documented somewhere. The larger concern is that if
writeback eventually changes based on dirty range tracking in a way that
breaks this dependency, that introduces yet another stale delalloc block
landmine associated with this error handling code (regardless of whether
you want to call that a bug in this code, seek data, whatever), and
those problems are difficult enough to root cause as it is.

Brian

[1]

# xfs_io -fc "falloc 0 4k" -c "seek -a 0" -c "pread 0 4k" -c "seek -a 0" <file>
Whence  Result
HOLE    0
read 4096/4096 bytes at offset 0
4 KiB, 4 ops; 0.0000 sec (156 MiB/sec and 160000.0000 ops/sec)
Whence  Result
DATA    0
HOLE    4096

> +		/*
> +		 * If there is no more data to scan, all that is left is to
> +		 * punch out the remaining range.
> +		 */
> +		if (start_byte == -ENXIO || start_byte == end_byte)
> +			break;
> +		if (start_byte < 0) {
> +			error = start_byte;
> +			goto out_unlock;
> +		}
> +		ASSERT(start_byte >= punch_start_byte);
> +		ASSERT(start_byte < end_byte);
> +
> +		/*
> +		 * We find the end of this contiguous cached data range by
> +		 * seeking from start_byte to the beginning of the next hole.
> +		 */
> +		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> +				end_byte, SEEK_HOLE);
> +		if (data_end < 0) {
> +			error = data_end;
> +			goto out_unlock;
> +		}
> +		ASSERT(data_end > start_byte);
> +		ASSERT(data_end <= end_byte);
> +
> +		error = xfs_buffered_write_delalloc_scan(inode,
> +				&punch_start_byte, start_byte, data_end);
> +		if (error)
> +			goto out_unlock;
> +
> +		/* The next data search starts at the end of this one. */
> +		start_byte = data_end;
> +	}
> +
> +	if (punch_start_byte < end_byte)
> +		error = xfs_buffered_write_delalloc_punch(inode,
> +				punch_start_byte, end_byte);
> +out_unlock:
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return error;
> +}
> +
>  static int
>  xfs_buffered_write_iomap_end(
>  	struct inode		*inode,
> @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end(
>  	if (start_byte >= end_byte)
>  		return 0;
>  
> -	/*
> -	 * Lock the mapping to avoid races with page faults re-instantiating
> -	 * folios and dirtying them via ->page_mkwrite between the page cache
> -	 * truncation and the delalloc extent removal. Failing to do this can
> -	 * leave dirty pages with no space reservation in the cache.
> -	 */
> -	filemap_invalidate_lock(inode->i_mapping);
> -	truncate_pagecache_range(inode, start_byte, end_byte - 1);
> -	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
> -	filemap_invalidate_unlock(inode->i_mapping);
> +	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
>  	if (error && !xfs_is_shutdown(mp)) {
>  		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
>  			__func__, XFS_I(inode)->i_ino);
> -- 
> 2.37.2
> 
> 




* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-16 13:57   ` Brian Foster
@ 2022-11-17  0:41     ` Dave Chinner
  2022-11-17 18:28       ` Darrick J. Wong
  2022-11-18 17:20       ` Brian Foster
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-17  0:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> ...
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 141 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> >  				end_fsb - start_fsb);
> >  }
> >  
> ...
> > +/*
> > + * Punch out all the delalloc blocks in the range given except for those that
> > + * have dirty data still pending in the page cache - those are going to be
> > + * written and so must still retain the delalloc backing for writeback.
> > + *
> > + * As we are scanning the page cache for data, we don't need to reimplement the
> > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > + * byte range based iteration is especially convenient because it means we don't
> > + * have to care about variable size folios, nor where the start or end of the
> > + * data range lies within a folio, if they lie within the same folio or even if
> > + * there are multiple discontiguous data ranges within the folio.
> > + */
> > +static int
> > +xfs_buffered_write_delalloc_release(
> > +	struct inode		*inode,
> > +	loff_t			start_byte,
> > +	loff_t			end_byte)
> > +{
> > +	loff_t			punch_start_byte = start_byte;
> > +	int			error = 0;
> > +
> > +	/*
> > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > +	 * cache and perform delalloc extent removal. Failing to do this can
> > +	 * leave dirty pages with no space reservation in the cache.
> > +	 */
> > +	filemap_invalidate_lock(inode->i_mapping);
> > +	while (start_byte < end_byte) {
> > +		loff_t		data_end;
> > +
> > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > +				start_byte, end_byte, SEEK_DATA);
> 
> FWIW, the fact that mapping seek data is based on uptodate status means
> that seek behavior can change based on prior reads.

Yup. It should be obvious that any page cache scan based algorithm
will change based on changing page cache residency.

> For example, see how
> seek hole/data presents reads of unwritten ranges as data [1]. The same
> thing isn't observable for holes because iomap doesn't check the mapping
> in that case, but underlying iop state is the same and that is what this
> code is looking at.

Well, yes.  That's the fundamental, underlying issue that this
patchset is addressing for the write() operation: that the page
cache contents and the underlying filesystem extent map are not
guaranteed to be coherent and can be changed independently of each
other.

The whole problem with looking exclusively at filesystem level
extent state (and hence FIEMAP) is that the extent state doesn't
tell us whether there is uncommitted data over the range of the extent
in the page cache.  The filesystem extent state and page cache data
*can't be coherent* in a writeback caching environment. This is the
fundamental difference between what the filesystem extent map tells
us (FIEMAP) and what querying the page cache tells us
(SEEK_DATA/SEEK_HOLE).

This is also the underlying problem with iomap_truncate_page() - it
fails to query the page cache for data over unwritten extents, so
fails to zero the post-EOF part of dirty folios over unwritten
extents and so it all goes wrong...

> The filtering being done here means we essentially only care about dirty
> pages backed by delalloc blocks. That means if you get here with a dirty
> page and the portion of the page affected by this failed write is
> uptodate, this won't punch an underlying delalloc block even though
> nothing else may have written to it in the meantime.

Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap
will not span ranges that have pre-existing *dirty* data in the
page cache. Those *must* already have (del)allocated extents, hence
the iomap for the newly allocated delalloc extent will always end
before pre-existing dirty data in the page cache starts.

Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
precludes stepping into ranges that have pre-existing cached dirty
data.

We also can't get a racing write() to the same range right now
because this is all under IOLOCK_EXCL, hence we only ever see dirty
folios as a result of race with page faults. page faults zero the
entire folio they insert into the page cache and
iomap_folio_mkwrite_iter() asserts that the entire folio is marked
up to date. Hence if we find a dirty folio outside the range the
write() dirtied, we are guaranteed that the entire dirty folio is up
to date....

Yes, there can be pre-existing *clean* folios (and clean partially
up to date folios) in the page cache, but we won't have dirty
partially up to date pages in the middle of the range we are
scanning. Hence we only need to care about the edge cases (folios
that overlap start and ends). We skip the partially written start
block, and we always punch up to the end block if it is different
from the last block we punched up to. If the end of the data spans
into a dirty folio, we know that dirty range is up to date because
the seek scan only returns ranges that are up to date. Hence we
don't punch those partial blocks out....

Regardless, let's assume we have a racing write that has partially
updated and dirtied a folio (because we've moved to
XFS_IOLOCK_SHARED locking for writes). This case is already handled
by the mapping_seek_hole_data() based iteration.

That is, the mapping_seek_hole_data() loop provides us with
*discrete ranges of up to date data* that are independent of folio
size, up-to-date range granularity, dirty range tracking, filesystem
block size, etc.

Hence if the next discrete range we discover is in the same dirty
folio as the previous discrete range of up to date data, we know we
have a sub-folio sized hole in the data that is not up to date.
Because there is no data over this range, we have to punch out the
underlying delalloc extent over that range. 

IOWs, the dirty state of the folio and/or the granularity of the
dirty range tracking is irrelevant here - we know there was no data
in the cache (clean or dirty) over this range because it is
discontiguous with the previous range of data returned.

IOWs, if we have this "up to date" map on a dirty folio like this:

Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+

Then the unrolled iteration and punching we do would look like this:

First iteration of the range:

punch_start:
		V
		+-------+UUUUUUU+-------+UUUUUUU+-------+

SEEK_DATA:		V
		+-------+UUUUUUU+-------+UUUUUUU+-------+
SEEK_HOLE:			^
Data range:		+UUUUUUU+
Punch range:	+-------+
Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+

Second iteration:

punch_start			V
		+-------+UUUUUUU+-------+UUUUUUU+-------+
SEEK_DATA:				V
		+-------+UUUUUUU+-------+UUUUUUU+-------+
SEEK_HOLE:					^
Data range:				+UUUUUUU+
Punch range:			+-------+
Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+

Third iteration:

punch_start					V
		+-------+UUUUUUU+-------+UUUUUUU+-------+
SEEK_DATA: - moves into next folio in cache
....
Punch range:					+-------+ ......
Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
			(to end of scan range or start of next data)

As you can see, this scan does not care about folio size, sub-folio
range granularity or filesystem block sizes.  It also matches
exactly how writeback handles dirty, partially up to date folios, so
there's no stray delalloc blocks left around to be tripped over
after failed or short writes occur.
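
As an illustrative aside (not part of the patch), the byte-range
iteration pattern itself is easy to see from userspace with plain
lseek(), which offers SEEK_DATA/SEEK_HOLE with similar - though not
identical - semantics to the in-kernel page cache seek. A minimal
sketch that prints the discrete data ranges of a file and the gaps
between them, the gaps being what the loop above would punch:

/* sketch only: walk data ranges with SEEK_DATA/SEEK_HOLE */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	off_t pos = 0, data, hole;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* SEEK_DATA fails with ENXIO once no data remains past pos */
	while ((data = lseek(fd, pos, SEEK_DATA)) >= 0) {
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			break;
		if (data > pos)
			printf("gap:  [%lld, %lld)\n",
					(long long)pos, (long long)data);
		printf("data: [%lld, %lld)\n",
				(long long)data, (long long)hole);
		pos = hole;
	}
	/* any trailing gap up to the scan end is not printed */
	close(fd);
	return 0;
}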

Indeed, if we move to sub-folio dirty range tracking, we can simply
add a mapping_seek_hole_data() variant that walks dirty ranges in
the page cache rather than up to date ranges. Then we can remove the
inner loop from this code that looks up folios to determine dirty
state. The above algorithm does not change - we just walk from
discrete range to discrete range punching the gaps between them....

IOWs, the algorithm is largely future proof - the only thing that
needs to change if we change iomap to track sub-folio dirty ranges
is how we check the data range for being dirty. That should be no
surprise, really, the surprise should be that we can make some
simple mods to page cache seek to remove the need for checking dirty
state in this code altogether....

> That sort of state
> can be created by a prior read of the range on a sub-page block size fs,
> or perhaps a racing async readahead (via read fault of a lower
> offset..?), etc.

Yup, generic/346 exercises this racing unaligned, sub-folio mmap
write vs write() case. This test, specifically, was the reason I
moved to using mapping_seek_hole_data() - g/346 found an endless
stream of bugs in the sub-multi-page-folio range iteration code I
kept trying to write....

> I suspect this is not a serious error because the page is dirty
> and writeback will thus convert the block. The only exception to
> that I can see is if the block is beyond EOF (consider a mapped
> read to a page that straddles EOF, followed by a post-eof write
> that fails), writeback won't actually map the block directly.

I don't think that can happen. iomap_write_failed() calls
truncate_pagecache_range() to remove any newly instantiated cached
data beyond the original EOF. Hence the delalloc punch will remove
everything beyond the original EOF that was allocated for the failed
write. Hence when we get to writeback we're not going to find any
up-to-date data beyond the EOF block in the page cache, nor any
stray delalloc blocks way beyond EOF....
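
For reference, the relevant check is tiny - paraphrasing
iomap_write_failed() from fs/iomap/buffered-io.c of this era from
memory (the tree is authoritative), it only truncates the newly
instantiated cache beyond the pre-write EOF:

/* paraphrased for reference - see fs/iomap/buffered-io.c */
static void
iomap_write_failed(struct inode *inode, loff_t pos, size_t len)
{
	loff_t i_size = i_size_read(inode);

	/*
	 * Only truncate newly allocated pages beyond EOF, even if the
	 * write started inside the existing inode size.
	 */
	if (pos + len > i_size)
		truncate_pagecache_range(inode, max(pos, i_size),
				pos + len - 1);
}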

> It may convert if contiguous with delalloc blocks inside EOF (and
> sufficiently sized physical extents exist), or even if not, should
> still otherwise be cleaned up by the various other means we
> already have to manage post-eof blocks.
>
> So IOW there's a tradeoff being made here for possible spurious
> allocation and I/O and a subtle dependency on writeback that
> should probably be documented somewhere.

As per above, I don't think there is any spurious/stale allocation
left behind by the punch code, nor is there any dependency on
writeback to clean up such issues.

> The larger concern is that if
> writeback eventually changes based on dirty range tracking in a way that
> breaks this dependency, that introduces yet another stale delalloc block
> landmine associated with this error handling code (regardless of whether
> you want to call that a bug in this code, seek data, whatever), and
> those problems are difficult enough to root cause as it is.

If iomap changes how it tracks dirty ranges, this punch code only
needs small changes to work with that correctly. There aren't any
unknown landmines here - if we change dirty tracking, we know that
we have to update the code that depends on the existing dirty
tracking mechanisms to work correctly with the new infrastructure...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-16  0:48   ` Darrick J. Wong
@ 2022-11-17  1:06     ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2022-11-17  1:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 15, 2022 at 04:48:58PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > +/*
> > + * Scan the data range passed to us for dirty page cache folios. If we find a
> > + * dirty folio, punch out the preceding range and update the offset from which
> > + * the next punch will start from.
> > + *
> > + * We can punch out clean pages because they either contain data that has been
> > + * written back - in which case the delalloc punch over that range is a no-op -
> > + * or they have been read faults in which case they contain zeroes and we can
> > + * remove the delalloc backing range and any new writes to those pages will do
> > + * the normal hole filling operation...
> > + *
> > + * This makes the logic simple: we only need to keep the delalloc extents
> > + * over the dirty ranges of the page cache.
> > + */
> > +static int
> > +xfs_buffered_write_delalloc_scan(
> > +	struct inode		*inode,
> > +	loff_t			*punch_start_byte,
> > +	loff_t			start_byte,
> > +	loff_t			end_byte)
> > +{
> > +	loff_t			offset = start_byte;
> > +
> > +	while (offset < end_byte) {
> > +		struct folio	*folio;
> > +
> > +		/* grab locked page */
> > +		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
> > +		if (!folio) {
> > +			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		/* if dirty, punch up to offset */
> > +		if (folio_test_dirty(folio)) {
> > +			if (offset > *punch_start_byte) {
> > +				int	error;
> > +
> > +				error = xfs_buffered_write_delalloc_punch(inode,
> > +						*punch_start_byte, offset);
> 
> This sounds an awful lot like what iomap_writeback_ops.discard_folio()
> does, albeit without the xfs_alert screaming everywhere.

similar, but .discard_folio() is trashing uncommitted written data,
whilst this loop is explicitly preserving uncommitted written
data....

> Moving along... so we punch out delalloc reservations for any part of
> the page cache that is clean.  "punch_start_byte" is the start pos of
> the last range of clean cache, and we want to punch up to the start of
> this dirty folio...
> 
> > +				if (error) {
> > +					folio_unlock(folio);
> > +					folio_put(folio);
> > +					return error;
> > +				}
> > +			}
> > +
> > +			/*
> > +			 * Make sure the next punch start is correctly bound to
> > +			 * the end of this data range, not the end of the folio.
> > +			 */
> > +			*punch_start_byte = min_t(loff_t, end_byte,
> > +					folio_next_index(folio) << PAGE_SHIFT);
> 
> ...and then start a new clean range after this folio (or at the end_byte
> to signal that we're done)...

Yes.

> > +	filemap_invalidate_lock(inode->i_mapping);
> > +	while (start_byte < end_byte) {
> > +		loff_t		data_end;
> > +
> > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > +				start_byte, end_byte, SEEK_DATA);
> > +		/*
> > +		 * If there is no more data to scan, all that is left is to
> > +		 * punch out the remaining range.
> > +		 */
> > +		if (start_byte == -ENXIO || start_byte == end_byte)
> > +			break;
> > +		if (start_byte < 0) {
> > +			error = start_byte;
> > +			goto out_unlock;
> > +		}
> > +		ASSERT(start_byte >= punch_start_byte);
> > +		ASSERT(start_byte < end_byte);
> > +
> > +		/*
> > +		 * We find the end of this contiguous cached data range by
> > +		 * seeking from start_byte to the beginning of the next hole.
> > +		 */
> > +		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> > +				end_byte, SEEK_HOLE);
> > +		if (data_end < 0) {
> > +			error = data_end;
> > +			goto out_unlock;
> > +		}
> > +		ASSERT(data_end > start_byte);
> > +		ASSERT(data_end <= end_byte);
> > +
> > +		error = xfs_buffered_write_delalloc_scan(inode,
> > +				&punch_start_byte, start_byte, data_end);
> 
> ...and we use SEEK_HOLE/SEEK_DATA to find the ranges of pagecache where
> there's even going to be folios mapped.  But in structuring the code
> like this, xfs now has to know details about folio state again, and the
> point of iomap/buffered-io.c is to delegate handling of the pagecache
> away from XFS, at least for filesystems that want to manage buffered IO
> the same way XFS does.
> 
> IOWs, I agree with Christoph that these two functions that compute the
> ranges that need delalloc-punching really ought to be in the iomap code.

Which I've already done.

> TBH I wonder if this isn't better suited for being called by
> iomap_write_iter after a short write on an IOMAP_F_NEW iomap, with an
> function pointer in iomap page_ops for iomap to tell xfs to drop the
> delalloc reservations.

IIRC, the whole point of the iomap_begin()/iomap_end() operation
pairs is that the iter functions don't need to concern themselves
with the filesystem operations needed to manipulate extents and
clean up filesystem state as a result of failed operations.

I'm pretty sure we need this same error handling for zeroing and
unshare operations, because they could also detect stale cached
iomaps and have to go remap their extents. Maybe they don't allocate
new extents, but that is beside the point - the error handling that
is necessary is common to multiple functions, and right now they
don't have to care about cleaning up iomap/filesystem state when
short writes/errors occur.

Hence I don't think we want to break the architectural layering by
doing this - I think it would just lead to having to handle the same
issues in multiple places, and we don't gain any extra control over
or information about how to perform the cleanup of the unused
portion of the iomap....
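
To make the layering concrete: the cleanup hangs off ->iomap_end, so
the iterator never sees it. The sketch below is not the real XFS
code - the example_* names are made up - it just shows the shape of
the hook, where a short or failed write surfaces as written < length
(the real code also rounds the range to filesystem block boundaries):

/* sketch only - example_* names are hypothetical */
static int
example_buffered_write_iomap_end(
	struct inode		*inode,
	loff_t			pos,
	loff_t			length,
	ssize_t			written,
	unsigned		flags,
	struct iomap		*iomap)
{
	loff_t			start_byte = pos + written;
	loff_t			end_byte = pos + length;

	/* Only newly allocated delalloc mappings need cleaning up. */
	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
		return 0;

	/* Everything that was actually written gets to keep its blocks. */
	if (start_byte >= end_byte)
		return 0;

	/*
	 * Release the delalloc backing for the unused remainder of the
	 * mapping, preserving ranges that still have dirty data in the
	 * page cache.
	 */
	return example_delalloc_release(inode, start_byte, end_byte);
}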

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-17  0:41     ` Dave Chinner
@ 2022-11-17 18:28       ` Darrick J. Wong
  2022-11-18 17:20       ` Brian Foster
  1 sibling, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2022-11-17 18:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs, linux-fsdevel, linux-mm

On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > ...
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 141 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> > >  				end_fsb - start_fsb);
> > >  }
> > >  
> > ...
> > > +/*
> > > + * Punch out all the delalloc blocks in the range given except for those that
> > > + * have dirty data still pending in the page cache - those are going to be
> > > + * written and so must still retain the delalloc backing for writeback.
> > > + *
> > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > + * byte range based iteration is especially convenient because it means we don't
> > > + * have to care about variable size folios, nor where the start or end of the
> > > + * data range lies within a folio, if they lie within the same folio or even if
> > > + * there are multiple discontiguous data ranges within the folio.
> > > + */
> > > +static int
> > > +xfs_buffered_write_delalloc_release(
> > > +	struct inode		*inode,
> > > +	loff_t			start_byte,
> > > +	loff_t			end_byte)
> > > +{
> > > +	loff_t			punch_start_byte = start_byte;
> > > +	int			error = 0;
> > > +
> > > +	/*
> > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > +	 * leave dirty pages with no space reservation in the cache.
> > > +	 */
> > > +	filemap_invalidate_lock(inode->i_mapping);
> > > +	while (start_byte < end_byte) {
> > > +		loff_t		data_end;
> > > +
> > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > +				start_byte, end_byte, SEEK_DATA);
> > 
> > FWIW, the fact that mapping seek data is based on uptodate status means
> > that seek behavior can change based on prior reads.
> 
> Yup. It should be obvious that any page cache scan based algorithm
> will change based on changing page cache residency.
> 
> > For example, see how
> > seek hole/data presents reads of unwritten ranges as data [1]. The same
> > thing isn't observable for holes because iomap doesn't check the mapping
> > in that case, but underlying iop state is the same and that is what this
> > code is looking at.
> 
> Well, yes.  That's the fundamental, underlying issue that this
> patchset is addressing for the write() operation: that the page
> cache contents and the underlying filesystem extent map are not
> guaranteed to be coherent and can be changed independently of each
> other.
> 
> The whole problem with looking exclusively at filesystem level
> extent state (and hence FIEMAP) is that the extent state doesn't
> tell us whether there is uncommitted data over the range of the extent
> in the page cache.  The filesystem extent state and page cache data
> *can't be coherent* in a writeback caching environment. This is the
> fundamental difference between what the filesystem extent map tells
> us (FIEMAP) and what querying the page cache tells us
> (SEEK_DATA/SEEK_HOLE).
> 
> This is also the underlying problem with iomap_truncate_page() - it
> fails to query the page cache for data over unwritten extents, so
> fails to zero the post-EOF part of dirty folios over unwritten
> extents and so it all goes wrong...
> 
> > The filtering being done here means we essentially only care about dirty
> > pages backed by delalloc blocks. That means if you get here with a dirty
> > page and the portion of the page affected by this failed write is
> > uptodate, this won't punch an underlying delalloc block even though
> > nothing else may have written to it in the meantime.
> 
> Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap
> will not span ranges that have pre-existing *dirty* data in the
> page cache. Those *must* already have (del)allocated extents, hence
> the iomap for the newly allocated delalloc extent will always end
> before pre-existing dirty data in the page cache starts.
> 
> Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
> precludes stepping into ranges that have pre-existing cached dirty
> data.
> 
> We also can't get a racing write() to the same range right now
> because this is all under IOLOCK_EXCL, hence we only ever see dirty
> folios as a result of race with page faults. page faults zero the
> entire folio they insert into the page cache and
> iomap_folio_mkwrite_iter() asserts that the entire folio is marked
> up to date. Hence if we find a dirty folio outside the range the
> write() dirtied, we are guaranteed that the entire dirty folio is up
> to date....
> 
> Yes, there can be pre-existing *clean* folios (and clean partially
> up to date folios) in the page cache, but we won't have dirty
> partially up to date pages in the middle of the range we are
> scanning. Hence we only need to care about the edge cases (folios
> that overlap start and ends). We skip the partially written start
> block, and we always punch up to the end block if it is different
> from the last block we punched up to. If the end of the data spans
> into a dirty folio, we know that dirty range is up to date because
> the seek scan only returns ranges that are up to date. Hence we
> don't punch those partial blocks out....
> 
> Regardless, let's assume we have a racing write that has partially
> updated and dirtied a folio (because we've moved to
> XFS_IOLOCK_SHARED locking for writes). This case is already handled
> by the mapping_seek_hole_data() based iteration.
> 
> That is, the mapping_seek_hole_data() loop provides us with
> *discrete ranges of up to date data* that are independent of folio
> size, up-to-date range granularity, dirty range tracking, filesystem
> block size, etc.
> 
> Hence if the next discrete range we discover is in the same dirty
> folio as the previous discrete range of up to date data, we know we
> have a sub-folio sized hole in the data that is not up to date.
> Because there is no data over this range, we have to punch out the
> underlying delalloc extent over that range. 
> 
> IOWs, the dirty state of the folio and/or the granularity of the
> dirty range tracking is irrelevant here - we know there was no data
> in the cache (clean or dirty) over this range because it is
> discontiguous with the previous range of data returned.
> 
> IOWs, if we have this "up to date" map on a dirty folio like this:
> 
> Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
> Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> 
> Then the unrolled iteration and punching we do would look like this:
> 
> First iteration of the range:
> 
> punch_start:
> 		V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> 
> SEEK_DATA:		V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:			^
> Data range:		+UUUUUUU+
> Punch range:	+-------+
> Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> 
> Second iteration:
> 
> punch_start			V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA:				V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:					^
> Data range:				+UUUUUUU+
> Punch range:			+-------+
> Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
> 
> Third iteration:
> 
> punch_start					V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA: - moves into next folio in cache
> ....
> Punch range:					+-------+ ......
> Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
> 			(to end of scan range or start of next data)
> 
> As you can see, this scan does not care about folio size, sub-folio
> range granularity or filesystem block sizes.  It also matches
> exactly how writeback handles dirty, partially up to date folios, so
> there's no stray delalloc blocks left around to be tripped over
> after failed or short writes occur.

I had been wondering about the particular case of dealing with partially
uptodate folios, so it was really helpful to watch this all in
ASCIIVision(tm).

> Indeed, if we move to sub-folio dirty range tracking, we can simply
> add a mapping_seek_hole_data() variant that walks dirty ranges in
> the page cache rather than up to date ranges. Then we can remove the
> inner loop from this code that looks up folios to determine dirty
> state. The above algorithm does not change - we just walk from
> discrete range to discrete range punching the gaps between them....
> 
> IOWs, the algorithm is largely future proof - the only thing that
> needs to change if we change iomap to track sub-folio dirty ranges
> is how we check the data range for being dirty. That should be no
> surprise, really, the surprise should be that we can make some
> simple mods to page cache seek to remove the need for checking dirty
> state in this code altogether....
> 
> > That sort of state
> > can be created by a prior read of the range on a sub-page block size fs,
> > or perhaps a racing async readahead (via read fault of a lower
> > offset..?), etc.
> 
> Yup, generic/346 exercises this racing unaligned, sub-folio mmap
> write vs write() case. This test, specifically, was the reason I
> moved to using mapping_seek_hole_data() - g/346 found an endless
> stream of bugs in the sub-multi-page-folio range iteration code I
> kept trying to write....
> 
> > I suspect this is not a serious error because the page is dirty
> > and writeback will thus convert the block. The only exception to
> > that I can see is if the block is beyond EOF (consider a mapped
> > read to a page that straddles EOF, followed by a post-eof write
> > that fails), writeback won't actually map the block directly.
> 
> I don't think that can happen. iomap_write_failed() calls
> truncate_pagecache_range() to remove any newly instantiated cached
> data beyond the original EOF. Hence the delalloc punch will remove
> everything beyond the original EOF that was allocated for the failed
> write. Hence when we get to writeback we're not going to find any
> up-to-date data beyond the EOF block in the page cache, nor any
> stray delalloc blocks way beyond EOF....
> 
> > It may convert if contiguous with delalloc blocks inside EOF (and
> > sufficiently sized physical extents exist), or even if not, should
> > still otherwise be cleaned up by the various other means we
> > already have to manage post-eof blocks.
> >
> > So IOW there's a tradeoff being made here for possible spurious
> > allocation and I/O and a subtle dependency on writeback that
> > should probably be documented somewhere.
> 
> As per above, I don't think there is any spurious/stale allocation
> left behind by the punch code, nor is there any dependency on
> writeback to clean up such issues.
> 
> > The larger concern is that if
> > writeback eventually changes based on dirty range tracking in a way that
> > breaks this dependency, that introduces yet another stale delalloc block
> > landmine associated with this error handling code (regardless of whether
> > you want to call that a bug in this code, seek data, whatever), and
> > those problems are difficult enough to root cause as it is.
> 
> If iomap changes how it tracks dirty ranges, this punch code only
> needs small changes to work with that correctly. There aren't any
> unknown landmines here - if we change dirty tracking, we know that
> we have to update the code that depends on the existing dirty
> tracking mechanisms to work correctly with the new infrastructure...

I hope that anyone trying to change the iomap dirty tracking code will
find it easier to do with the iteration code in buffered-io.c.  Thanks
for taking the time to rework that part for v3.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-17  0:41     ` Dave Chinner
  2022-11-17 18:28       ` Darrick J. Wong
@ 2022-11-18 17:20       ` Brian Foster
  2022-11-21 23:13         ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Brian Foster @ 2022-11-18 17:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > ...
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 141 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> > >  				end_fsb - start_fsb);
> > >  }
> > >  
> > ...
> > > +/*
> > > + * Punch out all the delalloc blocks in the range given except for those that
> > > + * have dirty data still pending in the page cache - those are going to be
> > > + * written and so must still retain the delalloc backing for writeback.
> > > + *
> > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > + * byte range based iteration is especially convenient because it means we don't
> > > + * have to care about variable size folios, nor where the start or end of the
> > > + * data range lies within a folio, if they lie within the same folio or even if
> > > + * there are multiple discontiguous data ranges within the folio.
> > > + */
> > > +static int
> > > +xfs_buffered_write_delalloc_release(
> > > +	struct inode		*inode,
> > > +	loff_t			start_byte,
> > > +	loff_t			end_byte)
> > > +{
> > > +	loff_t			punch_start_byte = start_byte;
> > > +	int			error = 0;
> > > +
> > > +	/*
> > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > +	 * leave dirty pages with no space reservation in the cache.
> > > +	 */
> > > +	filemap_invalidate_lock(inode->i_mapping);
> > > +	while (start_byte < end_byte) {
> > > +		loff_t		data_end;
> > > +
> > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > +				start_byte, end_byte, SEEK_DATA);
> > 
> > FWIW, the fact that mapping seek data is based on uptodate status means
> > that seek behavior can change based on prior reads.
> 
> Yup. It should be obvious that any page cache scan based algorithm
> will change based on changing page cache residency.
> 
> > For example, see how
> > seek hole/data presents reads of unwritten ranges as data [1]. The same
> > thing isn't observable for holes because iomap doesn't check the mapping
> > in that case, but underlying iop state is the same and that is what this
> > code is looking at.
> 
> Well, yes.  That's the fundamental, underlying issue that this
> patchset is addressing for the write() operation: that the page
> cache contents and the underlying filesystem extent map are not
> guaranteed to be coherent and can be changed independently of each
> other.
> 
> The whole problem with looking exclusively at filesystem level
> extent state (and hence FIEMAP) is that the extent state doesn't
> > tell us whether there is uncommitted data over the range of the extent
> in the page cache.  The filesystem extent state and page cache data
> *can't be coherent* in a writeback caching environment. This is the
> fundamental difference between what the filesystem extent map tells
> us (FIEMAP) and what querying the page cache tells us
> (SEEK_DATA/SEEK_HOLE).
> 
> This is also the underlying problem with iomap_truncate_page() - it
> fails to query the page cache for data over unwritten extents, so
> fails to zero the post-EOF part of dirty folios over unwritten
> extents and so it all goes wrong...
> 
> > The filtering being done here means we essentially only care about dirty
> > pages backed by delalloc blocks. That means if you get here with a dirty
> > page and the portion of the page affected by this failed write is
> > uptodate, this won't punch an underlying delalloc block even though
> > nothing else may have written to it in the meantime.
> 
> Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap
> will not span ranges that have pre-existing *dirty* data in the
> page cache. Those *must* already have (del)allocated extents, hence
> the iomap for the newly allocated delalloc extent will always end
> before pre-existing dirty data in the page cache starts.
> 
> Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
> precludes stepping into ranges that have pre-existing cached dirty
> data.
> 
> We also can't get a racing write() to the same range right now
> because this is all under IOLOCK_EXCL, hence we only ever see dirty
> folios as a result of race with page faults. page faults zero the
> entire folio they insert into the page cache and
> iomap_folio_mkwrite_iter() asserts that the entire folio is marked
> up to date. Hence if we find a dirty folio outside the range the
> write() dirtied, we are guaranteed that the entire dirty folio is up
> to date....
> 
> Yes, there can be pre-existing *clean* folios (and clean partially
> up to date folios) in the page cache, but we won't have dirty
> partially up to date pages in the middle of the range we are
> scanning. Hence we only need to care about the edge cases (folios
> that overlap start and ends). We skip the partially written start
> block, and we always punch up to the end block if it is different
> from the last block we punched up to. If the end of the data spans
> into a dirty folio, we know that dirty range is up to date because
> the seek scan only returns ranges that are up to date. Hence we
> don't punch those partial blocks out....
> 
> Regardless, let's assume we have a racing write that has partially
> updated and dirtied a folio (because we've moved to
> XFS_IOLOCK_SHARED locking for writes). This case is already handled
> by the mapping_seek_hole_data() based iteration.
> 
> That is, the mapping_seek_hole_data() loop provides us with
> *discrete ranges of up to date data* that are independent of folio
> size, up-to-date range granularity, dirty range tracking, filesystem
> block size, etc.
> 
> Hence if the next discrete range we discover is in the same dirty
> folio as the previous discrete range of up to date data, we know we
> have a sub-folio sized hole in the data that is not up to date.
> Because there is no data over this range, we have to punch out the
> underlying delalloc extent over that range. 
> 
> IOWs, the dirty state of the folio and/or the granularity of the
> dirty range tracking is irrelevant here - we know there was no data
> > in the cache (clean or dirty) over this range because it is
> discontiguous with the previous range of data returned.
> 
> IOWs, if we have this "up to date" map on a dirty folio like this:
> 
> Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
> Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> 
> Then the unrolled iteration and punching we do would look like this:
> 
> First iteration of the range:
> 
> punch_start:
> 		V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> 
> SEEK_DATA:		V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:			^
> Data range:		+UUUUUUU+
> Punch range:	+-------+
> Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> 
> Second iteration:
> 
> punch_start			V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA:				V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:					^
> Data range:				+UUUUUUU+
> Punch range:			+-------+
> Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
> 
> Third iteration:
> 
> punch_start					V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA: - moves into next folio in cache
> ....
> Punch range:					+-------+ ......
> Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
> 			(to end of scan range or start of next data)
> 
> As you can see, this scan does not care about folio size, sub-folio
> range granularity or filesystem block sizes.  It also matches
> exactly how writeback handles dirty, partially up to date folios, so
> there's no stray delalloc blocks left around to be tripped over
> after failed or short writes occur.
> 
> Indeed, if we move to sub-folio dirty range tracking, we can simply
> add a mapping_seek_hole_data() variant that walks dirty ranges in
> the page cache rather than up to date ranges. Then we can remove the
> inner loop from this code that looks up folios to determine dirty
> state. The above algorithm does not change - we just walk from
> discrete range to discrete range punching the gaps between them....
> 
> IOWs, the algorithm is largely future proof - the only thing that
> needs to change if we change iomap to track sub-folio dirty ranges
> is how we check the data range for being dirty. That should be no
> surprise, really, the surprise should be that we can make some
> simple mods to page cache seek to remove the need for checking dirty
> state in this code altogether....
> 
> > That sort of state
> > can be created by a prior read of the range on a sub-page block size fs,
> > or perhaps a racing async readahead (via read fault of a lower
> > offset..?), etc.
> 
> Yup, generic/346 exercises this racing unaligned, sub-folio mmap
> write vs write() case. This test, specifically, was the reason I
> moved to using mapping_seek_hole_data() - g/346 found an endless
> stream of bugs in the sub-multi-page-folio range iteration code I
> kept trying to write....
> 
> > I suspect this is not a serious error because the page is dirty
> > and writeback will thus convert the block. The only exception to
> > that I can see is if the block is beyond EOF (consider a mapped
> > read to a page that straddles EOF, followed by a post-eof write
> > that fails), writeback won't actually map the block directly.
> 
> I don't think that can happen. iomap_write_failed() calls
> truncate_pagecache_range() to remove any newly instantiated cached
> data beyond the original EOF. Hence the delalloc punch will remove
> everything beyond the original EOF that was allocated for the failed
> write. Hence when we get to writeback we're not going to find any
> up-to-date data beyond the EOF block in the page cache, nor any
> stray delalloc blocks way beyond EOF....
> 

It can happen if the page straddles eof. I don't think much of the above
relates to the behavior I'm describing. This doesn't require racing
writes or theoretical shared write locking, and the IOMAP_F_NEW flag
and core iteration algorithm are not major factors, etc.

It's easier to just use examples. Consider the following sequence on a
bsize=1k fs on a kernel with this patch series. Note that my xfs_io is
hacked up to turn 'pwrite -f' into a "fail" flag that triggers the
-EFAULT error check in iomap_write_iter() by passing a bad user buffer:

# set eof and make sure entire page is uptodate (even posteof)
$XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k" $file
# do a delalloc and open/close cycle to set dirty release
$XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file
# fail an appending write to a posteof, discontig, uptodate range on already dirty page
$XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c "fiemap -v" $file

wrote 1/1 bytes at offset 0
1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec)
pwrite: Bad address
/mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..1]:          22..23               2   0x0
   1: [2..3]:          hole                 2
   2: [4..5]:          0..1                 2   0x7

So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block
there that hasn't been written to. I call this a landmine because it's an
unexpected/untracked instance of post-eof delalloc. I.e., post-eof block
reclamation won't pick it up because the inode was never tagged for speculative
preallocation:

$ xfs_spaceman -c "prealloc -s" /mnt
$ xfs_io -c "fiemap -v" /mnt/file 
/mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..1]:          22..23               2   0x0
   1: [2..3]:          hole                 2
   2: [4..5]:          0..1                 2   0x7

At the same time, it's not an observable problem in XFS because inode
reclaim will eventually truncate that block off due to the delalloc
block check, which will prevent any sort of space accounting corruption.
I also don't think anybody will miss a single block dangling off like
that in the meantime. The only other odd side effect I can think of is
if this happened and the user performed an explicit post-eof fallocate,
reclaim could now lop that off due to the last resort delalloc block
check. That can still technically happen anyways as a side effect of
speculative prealloc so I don't see that as a major problem either.

The more important point here is that just because XFS can handle this
situation reasonably gracefully doesn't mean every other filesystem
expects it.

> > It may convert if contiguous with delalloc blocks inside EOF (and
> > sufficiently sized physical extents exist), or even if not, should
> > still otherwise be cleaned up by the various other means we
> > already have to manage post-eof blocks.
> >
> > So IOW there's a tradeoff being made here for possible spurious
> > allocation and I/O and a subtle dependency on writeback that
> > should probably be documented somewhere.
> 
> As per above, I don't think there is any spurious/stale allocation
> left behind by the punch code, nor is there any dependency on
> writeback to clean up such issues.
> 

The same sort of example as above (but within eof) demonstrates the
writeback dependency. I.e., with this series alone the punch code leaves
behind an in-eof uptodate delalloc block, but writeback comes along and
maps it regardless because it also only has uptodate granularity within
the dirty folio.

When writeback changes to only map dirty sub-ranges, clearly that no
longer happens and the landmine turns into a stale delalloc leak. This
is also easy to reproduce with the last dirty range RFC patches pulled
on top of this series.

The changes you describe above to seek out and punch all !dirty delalloc
ranges should address both of these problems, because the error handling
code doesn't consider i_size the way writeback does.

> > The larger concern is that if
> > writeback eventually changes based on dirty range tracking in a way that
> > breaks this dependency, that introduces yet another stale delalloc block
> > landmine associated with this error handling code (regardless of whether
> > you want to call that a bug in this code, seek data, whatever), and
> > those problems are difficult enough to root cause as it is.
> 
> If iomap changes how it tracks dirty ranges, this punch code only
> needs small changes to work with that correctly. There aren't any
> unknown landmines here - if we change dirty tracking, we know that
> we have to update the code that depends on the existing dirty
> tracking mechanisms to work correctly with the new infrastructure...
> 

I'm just pointing out there are side effects / corner cases that don't
depend on the complicated shared locking / racing write fault scenarios
described above. Therefore, this code probably needs to be fixed at the
same time dirty range tracking is enabled to not introduce stale
delalloc problems for XFS.

In the meantime, I don't think any fs developer just looking to use or
debug the iomap layer and considering this punch error handling
mechanism should be expected to immediately connect the dots: the
mechanism relies on pagecache seek, pagecache seek relies on Uptodate
state, that has behavioral implications for what gets punched and when,
writeback may not map post-eof ranges, and therefore the fs in question
may need its own custom post-eof error handling if dangling post-eof
blocks aren't expected anywhere else for that fs.

It may be possible to address that by doing more aggressive punching
past the i_size boundary here, but that comes with other tradeoffs.
Short of that, I was really just asking for a more descriptive comment
for the punch mechanism so these quirks are not lost before they can be
properly fixed up or are apparent to any new users besides XFS this
might attract in the meantime. For example, something like the following
added to the _punch_delalloc() comment (from v3):

"
...

Note that since this depends on pagecache seek data, and that depends on
folio Uptodate state to identify data regions, the scanning behavior
here can be non-deterministic based on the state of cache at the time of
the failed operation. For example, if the target folio of a failed write
was already dirty and the target (sub-folio) range was an Uptodate hole
before the failing write (which may have performed delayed allocation
over it), we have no choice but to skip the punch for that sub-range
of the folio because we can't tell whether sub-folio Uptodate ranges
were actually written to or not. This means that delalloc block
allocations may persist for writes that never succeeded.

This can similarly skip punches beyond EOF for failed post-eof writes,
except note that writeback will not explicitly map blocks beyond i_size.
Therefore, filesystems that cannot gracefully accommodate dangling
post-eof blocks via other means may need to check for and handle that
case directly after calling this function.

These quirks should all be addressed with proper sub-folio dirty range
tracking, at which point we can deterministically scan and punch
non-dirty delalloc ranges at the time of write failure.
"

... but I also don't want to go in circles just over a comment, so I'm
content that it's at least documented on the list here.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 




* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-18 17:20       ` Brian Foster
@ 2022-11-21 23:13         ` Dave Chinner
  2022-11-23 17:25           ` Brian Foster
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2022-11-21 23:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Fri, Nov 18, 2022 at 12:20:45PM -0500, Brian Foster wrote:
> On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> > On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > ...
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> > > >  1 file changed, 141 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> > > >  				end_fsb - start_fsb);
> > > >  }
> > > >  
> > > ...
> > > > +/*
> > > > + * Punch out all the delalloc blocks in the range given except for those that
> > > > + * have dirty data still pending in the page cache - those are going to be
> > > > + * written and so must still retain the delalloc backing for writeback.
> > > > + *
> > > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > > + * byte range based iteration is especially convenient because it means we don't
> > > > + * have to care about variable size folios, nor where the start or end of the
> > > > + * data range lies within a folio, if they lie within the same folio or even if
> > > > + * there are multiple discontiguous data ranges within the folio.
> > > > + */
> > > > +static int
> > > > +xfs_buffered_write_delalloc_release(
> > > > +	struct inode		*inode,
> > > > +	loff_t			start_byte,
> > > > +	loff_t			end_byte)
> > > > +{
> > > > +	loff_t			punch_start_byte = start_byte;
> > > > +	int			error = 0;
> > > > +
> > > > +	/*
> > > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > > +	 * leave dirty pages with no space reservation in the cache.
> > > > +	 */
> > > > +	filemap_invalidate_lock(inode->i_mapping);
> > > > +	while (start_byte < end_byte) {
> > > > +		loff_t		data_end;
> > > > +
> > > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > > +				start_byte, end_byte, SEEK_DATA);
> > > 
> > > FWIW, the fact that mapping seek data is based on uptodate
> > > status means that seek behavior can change based on prior
> > > reads.
> > 
> > Yup. It should be obvious that any page cache scan based
> > algorithm will change based on changing page cache residency.
> > 
> > > For example, see how seek hole/data presents reads of
> > > unwritten ranges as data [1]. The same thing isn't observable
> > > for holes because iomap doesn't check the mapping in that
> > > case, but underlying iop state is the same and that is what
> > > this code is looking at.
> > 
> > Well, yes.  That's the fundamental, underlying issue that this
> > patchset is addressing for the write() operation: that the page
> > cache contents and the underlying filesystem extent map are not
> > guaranteed to be coherent and can be changed independently of
> > each other.
> > 
> > The whole problem with looking exclusively at filesystem level
> > extent state (and hence FIEMAP) is that the extent state doesn't
> > tell us whether there is uncommitted data over the range of the
> > extent in the page cache.  The filesystem extent state and page
> > cache data *can't be coherent* in a writeback caching
> > environment. This is the fundamental difference between what the
> > filesystem extent map tells us (FIEMAP) and what querying the
> > page cache tells us (SEEK_DATA/SEEK_HOLE).
> > 
> > This is also the underlying problem with iomap_truncate_page() -
> > it fails to query the page cache for data over unwritten
> > extents, so fails to zero the post-EOF part of dirty folios over
> > unwritten extents and so it all goes wrong...
> > 
> > > The filtering being done here means we essentially only care
> > > about dirty pages backed by delalloc blocks. That means if you
> > > get here with a dirty page and the portion of the page
> > > affected by this failed write is uptodate, this won't punch an
> > > underlying delalloc block even though nothing else may have
> > > written to it in the meantime.
> > 
> > Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc
> > iomap will not span ranges that have pre-existing *dirty* data
> > in the page cache. Those *must* already have (del)allocated
> > extents, hence the iomap for the newly allocated delalloc extent
> > will always end before pre-existing dirty data in the page cache
> > starts.
> > 
> > Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
> > precludes stepping into ranges that have pre-existing cached
> > dirty data.
> > 
> > We also can't get a racing write() to the same range right now
> > because this is all under IOLOCK_EXCL, hence we only ever see
> > dirty folios as a result of a race with page faults. Page faults
> > zero the entire folio they insert into the page cache and
> > iomap_folio_mkwrite_iter() asserts that the entire folio is
> > marked up to date. Hence if we find a dirty folio outside the
> > range the write() dirtied, we are guaranteed that the entire
> > dirty folio is up to date....
> > 
> > Yes, there can be pre-existing *clean* folios (and clean
> > partially up to date folios) in the page cache, but we won't
> > have dirty partially up to date pages in the middle of the range
> > we are scanning. Hence we only need to care about the edge cases
> > (folios that overlap start and ends). We skip the partially
> > written start block, and we always punch up to the end block if
> > it is different from the last block we punched up to. If the end
> > of the data spans into a dirty folio, we know that dirty range
> > is up to date because the seek scan only returns ranges that are
> > up to date. Hence we don't punch those partial blocks out....
> > 
> > Regardless, let's assume we have a racing write that has
> > partially updated and dirtied a folio (because we've moved to
> > XFS_IOLOCK_SHARED locking for writes). This case is already
> > handled by the mapping_seek_hole_data() based iteration.
> > 
> > That is, the mapping_seek_hole_data() loop provides us with
> > *discrete ranges of up to date data* that are independent of
> > folio size, up-to-date range granularity, dirty range tracking,
> > filesystem block size, etc.
> > 
> > Hence if the next discrete range we discover is in the same
> > dirty folio as the previous discrete range of up to date data,
> > we know we have a sub-folio sized hole in the data that is not
> > up to date.  Because there is no data over this range, we have
> > to punch out the underlying delalloc extent over that range. 
> > 
> > IOWs, the dirty state of the folio and/or the granularity of the
> > dirty range tracking is irrelevant here - we know there was no
> > data in the cache (clean or dirty) over this range because it is
> > discontiguous with the previous range of data returned.
> > 
> > IOWs, if we have this "up to date" map on a dirty folio like
> > this:
> > 
> > Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> > 
> > Then the unrolled iteration and punching we do would look like
> > this:
> > 
> > First iteration of the range:
> > 
> > punch_start:	V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > 
> > SEEK_DATA:		V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > SEEK_HOLE:			^
> > Data range:		+UUUUUUU+
> > Punch range:	+-------+
> > Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> > 
> > Second iteration:
> > 
> > punch_start			V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > SEEK_DATA:				V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > SEEK_HOLE:					^
> > Data range:				+UUUUUUU+
> > Punch range:			+-------+
> > Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
> > 
> > Third iteration:
> > 
> > punch_start					V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > SEEK_DATA: - moves into next folio in cache
> > ....
> > Punch range:				+-------+ ......
> > Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
> > 		(to end of scan range or start of next data)
> > 
> > As you can see, this scan does not care about folio size,
> > sub-folio range granularity or filesystem block sizes.  It also
> > matches exactly how writeback handles dirty, partially up to
> > date folios, so there's no stray delalloc blocks left around to
> > be tripped over after failed or short writes occur.
> > 
> > Indeed, if we move to sub-folio dirty range tracking, we can
> > simply add a mapping_seek_hole_data() variant that walks dirty
> > ranges in the page cache rather than up to date ranges. Then we
> > can remove the inner loop from this code that looks up folios to
> > determine dirty state. The above algorithm does not change - we
> > just walk from discrete range to discrete range punching the
> > gaps between them....
> > 
> > IOWs, the algorithm is largely future proof - the only thing
> > that needs to change if we change iomap to track sub-folio dirty
> > ranges is how we check the data range for being dirty. That
> > should be no surprise, really, the surprise should be that we
> > can make some simple mods to page cache seek to remove the need
> > for checking dirty state in this code altogether....
> > 
> > > That sort of state can be created by a prior read of the range
> > > on a sub-page block size fs, or perhaps a racing async
> > > readahead (via read fault of a lower offset..?), etc.
> > 
> > Yup, generic/346 exercises this racing unaligned, sub-folio mmap
> > write vs write() case. This test, specifically, was the reason I
> > moved to using mapping_seek_hole_data() - g/346 found an endless
> > stream of bugs in the sub-multi-page-folio range iteration code
> > I kept trying to write....
> > 
> > > I suspect this is not a serious error because the page is
> > > dirty and writeback will thus convert the block. The only
> > > exception to that I can see is if the block is beyond EOF
> > > (consider a mapped read to a page that straddles EOF, followed
> > > by a post-eof write that fails), writeback won't actually map
> > > the block directly.
> > 
> > I don't think that can happen. iomap_write_failed() calls
> > truncate_pagecache_range() to remove any newly instantiated
> > cached data beyond the original EOF. Hence the delalloc punch
> > will remove everything beyond the original EOF that was
> > allocated for the failed write. Hence when we get to writeback
> > we're not going to find any up-to-date data beyond the EOF block
> > in the page cache, nor any stray delalloc blocks way beyond
> > EOF....
> > 
> 
> It can happen if the page straddles eof. I don't think much of the
> above relates to the behavior I'm describing. This doesn't require
> racing writes or theoretical shared write locking; the IOMAP_F_NEW
> flag and core iteration algorithm are not major factors, etc.
> 
> It's easier to just use examples. Consider the following sequence
> on a bsize=1k fs on a kernel with this patch series. Note that my
> xfs_io is hacked up to turn 'pwrite -f' into a "fail" flag that
> triggers the -EFAULT error check in iomap_write_iter() by passing
> a bad user buffer:
> 
> # set eof and make sure entire page is uptodate (even posteof)
> $XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k" $file
> # do a delalloc and open/close cycle to set dirty release
> $XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file
> # fail an appending write to a posteof, discontig, uptodate range on
> # already dirty page
> $XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c "fiemap -v" $file
> 
Firstly, can you tell me which patchset you reproduced this on? Just
a vanilla kernel, the version of the patchset in this thread or the
latest one I sent out?

Secondly, can you please send a patch with that failed write
functionality upstream for xfs_io - it is clearly useful.

> wrote 1/1 bytes at offset 0
> 1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec)
> pwrite: Bad address
> /mnt/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..1]:          22..23               2   0x0
>    1: [2..3]:          hole                 2
>    2: [4..5]:          0..1                 2   0x7
> 
> So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block
> there that hasn't been written to.

Ok, so this is failing before even iomap_write_begin() is called,
so it's a failed write that has not touched page cache state, but
has allocated a delalloc block beyond EOF.

This is a case that writeback handles because writeback is supposed
to skip pages and blocks beyond EOF. i.e. iomap_writepage_map() is
passed the current EOF to terminate the writeback mapping loop when
the folio being written back spans EOF. Hence iomap_writepage_map()
never walks past EOF and so does not ever iterate filesystem blocks
that contain up to date data in the page cache beyond EOF. That's
why writeback leaves that stray delalloc block behind, even though
it's covered by up to date data.

IOWs, it appears that the problem here is that the page cache
considers data in the folio beyond EOF to be up to date and contain
data, whilst most filesystems do not consider data beyond EOF to be
valid. This means there is a boundary condition that
mapping_seek_hole_data() doesn't handle correctly - it completely
ignores EOF and so SEEK_DATA for an offset past EOF will report data
beyond EOF if the file has been mmap()d and the last folio faulted
in.

This specific boundary condition behaviour is hidden by
iomap_seek_data() and shmem_file_llseek() by not allowing seeks to
start and/or end beyond EOF. Hence they never call
mapping_seek_hole_data() with a start byte beyond EOF and so will
never try to find data in the page cache beyond EOF.

That looks easy enough to fix - just consider any data range
returned that is beyond i_size_read() to require punching, and then
the semantics match writeback ignoring anything beyond EOF....
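
i.e. something like this in the scan loop of
xfs_buffered_write_delalloc_release() - a completely untested sketch
of the idea, where the i_size_read() clamp is the only addition and
the rest is the loop quoted above:

	loff_t		isize = i_size_read(inode);

	while (start_byte < end_byte) {
		start_byte = mapping_seek_hole_data(inode->i_mapping,
				start_byte, end_byte, SEEK_DATA);
		/*
		 * No data left in the scan range, or the "data" found
		 * starts beyond EOF. Writeback will never look at it,
		 * so break out and punch everything remaining, exactly
		 * as we already do for the no-more-data case.
		 */
		if (start_byte == -ENXIO || start_byte == end_byte ||
		    start_byte >= isize)
			break;

		/* ... existing data walk and gap punching unchanged ... */
	}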

> I call this a landmine because it's an
> unexpected/untracked instance of post-eof delalloc. I.e., post-eof block
> reclamation won't pick it up because the inode was never tagged for speculative
> preallocation:

Of course - this wasn't allocated by speculative preallocation -
that doesn't kick in until the file is >64kB in length, so
xfs_bmapi_reserve_delalloc() doesn't set the flag saying there's
post-eof preallocation to trim away on the inode.

> > > It may convert if contiguous with delalloc blocks inside EOF (and
> > > sufficiently sized physical extents exist), or even if not, should
> > > still otherwise be cleaned up by the various other means we
> > > already have to manage post-eof blocks.
> > >
> > > So IOW there's a tradeoff being made here for possible spurious
> > > allocation and I/O and a subtle dependency on writeback that
> > > should probably be documented somewhere.
> > 
> > As per above, I don't think there is any spurious/stale allocation
> > left behind by the punch code, nor is there any dependency on
> > writeback to ignore such issues.
> > 
> 
> The same sort of example as above (but within eof) demonstrates the
> writeback dependency. I.e., with this series alone the punch code leaves
> behind an in-eof uptodate delalloc block, but writeback comes along and
> maps it regardless because it also only has uptodate granularity within
> the dirty folio.

This "unflushable delalloc extent" case can't happen within EOF. The
page fault initialises the folio full of zeroes, so when we get the
failed write, the only change is that we now have a single delalloc
block within EOF that writeback will iterate over, allocate and
write all the up to date zeros to it.

Sure, this error path might cause a small amount of extra IO over
the dirty folio, but it does not cause any other sort of issue.
There are no stray delalloc blocks, there are no data corruption
issues, there are no stale data exposure issues, etc. Leaving a
delalloc block under up-to-date page cache data should not cause
a user visible issue.

> When writeback changes to only map dirty sub-ranges, clearly that no
> longer happens and the landmine turns into a stale delalloc leak. This
> is also easy to reproduce with the last dirty range RFC patches pulled
> on top of this series.

No, not even then. If we have sub-folio dirty ranges, we know this
failed write range is *still clean* and so we'll punch it out,
regardless of whether it is inside or outside EOF.

> It may be possible to address that by doing more aggressive punching
> past the i_size boundary here, but that comes with other tradeoffs.

Tradeoffs such as .... ?

> For example, if the target folio of a failed write was already
> dirty, and the target (sub-folio) range was previously an Uptodate
> hole over which the failing write may have performed delayed
> allocation, we have no choice but to skip the punch for that
> sub-range of the folio because we can't tell whether sub-folio
> Uptodate ranges were actually written to or not. This means that
> delalloc block allocations may persist for writes that never
> succeeded.

I can add something like that, though I'd need to make sure it is
clear that this is not a data corruption vector, just a side effect
of a highly unlikely set of corner case conditions we only care
about handling without corruption, not efficiency or performance.
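
Something along these lines, maybe - wording only, to sit on top of
xfs_buffered_write_delalloc_release():

/*
 * Note: a failed write can leave a delalloc block behind under a
 * sub-folio range that was already dirty and up to date before the
 * write started. We cannot tell whether such an up to date range was
 * actually written to, so we have to skip punching it out. This is
 * not a data corruption vector - at worst writeback does a little
 * extra IO writing zeroes - but it does mean a delalloc allocation
 * can persist for a write that never succeeded.
 */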

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/9] xfs: buffered write failure should not truncate the page cache
  2022-11-21 23:13         ` Dave Chinner
@ 2022-11-23 17:25           ` Brian Foster
  0 siblings, 0 replies; 32+ messages in thread
From: Brian Foster @ 2022-11-23 17:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 22, 2022 at 10:13:04AM +1100, Dave Chinner wrote:
> On Fri, Nov 18, 2022 at 12:20:45PM -0500, Brian Foster wrote:
> > On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > > > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > ...
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> ---
> > > > > fs/xfs/xfs_iomap.c | 151
> > > > > ++++++++++++++++++++++++++++++++++++++++++--- 1 file
> > > > > changed, 141 insertions(+), 10 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index
> > > > > 7bb55dbc19d3..2d48fcc7bd6f 100644 --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c @@ -1134,6 +1134,146 @@
> > > > > xfs_buffered_write_delalloc_punch( end_fsb - start_fsb); }
> > > > >  
> > > > ...
> > > > > +/*
> > > > > + * Punch out all the delalloc blocks in the range given except for those that
> > > > > + * have dirty data still pending in the page cache - those are going to be
> > > > > + * written and so must still retain the delalloc backing for writeback.
> > > > > + *
> > > > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > > > + * byte range based iteration is especially convenient because it means we don't
> > > > > + * have to care about variable size folios, nor where the start or end of the
> > > > > + * data range lies within a folio, if they lie within the same folio or even if
> > > > > + * there are multiple discontiguous data ranges within the folio.
> > > > > + */
> > > > > +static int
> > > > > +xfs_buffered_write_delalloc_release(
> > > > > +	struct inode		*inode,
> > > > > +	loff_t			start_byte,
> > > > > +	loff_t			end_byte)
> > > > > +{
> > > > > +	loff_t			punch_start_byte = start_byte;
> > > > > +	int			error = 0;
> > > > > +
> > > > > +	/*
> > > > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > > > +	 * leave dirty pages with no space reservation in the cache.
> > > > > +	 */
> > > > > +	filemap_invalidate_lock(inode->i_mapping);
> > > > > +	while (start_byte < end_byte) {
> > > > > +		loff_t		data_end;
> > > > > +
> > > > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > > > +				start_byte, end_byte, SEEK_DATA);
> > > > 
> > > > FWIW, the fact that mapping seek data is based on uptodate
> > > > status means that seek behavior can change based on prior
> > > > reads.
> > > 
> > > Yup. It should be obvious that any page cache scan based
> > > algorithm will change based on changing page cache residency.
> > > 
> > > > For example, see how seek hole/data presents reads of
> > > > unwritten ranges as data [1]. The same thing isn't observable
> > > > for holes because iomap doesn't check the mapping in that
> > > > case, but underlying iop state is the same and that is what
> > > > this code is looking at.
> > > 
> > > Well, yes.  That's the fundamental, underlying issue that this
> > > patchset is addressing for the write() operation: that the page
> > > cache contents and the underlying filesystem extent map are not
> > > guaranteed to be coherent and can be changed independently of
> > > each other.
> > > 
> > > The whole problem with looking exclusively at filesystem level
> > > extent state (and hence FIEMAP) is that the extent state doesn't
> > > tell us whether there is uncommitted data over the range of the
> > > extent in the page cache.  The filesystem extent state and page
> > > cache data *can't be coherent* in a writeback caching
> > > environment. This is the fundamental difference between what the
> > > filesystem extent map tells us (FIEMAP) and what querying the
> > > page cache tells us (SEEK_DATA/SEEK_HOLE).
> > > 
> > > This is also the underlying problem with iomap_truncate_page() -
> > > it fails to query the page cache for data over unwritten
> > > extents, so fails to zero the post-EOF part of dirty folios over
> > > unwritten extents and so it all goes wrong...
> > > 
> > > > The filtering being done here means we essentially only care
> > > > about dirty pages backed by delalloc blocks. That means if you
> > > > get here with a dirty page and the portion of the page
> > > > affected by this failed write is uptodate, this won't punch an
> > > > underlying delalloc block even though nothing else may have
> > > > written to it in the meantime.
> > > 
> > > Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc
> > > iomap will not span ranges that have pre-existing *dirty* data
> > > in the page cache. Those *must* already have (del)allocated
> > > extents, hence the iomap for the newly allocated delalloc extent
> > > will always end before pre-existing dirty data in the page cache
> > > starts.
> > > 
> > > Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
> > > precludes stepping into ranges that have pre-existing cached
> > > dirty data.
> > > 
> > > We also can't get a racing write() to the same range right now
> > > because this is all under IOLOCK_EXCL, hence we only ever see
> > > dirty folios as a result of a race with page faults. Page faults
> > > zero the entire folio they insert into the page cache and
> > > iomap_folio_mkwrite_iter() asserts that the entire folio is
> > > marked up to date. Hence if we find a dirty folio outside the
> > > range the write() dirtied, we are guaranteed that the entire
> > > dirty folio is up to date....
> > > 
> > > Yes, there can be pre-existing *clean* folios (and clean
> > > partially up to date folios) in the page cache, but we won't
> > > have dirty partially up to date pages in the middle of the range
> > > we are scanning. Hence we only need to care about the edge cases
> > > (folios that overlap start and ends). We skip the partially
> > > written start block, and we always punch up to the end block if
> > > it is different from the last block we punched up to. If the end
> > > of the data spans into a dirty folio, we know that dirty range
> > > is up to date because the seek scan only returns ranges that are
> > > up to date. Hence we don't punch those partial blocks out....
> > > 
> > > Regardless, let's assume we have a racing write that has
> > > partially updated and dirtied a folio (because we've moved to
> > > XFS_IOLOCK_SHARED locking for writes). This case is already
> > > handled by the mapping_seek_hole_data() based iteration.
> > > 
> > > That is, the mapping_seek_hole_data() loop provides us with
> > > *discrete ranges of up to date data* that are independent of
> > > folio size, up-to-date range granularity, dirty range tracking,
> > > filesystem block size, etc.
> > > 
> > > Hence if the next discrete range we discover is in the same
> > > dirty folio as the previous discrete range of up to date data,
> > > we know we have a sub-folio sized hole in the data that is not
> > > up to date.  Because there is no data over this range, we have
> > > to punch out the underlying delalloc extent over that range. 
> > > 
> > > IOWs, the dirty state of the folio and/or the granularity of the
> > > dirty range tracking is irrelevant here - we know there was no
> > > data in the cache (clean or dirty) over this range because it is
> > > discontiguous with the previous range of data returned.
> > > 
> > > IOWs, if we have this "up to date" map on a dirty folio like
> > > this:
> > > 
> > > Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> > > 
> > > Then the unrolled iteration and punching we do would look like
> > > this:
> > > 
> > > First iteration of the range:
> > > 
> > > punch_start:	V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > 
> > > SEEK_DATA:		V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > SEEK_HOLE:			^
> > > Data range:		+UUUUUUU+
> > > Punch range:	+-------+
> > > Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> > > 
> > > Second iteration:
> > > 
> > > punch_start			V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > SEEK_DATA:				V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > SEEK_HOLE:					^
> > > Data range:				+UUUUUUU+
> > > Punch range:			+-------+
> > > Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
> > > 
> > > Third iteration:
> > > 
> > > punch_start					V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > SEEK_DATA: - moves into next folio in cache
> > > ....
> > > Punch range:				+-------+ ......
> > > Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
> > > 		(to end of scan range or start of next data)
> > > 
> > > As you can see, this scan does not care about folio size,
> > > sub-folio range granularity or filesystem block sizes.  It also
> > > matches exactly how writeback handles dirty, partially up to
> > > date folios, so there's no stray delalloc blocks left around to
> > > be tripped over after failed or short writes occur.
> > > 
> > > Indeed, if we move to sub-folio dirty range tracking, we can
> > > simply add a mapping_seek_hole_data() variant that walks dirty
> > > ranges in the page cache rather than up to date ranges. Then we
> > > can remove the inner loop from this code that looks up folios to
> > > determine dirty state. The above algorithm does not change - we
> > > just walk from discrete range to discrete range punching the
> > > gaps between them....
> > > 
> > > IOWs, the algorithm is largely future proof - the only thing
> > > that needs to change if we change iomap to track sub-folio dirty
> > > ranges is how we check the data range for being dirty. That
> > > should be no surprise, really, the surprise should be that we
> > > can make some simple mods to page cache seek to remove the need
> > > for checking dirty state in this code altogether....
> > > 
> > > > That sort of state can be created by a prior read of the range
> > > > on a sub-page block size fs, or perhaps a racing async
> > > > readahead (via read fault of a lower offset..?), etc.
> > > 
> > > Yup, generic/346 exercises this racing unaligned, sub-folio mmap
> > > write vs write() case. This test, specifically, was the reason I
> > > moved to using mapping_seek_hole_data() - g/346 found an endless
> > > stream of bugs in the sub-multi-page-folio range iteration code
> > > I kept trying to write....
> > > 
> > > > I suspect this is not a serious error because the page is
> > > > dirty and writeback will thus convert the block. The only
> > > > exception to that I can see is if the block is beyond EOF
> > > > (consider a mapped read to a page that straddles EOF, followed
> > > > by a post-eof write that fails), writeback won't actually map
> > > > the block directly.
> > > 
> > > I don't think that can happen. iomap_write_failed() calls
> > > truncate_pagecache_range() to remove any newly instantiated
> > > cached data beyond the original EOF. Hence the delalloc punch
> > > will remove everything beyond the original EOF that was
> > > allocated for the failed write. Hence when we get to writeback
> > > we're not going to find any up-to-date data beyond the EOF block
> > > in the page cache, nor any stray delalloc blocks way beyond
> > > EOF....
> > > 
> > 
> > It can happen if the page straddles eof. I don't think much of the
> > above relates to the behavior I'm describing. This doesn't require
> > racing writes or theoretical shared write locking; the IOMAP_F_NEW
> > flag and core iteration algorithm are not major factors, etc.
> > 
> > It's easier to just use examples. Consider the following sequence
> > on a bsize=1k fs on a kernel with this patch series. Note that my
> > xfs_io is hacked up to turn 'pwrite -f' into a "fail" flag that
> > triggers the -EFAULT error check in iomap_write_iter() by passing
> > a bad user buffer:
> > 
> > # set eof and make sure entire page is uptodate (even posteof)
> > $XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k" $file
> > # do a delalloc and open/close cycle to set dirty release
> > $XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file
> > # fail an appending write to a posteof, discontig, uptodate range on
> > # already dirty page
> > $XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c "fiemap -v" $file
> > 
> Firstly, can you tell me which patchset you reproduced this on? Just
> a vanilla kernel, the version of the patchset in this thread or the
> latest one I sent out?
> 

This series.

> Secondly, can you please send a patch with that failed write
> functionality upstream for xfs_io - it is clearly useful.
> 

Sure. It trivially passes a NULL buffer, but I can post an RFC for now
at least.
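
For reference, the guts of it are no more than this - a standalone
stand-in for the hacked up 'pwrite -f', not the xfs_io patch itself:

/* cc -o badwrite badwrite.c && ./badwrite <file> <offset> <len> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int	fd;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <file> <offset> <len>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * NULL is never a valid user buffer, so the copy-in during the
	 * buffered write fails with -EFAULT after the delalloc mapping
	 * has already been set up.
	 */
	if (pwrite(fd, NULL, strtoul(argv[3], NULL, 0),
		   strtoll(argv[2], NULL, 0)) < 0)
		perror("pwrite");	/* reports "Bad address" */

	close(fd);
	return 0;
}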

> > wrote 1/1 bytes at offset 0
> > 1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec)
> > pwrite: Bad address
> > /mnt/file:
> >  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> >    0: [0..1]:          22..23               2   0x0
> >    1: [2..3]:          hole                 2
> >    2: [4..5]:          0..1                 2   0x7
> > 
> > So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block
> > there that hasn't been written to.
> 
> Ok, so this is failing before even iomap_write_begin() is called,
> so it's a failed write that has not touched page cache state, but
> has allocated a delalloc block beyond EOF.
> 

Yes.

> This is a case that writeback handles because writeback is supposed
> to skip pages and blocks beyond EOF. i.e. iomap_writepage_map() is
> passed the current EOF to terminate the writeback mapping loop when
> the folio being written back spans EOF. Hence iomap_writepage_map()
> never walks past EOF and so does not ever iterate filesystem blocks
> that contain up to date data in the page cache beyond EOF. That's
> why writeback leaves that stray delalloc block behind, even though
> it's covered by up to date data.
> 
> IOWs, it appears that the problem here is that the page cache
> considers data in the folio beyond EOF to be up to date and contain
> data, whilst most filesystems do not consider data beyond EOF to be
> valid. This means there is a boundary condition that
> mapping_seek_hole_data() doesn't handle correctly - it completely
> ignores EOF and so SEEK_DATA for an offset past EOF will report data
> beyond EOF if the file has been mmap()d and the last folio faulted
> in.
> 
> This specific boundary condition behaviour is hidden by
> iomap_seek_data() and shmem_file_llseek() by not allowing seeks to
> start and/or end beyond EOF. Hence they never call
> mapping_seek_hole_data() with a start byte beyond EOF and so will
> never try to find data in the page cache beyond EOF.
> 
> That looks easy enough to fix - just consider any data range
> returned that is beyond i_size_read() to require punching, and then
> the semantics match writeback ignoring anything beyond EOF....
> 
> > I call this a landmine because it's an
> > unexpected/untracked instance of post-eof delalloc. I.e., post-eof block
> > reclamation won't pick it up because the inode was never tagged for speculative
> > preallocation:
> 
> Of course - this wasn't allocated by speculative preallocation -
> that doesn't kick in until the file is >64kB in length, so
> xfs_bmapi_reserve_delalloc() doesn't set the flag saying there's
> post-eof preallocation to trim away on the inode.
> 
> > > > It may convert if contiguous with delalloc blocks inside EOF (and
> > > > sufficiently sized physical extents exist), or even if not, should
> > > > still otherwise be cleaned up by the various other means we
> > > > already have to manage post-eof blocks.
> > > >
> > > > So IOW there's a tradeoff being made here for possible spurious
> > > > allocation and I/O and a subtle dependency on writeback that
> > > > should probably be documented somewhere.
> > > 
> > > As per above, I don't think there is any spurious/stale allocation
> > > left behind by the punch code, nor is there any dependency on
> > > writeback to ignore such issues.
> > > 
> > 
> > The same sort of example as above (but within eof) demonstrates the
> > writeback dependency. I.e., with this series alone the punch code leaves
> > behind an in-eof uptodate delalloc block, but writeback comes along and
> > maps it regardless because it also only has uptodate granularity within
> > the dirty folio.
> 
> This "unflushable delalloc extent" case can't happen within EOF. The
> page fault initialises the folio full of zeroes, so when we get the
> failed write, the only change is that we now have a single delalloc
> block within EOF that writeback will iterate over, allocate and
> write all the up to date zeros to it.
> 
> Sure, this error path might cause a small amount of extra IO over
> the dirty folio, but it does not cause any other sort of issue.
> There are no stray delalloc blocks, there are no data corruption
> issues, there are no stale data exposure issues, etc. Leaving a
> delalloc block under up-to-date page cache data should not cause
> a user visible issue.
> 

This is why I suggested a comment might be sufficient.

> > When writeback changes to only map dirty sub-ranges, clearly that no
> > longer happens and the landmine turns into a stale delalloc leak. This
> > is also easy to reproduce with the last dirty range RFC patches pulled
> > on top of this series.
> 
> No, not even then. If we have sub-folio dirty ranges, we know this
> failed write range is *still clean* and so we'll punch it out,
> regardless of whether it is inside or outside EOF.
> 

At this point I suspect we're talking past each other; based on
previous discussion, I think we're on the same page wrt what actually
needs to happen and why.

> > It may be possible to address that by doing more aggressive punching
> > past the i_size boundary here, but that comes with other tradeoffs.
> 
> Tradeoffs such as .... ?
> 

When I wrote that I was thinking about the potential for odd
interactions with speculative preallocation. For example, consider the
case where the write that allocates speculative prealloc happens to
fail, how xfs_can_free_eofblocks() is implemented, etc.

I'm not explicitly saying it's some catastrophic problem, corruption
vector, etc., or anything like that. I'm just saying it's replacing one
potential wonky interaction with another, and thus introduces different
considerations for things that probably don't have great test coverage
to begin with.

I suppose it might make some sense for the iomap mechanism to invoke the
callback for post-eof ranges, but then have the fs handle that
particular case specially if appropriate. So for example if an XFS inode
sees a post-eof punch and is tagged with post-eof prealloc, it might
make more sense to either punch the whole of it out or just leave it
around to be cleaned up later vs. just punch out a subset of it. But I
don't have a strong opinion either way and still no objection to simply
documenting the quirk for the time being.
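
Roughly what I had in mind - purely hypothetical, the XFS_IEOFBLOCKS
test is just one way the fs callback could special-case a post-eof
punch, and it assumes the byte-range xfs_bmap_punch_delalloc_range()
from patch 6/9:

static int
xfs_buffered_write_delalloc_punch(
	struct inode		*inode,
	loff_t			start_byte,
	loff_t			end_byte)
{
	struct xfs_inode	*ip = XFS_I(inode);

	/*
	 * A post-EOF range handed to us by the failed-write punch: if
	 * the inode is tagged for post-EOF preallocation, leave the
	 * whole lot for the normal eofblocks reclaim rather than
	 * punching out a subset of it here.
	 */
	if (start_byte >= i_size_read(inode) &&
	    xfs_iflags_test(ip, XFS_IEOFBLOCKS))
		return 0;

	return xfs_bmap_punch_delalloc_range(ip, start_byte, end_byte);
}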

> > For example, if the target folio of a failed write was already
> > dirty, and the target (sub-folio) range was previously an Uptodate
> > hole over which the failing write may have performed delayed
> > allocation, we have no choice but to skip the punch for that
> > sub-range of the folio because we can't tell whether sub-folio
> > Uptodate ranges were actually written to or not. This means that
> > delalloc block allocations may persist for writes that never
> > succeeded.
> 
> I can add something like that, though I'd need to make sure it is
> clear that this is not a data corruption vector, just a side effect
> of a highly unlikely set of corner case conditions we only care
> about handling without corruption, not efficiency or performance.
> 

I couldn't think of any such issues for upstream XFS, at least. One
reason is that delalloc blocks convert to unwritten extents by default,
so anything converted that wasn't directly targeted with writes should
remain in that state. Another fallback is that iomap explicitly zeroes
cache pages for post-eof blocks on reads, regardless of extent type. Of
course those were just initial observations and don't necessarily mean
problems don't exist. :)
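
For the second point, the behaviour I'm referring to amounts to
something like this - not the actual iomap read path, just the
effect: the part of a folio past EOF is zero-filled in the cache on
read rather than filled from disk, whatever the extent type is:

static void zero_post_eof_tail(struct folio *folio, loff_t isize)
{
	/* Only a folio that straddles EOF has a tail to zero */
	if (folio_pos(folio) < isize &&
	    folio_pos(folio) + folio_size(folio) > isize)
		folio_zero_segment(folio, offset_in_folio(folio, isize),
				   folio_size(folio));
}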

FWIW, the more subtle reason I commented on these quirks, though it
isn't so relevant upstream, is that if this work is going to get backported to
older/stable/distro kernels it might be possible for this to land in
kernels without one or both of those previously mentioned behaviors (at
which point there very well might be a vector for stale data exposure or
a subtle feature dependency, depending on context, what the final patch
series looks like, etc.).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-11-23 17:25 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-15  1:30 [PATCH v2 0/9] xfs, iomap: fix data corrupton due to stale cached iomaps Dave Chinner
2022-11-15  1:30 ` [PATCH 1/9] mm: export mapping_seek_hole_data() Dave Chinner
2022-11-15  8:40   ` Christoph Hellwig
2022-11-15  1:30 ` [PATCH 2/9] xfs: write page faults in iomap are not buffered writes Dave Chinner
2022-11-15  1:30 ` [PATCH 3/9] xfs: punching delalloc extents on write failure is racy Dave Chinner
2022-11-15  8:41   ` Christoph Hellwig
2022-11-15 23:53   ` Darrick J. Wong
2022-11-15  1:30 ` [PATCH 4/9] xfs: use byte ranges for write cleanup ranges Dave Chinner
2022-11-15  8:42   ` Christoph Hellwig
2022-11-15 23:57   ` Darrick J. Wong
2022-11-15  1:30 ` [PATCH 5/9] xfs: buffered write failure should not truncate the page cache Dave Chinner
2022-11-15  8:43   ` Christoph Hellwig
2022-11-16  0:48   ` Darrick J. Wong
2022-11-17  1:06     ` Dave Chinner
2022-11-16 13:57   ` Brian Foster
2022-11-17  0:41     ` Dave Chinner
2022-11-17 18:28       ` Darrick J. Wong
2022-11-18 17:20       ` Brian Foster
2022-11-21 23:13         ` Dave Chinner
2022-11-23 17:25           ` Brian Foster
2022-11-15  1:30 ` [PATCH 6/9] xfs: xfs_bmap_punch_delalloc_range() should take a byte range Dave Chinner
2022-11-15  8:44   ` Christoph Hellwig
2022-11-15 23:48     ` Darrick J. Wong
2022-11-16  0:57       ` Dave Chinner
2022-11-16  5:46         ` Christoph Hellwig
2022-11-15  1:30 ` [PATCH 7/9] iomap: write iomap validity checks Dave Chinner
2022-11-15  8:45   ` Christoph Hellwig
2022-11-15  1:30 ` [PATCH 8/9] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
2022-11-15  8:49   ` Christoph Hellwig
2022-11-15 23:26     ` Darrick J. Wong
2022-11-16  0:10     ` Dave Chinner
2022-11-15  1:30 ` [PATCH 9/9] xfs: drop write error injection is unfixable, remove it Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).