[PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof
@ 2024-04-25 13:13 Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 1/9] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

Changes since v4:
 - For zeroing range in xfs, move the delalloc check to before searching
   the COW fork when zeroing range. Only modify patch 04, please see it
   for details, not modify other patches.

Changes since v3:
 - Improve some git message comments and do some minor code cleanup, no
   logic changes.

Changes since v2:
 - Merge the patch for dropping of xfs_convert_blocks() and the patch
   for modifying xfs_bmapi_convert_delalloc().
 - Reword the commit message of the second patch.

Changes since v1:
 - Make xfs_bmapi_convert_delalloc() to allocate the target offset and
   drop the writeback helper xfs_convert_blocks().
 - Don't use xfs_iomap_write_direct() to convert delalloc blocks for
   zeroing posteof case, use xfs_bmapi_convert_delalloc() instead.
 - Fix two off-by-one issues when converting delalloc blocks.
 - Add a separate patch to drop the buffered write failure handle in
   zeroing and unsharing.
 - Add a comments do emphasize updating i_size should under folio lock.
 - Make iomap_write_end() to return a boolean, and do some cleanups in
   buffered write begin path.

This patch series fix a problem of exposing zeroed data on xfs since the
non-atomic clone operation. This problem was found while I was
developing ext4 buffered IO iomap conversion (ext4 is relying on this
fix [1]), the root cause of this problem and the discussion about the
solution please see [2]. After fix the problem, iomap_zero_range()
doesn't need to update i_size so that ext4 can use it to zero partial
block, e.g. truncate eof block [3].

[1] https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-ext4/9b0040ef-3d9d-6246-4bdd-82b9a8f55fa2@huaweicloud.com/
[3] https://lore.kernel.org/linux-ext4/9c9f1831-a772-299b-072b-1c8116c3fb35@huaweicloud.com/

Thanks,
Yi.

Zhang Yi (9):
  xfs: match lock mode in xfs_buffered_write_iomap_begin()
  xfs: make the seq argument to xfs_bmapi_convert_delalloc() optional
  xfs: make xfs_bmapi_convert_delalloc() to allocate the target offset
  xfs: convert delayed extents to unwritten when zeroing post eof blocks
  iomap: drop the write failure handles when unsharing and zeroing
  iomap: don't increase i_size if it's not a write operation
  iomap: use a new variable to handle the written bytes in
    iomap_write_iter()
  iomap: make iomap_write_end() return a boolean
  iomap: do some small logical cleanup in buffered write

 fs/iomap/buffered-io.c   | 105 ++++++++++++++++++++++-----------------
 fs/xfs/libxfs/xfs_bmap.c |  40 +++++++++++++--
 fs/xfs/xfs_aops.c        |  54 ++++++--------------
 fs/xfs/xfs_iomap.c       |  39 +++++++++++++--
 4 files changed, 144 insertions(+), 94 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v5 1/9] xfs: match lock mode in xfs_buffered_write_iomap_begin()
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 2/9] xfs: make the seq argument to xfs_bmapi_convert_delalloc() optional Zhang Yi
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Commit 1aa91d9c9933 ("xfs: Add async buffered write support") replace
xfs_ilock(XFS_ILOCK_EXCL) with xfs_ilock_for_iomap() when locking the
writing inode, and a new variable lockmode is used to indicate the lock
mode. Although the lockmode should always be XFS_ILOCK_EXCL, it's still
better to use this variable instead of useing XFS_ILOCK_EXCL directly
when unlocking the inode.

Fixes: 1aa91d9c9933 ("xfs: Add async buffered write support")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 4087af7f3c9f..236ee78aa75b 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1158,13 +1158,13 @@ xfs_buffered_write_iomap_begin(
 	 * them out if the write happens to fail.
 	 */
 	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
 
 found_imap:
 	seq = xfs_iomap_inode_sequence(ip, 0);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 
 found_cow:
@@ -1174,17 +1174,17 @@ xfs_buffered_write_iomap_begin(
 		if (error)
 			goto out_unlock;
 		seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		xfs_iunlock(ip, lockmode);
 		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
 					 IOMAP_F_SHARED, seq);
 	}
 
 	xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
 
 out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	return error;
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 2/9] xfs: make the seq argument to xfs_bmapi_convert_delalloc() optional
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 1/9] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 3/9] xfs: make xfs_bmapi_convert_delalloc() to allocate the target offset Zhang Yi
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Allow callers to pass a NULLL seq argument if they don't care about
the fork sequence number.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..9783ddb337d2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4657,7 +4657,8 @@ xfs_bmapi_convert_delalloc(
 	if (!isnullstartblock(bma.got.br_startblock)) {
 		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags,
 				xfs_iomap_inode_sequence(ip, flags));
-		*seq = READ_ONCE(ifp->if_seq);
+		if (seq)
+			*seq = READ_ONCE(ifp->if_seq);
 		goto out_trans_cancel;
 	}
 
@@ -4708,7 +4709,8 @@ xfs_bmapi_convert_delalloc(
 	ASSERT(!isnullstartblock(bma.got.br_startblock));
 	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags,
 				xfs_iomap_inode_sequence(ip, flags));
-	*seq = READ_ONCE(ifp->if_seq);
+	if (seq)
+		*seq = READ_ONCE(ifp->if_seq);
 
 	if (whichfork == XFS_COW_FORK)
 		xfs_refcount_alloc_cow_extent(tp, bma.blkno, bma.length);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 3/9] xfs: make xfs_bmapi_convert_delalloc() to allocate the target offset
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 1/9] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 2/9] xfs: make the seq argument to xfs_bmapi_convert_delalloc() optional Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks Zhang Yi
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Since xfs_bmapi_convert_delalloc() only attempts to allocate the entire
delalloc extent and require multiple invocations to allocate the target
offset. So xfs_convert_blocks() add a loop to do this job and we call it
in the write back path, but xfs_convert_blocks() isn't a common helper.
Let's do it in xfs_bmapi_convert_delalloc() and drop
xfs_convert_blocks(), preparing for the post EOF delalloc blocks
converting in the buffered write begin path.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c | 34 +++++++++++++++++++++++--
 fs/xfs/xfs_aops.c        | 54 +++++++++++-----------------------------
 2 files changed, 46 insertions(+), 42 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9783ddb337d2..22757ad9ac4c 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4599,8 +4599,8 @@ xfs_bmapi_write(
  * invocations to allocate the target offset if a large enough physical extent
  * is not available.
  */
-int
-xfs_bmapi_convert_delalloc(
+static int
+xfs_bmapi_convert_one_delalloc(
 	struct xfs_inode	*ip,
 	int			whichfork,
 	xfs_off_t		offset,
@@ -4733,6 +4733,36 @@ xfs_bmapi_convert_delalloc(
 	return error;
 }
 
+/*
+ * Pass in a dellalloc extent and convert it to real extents, return the real
+ * extent that maps offset_fsb in iomap.
+ */
+int
+xfs_bmapi_convert_delalloc(
+	struct xfs_inode	*ip,
+	int			whichfork,
+	loff_t			offset,
+	struct iomap		*iomap,
+	unsigned int		*seq)
+{
+	int			error;
+
+	/*
+	 * Attempt to allocate whatever delalloc extent currently backs offset
+	 * and put the result into iomap.  Allocate in a loop because it may
+	 * take several attempts to allocate real blocks for a contiguous
+	 * delalloc extent if free space is sufficiently fragmented.
+	 */
+	do {
+		error = xfs_bmapi_convert_one_delalloc(ip, whichfork, offset,
+					iomap, seq);
+		if (error)
+			return error;
+	} while (iomap->offset + iomap->length <= offset);
+
+	return 0;
+}
+
 int
 xfs_bmapi_remap(
 	struct xfs_trans	*tp,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3f428620ebf2..f7520f375cee 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -233,45 +233,6 @@ xfs_imap_valid(
 	return true;
 }
 
-/*
- * Pass in a dellalloc extent and convert it to real extents, return the real
- * extent that maps offset_fsb in wpc->iomap.
- *
- * The current page is held locked so nothing could have removed the block
- * backing offset_fsb, although it could have moved from the COW to the data
- * fork by another thread.
- */
-static int
-xfs_convert_blocks(
-	struct iomap_writepage_ctx *wpc,
-	struct xfs_inode	*ip,
-	int			whichfork,
-	loff_t			offset)
-{
-	int			error;
-	unsigned		*seq;
-
-	if (whichfork == XFS_COW_FORK)
-		seq = &XFS_WPC(wpc)->cow_seq;
-	else
-		seq = &XFS_WPC(wpc)->data_seq;
-
-	/*
-	 * Attempt to allocate whatever delalloc extent currently backs offset
-	 * and put the result into wpc->iomap.  Allocate in a loop because it
-	 * may take several attempts to allocate real blocks for a contiguous
-	 * delalloc extent if free space is sufficiently fragmented.
-	 */
-	do {
-		error = xfs_bmapi_convert_delalloc(ip, whichfork, offset,
-				&wpc->iomap, seq);
-		if (error)
-			return error;
-	} while (wpc->iomap.offset + wpc->iomap.length <= offset);
-
-	return 0;
-}
-
 static int
 xfs_map_blocks(
 	struct iomap_writepage_ctx *wpc,
@@ -290,6 +251,7 @@ xfs_map_blocks(
 	struct xfs_iext_cursor	icur;
 	int			retries = 0;
 	int			error = 0;
+	unsigned int		*seq;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -387,7 +349,19 @@ xfs_map_blocks(
 	trace_xfs_map_blocks_found(ip, offset, count, whichfork, &imap);
 	return 0;
 allocate_blocks:
-	error = xfs_convert_blocks(wpc, ip, whichfork, offset);
+	/*
+	 * Convert a dellalloc extent to a real one. The current page is held
+	 * locked so nothing could have removed the block backing offset_fsb,
+	 * although it could have moved from the COW to the data fork by another
+	 * thread.
+	 */
+	if (whichfork == XFS_COW_FORK)
+		seq = &XFS_WPC(wpc)->cow_seq;
+	else
+		seq = &XFS_WPC(wpc)->data_seq;
+
+	error = xfs_bmapi_convert_delalloc(ip, whichfork, offset,
+				&wpc->iomap, seq);
 	if (error) {
 		/*
 		 * If we failed to find the extent in the COW fork we might have
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
                   ` (2 preceding siblings ...)
  2024-04-25 13:13 ` [PATCH v5 3/9] xfs: make xfs_bmapi_convert_delalloc() to allocate the target offset Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  2024-04-25 18:29   ` Darrick J. Wong
  2024-04-25 13:13 ` [PATCH v5 5/9] iomap: drop the write failure handles when unsharing and zeroing Zhang Yi
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Current clone operation could be non-atomic if the destination of a file
is beyond EOF, user could get a file with corrupted (zeroed) data on
crash.

The problem is about preallocations. If you write some data into a file:

	[A...B)

and XFS decides to preallocate some post-eof blocks, then it can create
a delayed allocation reservation:

	[A.........D)

The writeback path tries to convert delayed extents to real ones by
allocating blocks. If there aren't enough contiguous free space, we can
end up with two extents, the first real and the second still delalloc:

	[A....C)[C.D)

After that, both the in-memory and the on-disk file sizes are still B.
If we clone into the range [E...F) from another file:

	[A....C)[C.D)      [E...F)

then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
extent, its pagecache will be zeroed and both the in-memory and on-disk
size will be updated to D after flushing but before cloning. This is
wrong, because the user can see the size change and read the zeroes
while the clone operation is ongoing.

We need to keep the in-memory and on-disk size before the clone
operation starts, so instead of writing zeroes through the page cache
for delayed ranges beyond EOF, we convert these ranges to unwritten and
invalidate any cached data over that range beyond EOF.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
Changes since v4:

Move the delalloc converting hunk before searching the COW fork. Because
if the file has been reflinked and copied on write,
xfs_bmap_extsize_align() aligned the range of COW delalloc extent, after
the writeback, there might be some unwritten extents left over in the
COW fork that overlaps the delalloc extent we found in data fork.

  data fork  ...wwww|dddddddddd...
  cow fork          |uuuuuuuuuu...
                    ^
                  i_size

In my v4, we search the COW fork before checking the delalloc extent,
goto found_cow tag and return unconverted delalloc srcmap in the above
case, so the delayed extent in the data fork will have no chance to
convert to unwritten, it will lead to delalloc extent residue and break
generic/522 after merging patch 6.

 fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 236ee78aa75b..2857ef1b0272 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1022,6 +1022,24 @@ xfs_buffered_write_iomap_begin(
 		goto out_unlock;
 	}
 
+	/*
+	 * For zeroing, trim a delalloc extent that extends beyond the EOF
+	 * block.  If it starts beyond the EOF block, convert it to an
+	 * unwritten extent.
+	 */
+	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
+	    isnullstartblock(imap.br_startblock)) {
+		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
+
+		if (offset_fsb >= eof_fsb)
+			goto convert_delay;
+		if (end_fsb > eof_fsb) {
+			end_fsb = eof_fsb;
+			xfs_trim_extent(&imap, offset_fsb,
+					end_fsb - offset_fsb);
+		}
+	}
+
 	/*
 	 * Search the COW fork extent list even if we did not find a data fork
 	 * extent.  This serves two purposes: first this implements the
@@ -1167,6 +1185,17 @@ xfs_buffered_write_iomap_begin(
 	xfs_iunlock(ip, lockmode);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 
+convert_delay:
+	xfs_iunlock(ip, lockmode);
+	truncate_pagecache(inode, offset);
+	error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
+					   iomap, NULL);
+	if (error)
+		return error;
+
+	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
+	return 0;
+
 found_cow:
 	seq = xfs_iomap_inode_sequence(ip, 0);
 	if (imap.br_startoff <= offset_fsb) {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 5/9] iomap: drop the write failure handles when unsharing and zeroing
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
                   ` (3 preceding siblings ...)
  2024-04-25 13:13 ` [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 6/9] iomap: don't increase i_size if it's not a write operation Zhang Yi
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Unsharing and zeroing can only happen within EOF, so there is never a
need to perform posteof pagecache truncation if write begin fails, also
partial write could never theoretically happened from iomap_write_end(),
so remove both of them.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..433eaae39966 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -824,7 +824,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 
 out_unlock:
 	__iomap_put_folio(iter, pos, 0, folio);
-	iomap_write_failed(iter->inode, pos, len);
 
 	return status;
 }
@@ -901,8 +900,6 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 
 	if (old_size < pos)
 		pagecache_isize_extended(iter->inode, old_size, pos);
-	if (ret < len)
-		iomap_write_failed(iter->inode, pos + ret, len - ret);
 	return ret;
 }
 
@@ -950,8 +947,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		}
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (unlikely(status))
+		if (unlikely(status)) {
+			iomap_write_failed(iter->inode, pos, bytes);
 			break;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -965,6 +964,9 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
 		status = iomap_write_end(iter, pos, bytes, copied, folio);
 
+		if (status < bytes)
+			iomap_write_failed(iter->inode, pos + status,
+					   bytes - status);
 		if (unlikely(copied != status))
 			iov_iter_revert(i, copied - status);
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 6/9] iomap: don't increase i_size if it's not a write operation
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
                   ` (4 preceding siblings ...)
  2024-04-25 13:13 ` [PATCH v5 5/9] iomap: drop the write failure handles when unsharing and zeroing Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 7/9] iomap: use a new variable to handle the written bytes in iomap_write_iter() Zhang Yi
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Increase i_size in iomap_zero_range() and iomap_unshare_iter() is not
needed, the caller should handle it. Especially, when truncate partial
block, we should not increase i_size beyond the new EOF here. It doesn't
affect xfs and gfs2 now because they set the new file size after zero
out, it doesn't matter that a transient increase in i_size, but it will
affect ext4 because it set file size before truncate. So move the i_size
updating logic to iomap_write_iter().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 50 +++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 433eaae39966..63d94189e568 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -875,32 +875,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	loff_t old_size = iter->inode->i_size;
-	size_t ret;
-
-	if (srcmap->type == IOMAP_INLINE) {
-		ret = iomap_write_end_inline(iter, folio, pos, copied);
-	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
-		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
-				copied, &folio->page, NULL);
-	} else {
-		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
-	}
-
-	/*
-	 * Update the in-memory inode size after copying the data into the page
-	 * cache.  It's up to the file system to write the updated size to disk,
-	 * preferably after I/O completion so that no stale data is exposed.
-	 */
-	if (pos + ret > old_size) {
-		i_size_write(iter->inode, pos + ret);
-		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
-	}
-	__iomap_put_folio(iter, pos, ret, folio);
 
-	if (old_size < pos)
-		pagecache_isize_extended(iter->inode, old_size, pos);
-	return ret;
+	if (srcmap->type == IOMAP_INLINE)
+		return iomap_write_end_inline(iter, folio, pos, copied);
+	if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
+		return block_write_end(NULL, iter->inode->i_mapping, pos, len,
+				       copied, &folio->page, NULL);
+	return __iomap_write_end(iter->inode, pos, len, copied, folio);
 }
 
 static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
@@ -915,6 +896,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 
 	do {
 		struct folio *folio;
+		loff_t old_size;
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
@@ -964,6 +946,22 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
 		status = iomap_write_end(iter, pos, bytes, copied, folio);
 
+		/*
+		 * Update the in-memory inode size after copying the data into
+		 * the page cache.  It's up to the file system to write the
+		 * updated size to disk, preferably after I/O completion so that
+		 * no stale data is exposed.  Only once that's done can we
+		 * unlock and release the folio.
+		 */
+		old_size = iter->inode->i_size;
+		if (pos + status > old_size) {
+			i_size_write(iter->inode, pos + status);
+			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
+		}
+		__iomap_put_folio(iter, pos, status, folio);
+
+		if (old_size < pos)
+			pagecache_isize_extended(iter->inode, old_size, pos);
 		if (status < bytes)
 			iomap_write_failed(iter->inode, pos + status,
 					   bytes - status);
@@ -1336,6 +1334,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 			bytes = folio_size(folio) - offset;
 
 		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+		__iomap_put_folio(iter, pos, bytes, folio);
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
@@ -1400,6 +1399,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		folio_mark_accessed(folio);
 
 		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+		__iomap_put_folio(iter, pos, bytes, folio);
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 7/9] iomap: use a new variable to handle the written bytes in iomap_write_iter()
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
                   ` (5 preceding siblings ...)
  2024-04-25 13:13 ` [PATCH v5 6/9] iomap: don't increase i_size if it's not a write operation Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 8/9] iomap: make iomap_write_end() return a boolean Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 9/9] iomap: do some small logical cleanup in buffered write Zhang Yi
  8 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

In iomap_write_iter(), the status variable used to receive the return
value from iomap_write_end() is confusing, replace it with a new written
variable to represent the written bytes in each cycle, no logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 63d94189e568..854fa3d4b1c8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -889,7 +889,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 	loff_t length = iomap_length(iter);
 	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
 	loff_t pos = iter->pos;
-	ssize_t written = 0;
+	ssize_t total_written = 0;
 	long status = 0;
 	struct address_space *mapping = iter->inode->i_mapping;
 	unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
@@ -900,6 +900,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
+		size_t written;		/* Bytes have been written */
 
 		bytes = iov_iter_count(i);
 retry:
@@ -944,7 +945,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			flush_dcache_folio(folio);
 
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
-		status = iomap_write_end(iter, pos, bytes, copied, folio);
+		written = iomap_write_end(iter, pos, bytes, copied, folio);
 
 		/*
 		 * Update the in-memory inode size after copying the data into
@@ -954,22 +955,22 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		 * unlock and release the folio.
 		 */
 		old_size = iter->inode->i_size;
-		if (pos + status > old_size) {
-			i_size_write(iter->inode, pos + status);
+		if (pos + written > old_size) {
+			i_size_write(iter->inode, pos + written);
 			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
 		}
-		__iomap_put_folio(iter, pos, status, folio);
+		__iomap_put_folio(iter, pos, written, folio);
 
 		if (old_size < pos)
 			pagecache_isize_extended(iter->inode, old_size, pos);
-		if (status < bytes)
-			iomap_write_failed(iter->inode, pos + status,
-					   bytes - status);
-		if (unlikely(copied != status))
-			iov_iter_revert(i, copied - status);
+		if (written < bytes)
+			iomap_write_failed(iter->inode, pos + written,
+					   bytes - written);
+		if (unlikely(copied != written))
+			iov_iter_revert(i, copied - written);
 
 		cond_resched();
-		if (unlikely(status == 0)) {
+		if (unlikely(written == 0)) {
 			/*
 			 * A short copy made iomap_write_end() reject the
 			 * thing entirely.  Might be memory poisoning
@@ -983,17 +984,17 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 				goto retry;
 			}
 		} else {
-			pos += status;
-			written += status;
-			length -= status;
+			pos += written;
+			total_written += written;
+			length -= written;
 		}
 	} while (iov_iter_count(i) && length);
 
 	if (status == -EAGAIN) {
-		iov_iter_revert(i, written);
+		iov_iter_revert(i, total_written);
 		return -EAGAIN;
 	}
-	return written ? written : status;
+	return total_written ? total_written : status;
 }
 
 ssize_t
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 8/9] iomap: make iomap_write_end() return a boolean
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
                   ` (6 preceding siblings ...)
  2024-04-25 13:13 ` [PATCH v5 7/9] iomap: use a new variable to handle the written bytes in iomap_write_iter() Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  2024-04-25 13:13 ` [PATCH v5 9/9] iomap: do some small logical cleanup in buffered write Zhang Yi
  8 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

For now, we can make sure iomap_write_end() always return 0 or copied
bytes, so instead of return written bytes, convert to return a boolean
to indicate the copied bytes have been written to the pagecache.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 48 +++++++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 17 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 854fa3d4b1c8..176a9ea502ba 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -828,7 +828,7 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 	return status;
 }
 
-static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+static bool __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
 	flush_dcache_folio(folio);
@@ -845,14 +845,14 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 * redo the whole thing.
 	 */
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
-		return 0;
+		return false;
 	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
 	iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
-	return copied;
+	return true;
 }
 
-static size_t iomap_write_end_inline(const struct iomap_iter *iter,
+static void iomap_write_end_inline(const struct iomap_iter *iter,
 		struct folio *folio, loff_t pos, size_t copied)
 {
 	const struct iomap *iomap = &iter->iomap;
@@ -867,20 +867,31 @@ static size_t iomap_write_end_inline(const struct iomap_iter *iter,
 	kunmap_local(addr);
 
 	mark_inode_dirty(iter->inode);
-	return copied;
 }
 
-/* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
-static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
+/*
+ * Returns true if all copied bytes have been written to the pagecache,
+ * otherwise return false.
+ */
+static bool iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 
-	if (srcmap->type == IOMAP_INLINE)
-		return iomap_write_end_inline(iter, folio, pos, copied);
-	if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
-		return block_write_end(NULL, iter->inode->i_mapping, pos, len,
-				       copied, &folio->page, NULL);
+	if (srcmap->type == IOMAP_INLINE) {
+		iomap_write_end_inline(iter, folio, pos, copied);
+		return true;
+	}
+
+	if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
+		size_t bh_written;
+
+		bh_written = block_write_end(NULL, iter->inode->i_mapping, pos,
+					len, copied, &folio->page, NULL);
+		WARN_ON_ONCE(bh_written != copied && bh_written != 0);
+		return bh_written == copied;
+	}
+
 	return __iomap_write_end(iter->inode, pos, len, copied, folio);
 }
 
@@ -945,7 +956,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			flush_dcache_folio(folio);
 
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
-		written = iomap_write_end(iter, pos, bytes, copied, folio);
+		written = iomap_write_end(iter, pos, bytes, copied, folio) ?
+			  copied : 0;
 
 		/*
 		 * Update the in-memory inode size after copying the data into
@@ -1323,6 +1335,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		int status;
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
+		bool ret;
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (unlikely(status))
@@ -1334,9 +1347,9 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
-		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
 		__iomap_put_folio(iter, pos, bytes, folio);
-		if (WARN_ON_ONCE(bytes == 0))
+		if (WARN_ON_ONCE(!ret))
 			return -EIO;
 
 		cond_resched();
@@ -1385,6 +1398,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		int status;
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
+		bool ret;
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (status)
@@ -1399,9 +1413,9 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
-		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
 		__iomap_put_folio(iter, pos, bytes, folio);
-		if (WARN_ON_ONCE(bytes == 0))
+		if (WARN_ON_ONCE(!ret))
 			return -EIO;
 
 		pos += bytes;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 9/9] iomap: do some small logical cleanup in buffered write
  2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
                   ` (7 preceding siblings ...)
  2024-04-25 13:13 ` [PATCH v5 8/9] iomap: make iomap_write_end() return a boolean Zhang Yi
@ 2024-04-25 13:13 ` Zhang Yi
  8 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-25 13:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, tytso,
	jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Since iomap_write_end() can never return a partial write length, the
comparison between written, copied and bytes becomes useless, just
merge them with the unwritten branch.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 176a9ea502ba..0926d216a5af 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -975,11 +975,6 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 
 		if (old_size < pos)
 			pagecache_isize_extended(iter->inode, old_size, pos);
-		if (written < bytes)
-			iomap_write_failed(iter->inode, pos + written,
-					   bytes - written);
-		if (unlikely(copied != written))
-			iov_iter_revert(i, copied - written);
 
 		cond_resched();
 		if (unlikely(written == 0)) {
@@ -989,6 +984,9 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			 * halfway through, might be a race with munmap,
 			 * might be severe memory pressure.
 			 */
+			iomap_write_failed(iter->inode, pos, bytes);
+			iov_iter_revert(i, copied);
+
 			if (chunk > PAGE_SIZE)
 				chunk /= 2;
 			if (copied) {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-25 13:13 ` [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks Zhang Yi
@ 2024-04-25 18:29   ` Darrick J. Wong
  2024-04-26  6:24     ` Zhang Yi
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2024-04-25 18:29 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	chandanbabu, tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Thu, Apr 25, 2024 at 09:13:30PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Current clone operation could be non-atomic if the destination of a file
> is beyond EOF, user could get a file with corrupted (zeroed) data on
> crash.
> 
> The problem is about preallocations. If you write some data into a file:
> 
> 	[A...B)
> 
> and XFS decides to preallocate some post-eof blocks, then it can create
> a delayed allocation reservation:
> 
> 	[A.........D)
> 
> The writeback path tries to convert delayed extents to real ones by
> allocating blocks. If there aren't enough contiguous free space, we can
> end up with two extents, the first real and the second still delalloc:
> 
> 	[A....C)[C.D)
> 
> After that, both the in-memory and the on-disk file sizes are still B.
> If we clone into the range [E...F) from another file:
> 
> 	[A....C)[C.D)      [E...F)
> 
> then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
> range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
> extent, its pagecache will be zeroed and both the in-memory and on-disk
> size will be updated to D after flushing but before cloning. This is
> wrong, because the user can see the size change and read the zeroes
> while the clone operation is ongoing.
> 
> We need to keep the in-memory and on-disk size before the clone
> operation starts, so instead of writing zeroes through the page cache
> for delayed ranges beyond EOF, we convert these ranges to unwritten and
> invalidate any cached data over that range beyond EOF.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> Changes since v4:
> 
> Move the delalloc converting hunk before searching the COW fork. Because
> if the file has been reflinked and copied on write,
> xfs_bmap_extsize_align() aligned the range of COW delalloc extent, after
> the writeback, there might be some unwritten extents left over in the
> COW fork that overlaps the delalloc extent we found in data fork.
> 
>   data fork  ...wwww|dddddddddd...
>   cow fork          |uuuuuuuuuu...
>                     ^
>                   i_size
> 
> In my v4, we search the COW fork before checking the delalloc extent,
> goto found_cow tag and return unconverted delalloc srcmap in the above
> case, so the delayed extent in the data fork will have no chance to
> convert to unwritten, it will lead to delalloc extent residue and break
> generic/522 after merging patch 6.

Hmmm.  I suppose that works, but it feels a little funny to convert the
delalloc mapping in the data fork to unwritten /while/ there's unwritten
extents in the cow fork too.  Would it make more sense to remap the cow
fork extents here?

OTOH unwritten extents in the cow fork get changed to written ones by
all the cow remapping functions.  Soooo maybe we don't want to go
digging /that/ deep into the system.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 236ee78aa75b..2857ef1b0272 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1022,6 +1022,24 @@ xfs_buffered_write_iomap_begin(
>  		goto out_unlock;
>  	}
>  
> +	/*
> +	 * For zeroing, trim a delalloc extent that extends beyond the EOF
> +	 * block.  If it starts beyond the EOF block, convert it to an
> +	 * unwritten extent.
> +	 */
> +	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
> +	    isnullstartblock(imap.br_startblock)) {
> +		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> +
> +		if (offset_fsb >= eof_fsb)
> +			goto convert_delay;
> +		if (end_fsb > eof_fsb) {
> +			end_fsb = eof_fsb;
> +			xfs_trim_extent(&imap, offset_fsb,
> +					end_fsb - offset_fsb);
> +		}
> +	}
> +
>  	/*
>  	 * Search the COW fork extent list even if we did not find a data fork
>  	 * extent.  This serves two purposes: first this implements the
> @@ -1167,6 +1185,17 @@ xfs_buffered_write_iomap_begin(
>  	xfs_iunlock(ip, lockmode);
>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>  
> +convert_delay:
> +	xfs_iunlock(ip, lockmode);
> +	truncate_pagecache(inode, offset);
> +	error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
> +					   iomap, NULL);
> +	if (error)
> +		return error;
> +
> +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
> +	return 0;
> +
>  found_cow:
>  	seq = xfs_iomap_inode_sequence(ip, 0);
>  	if (imap.br_startoff <= offset_fsb) {
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-25 18:29   ` Darrick J. Wong
@ 2024-04-26  6:24     ` Zhang Yi
  2024-04-26  6:33       ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Zhang Yi @ 2024-04-26  6:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	chandanbabu, tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/4/26 2:29, Darrick J. Wong wrote:
> On Thu, Apr 25, 2024 at 09:13:30PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Current clone operation could be non-atomic if the destination of a file
>> is beyond EOF, user could get a file with corrupted (zeroed) data on
>> crash.
>>
>> The problem is about preallocations. If you write some data into a file:
>>
>> 	[A...B)
>>
>> and XFS decides to preallocate some post-eof blocks, then it can create
>> a delayed allocation reservation:
>>
>> 	[A.........D)
>>
>> The writeback path tries to convert delayed extents to real ones by
>> allocating blocks. If there aren't enough contiguous free space, we can
>> end up with two extents, the first real and the second still delalloc:
>>
>> 	[A....C)[C.D)
>>
>> After that, both the in-memory and the on-disk file sizes are still B.
>> If we clone into the range [E...F) from another file:
>>
>> 	[A....C)[C.D)      [E...F)
>>
>> then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
>> range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
>> extent, its pagecache will be zeroed and both the in-memory and on-disk
>> size will be updated to D after flushing but before cloning. This is
>> wrong, because the user can see the size change and read the zeroes
>> while the clone operation is ongoing.
>>
>> We need to keep the in-memory and on-disk size before the clone
>> operation starts, so instead of writing zeroes through the page cache
>> for delayed ranges beyond EOF, we convert these ranges to unwritten and
>> invalidate any cached data over that range beyond EOF.
>>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> Changes since v4:
>>
>> Move the delalloc converting hunk before searching the COW fork. Because
>> if the file has been reflinked and copied on write,
>> xfs_bmap_extsize_align() aligned the range of COW delalloc extent, after
>> the writeback, there might be some unwritten extents left over in the
>> COW fork that overlaps the delalloc extent we found in data fork.
>>
>>   data fork  ...wwww|dddddddddd...
>>   cow fork          |uuuuuuuuuu...
>>                     ^
>>                   i_size
>>
>> In my v4, we search the COW fork before checking the delalloc extent,
>> goto found_cow tag and return unconverted delalloc srcmap in the above
>> case, so the delayed extent in the data fork will have no chance to
>> convert to unwritten, it will lead to delalloc extent residue and break
>> generic/522 after merging patch 6.
> 
> Hmmm.  I suppose that works, but it feels a little funny to convert the
> delalloc mapping in the data fork to unwritten /while/ there's unwritten
> extents in the cow fork too.  Would it make more sense to remap the cow
> fork extents here?
> 

Yeah, it looks more reasonable. But from the original scene, the
xfs_bmap_extsize_align() aligned the new extent that added to the cow fork
could overlaps the unreflinked range, IIUC, I guess that spare range is
useless exactly, is there any situation that would use it?

> OTOH unwritten extents in the cow fork get changed to written ones by
> all the cow remapping functions.  Soooo maybe we don't want to go
> digging /that/ deep into the system.
> 

Yeah, I think it's okay now unless there's some strong claims.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
>>
>>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>>  1 file changed, 29 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 236ee78aa75b..2857ef1b0272 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1022,6 +1022,24 @@ xfs_buffered_write_iomap_begin(
>>  		goto out_unlock;
>>  	}
>>  
>> +	/*
>> +	 * For zeroing, trim a delalloc extent that extends beyond the EOF
>> +	 * block.  If it starts beyond the EOF block, convert it to an
>> +	 * unwritten extent.
>> +	 */
>> +	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
>> +	    isnullstartblock(imap.br_startblock)) {
>> +		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
>> +
>> +		if (offset_fsb >= eof_fsb)
>> +			goto convert_delay;
>> +		if (end_fsb > eof_fsb) {
>> +			end_fsb = eof_fsb;
>> +			xfs_trim_extent(&imap, offset_fsb,
>> +					end_fsb - offset_fsb);
>> +		}
>> +	}
>> +
>>  	/*
>>  	 * Search the COW fork extent list even if we did not find a data fork
>>  	 * extent.  This serves two purposes: first this implements the
>> @@ -1167,6 +1185,17 @@ xfs_buffered_write_iomap_begin(
>>  	xfs_iunlock(ip, lockmode);
>>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>>  
>> +convert_delay:
>> +	xfs_iunlock(ip, lockmode);
>> +	truncate_pagecache(inode, offset);
>> +	error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
>> +					   iomap, NULL);
>> +	if (error)
>> +		return error;
>> +
>> +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
>> +	return 0;
>> +
>>  found_cow:
>>  	seq = xfs_iomap_inode_sequence(ip, 0);
>>  	if (imap.br_startoff <= offset_fsb) {
>> -- 
>> 2.39.2
>>
>>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-26  6:24     ` Zhang Yi
@ 2024-04-26  6:33       ` Christoph Hellwig
  2024-04-26  7:18         ` Zhang Yi
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2024-04-26  6:33 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel, hch,
	brauner, david, chandanbabu, tytso, jack, yi.zhang, chengzhihao1,
	yukuai3

On Fri, Apr 26, 2024 at 02:24:19PM +0800, Zhang Yi wrote:
> Yeah, it looks more reasonable. But from the original scene, the
> xfs_bmap_extsize_align() aligned the new extent that added to the cow fork
> could overlaps the unreflinked range, IIUC, I guess that spare range is
> useless exactly, is there any situation that would use it?

I've just started staring at this (again) half an hour ago, and I fail
to understand the (pre-existing) logic in xfs_reflink_zero_posteof.

We obviously need to ensure data between i_size and the end of the
block that i_size sits in is zeroed (but IIRC we already do that
in write and truncate anyway).  But what is the point of zeroing
any speculative preallocation beyond the last block that actually
contains data?  Just truncating the preallocation and freeing
the delalloc and unwritten blocks seems like it would be way
more efficient.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-26  6:33       ` Christoph Hellwig
@ 2024-04-26  7:18         ` Zhang Yi
  2024-04-27  6:59           ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Zhang Yi @ 2024-04-26  7:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel, brauner,
	david, chandanbabu, tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/4/26 14:33, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 02:24:19PM +0800, Zhang Yi wrote:
>> Yeah, it looks more reasonable. But from the original scene, the
>> xfs_bmap_extsize_align() aligned the new extent that added to the cow fork
>> could overlaps the unreflinked range, IIUC, I guess that spare range is
>> useless exactly, is there any situation that would use it?
> 
> I've just started staring at this (again) half an hour ago, and I fail
> to understand the (pre-existing) logic in xfs_reflink_zero_posteof.
> 
> We obviously need to ensure data between i_size and the end of the
> block that i_size sits in is zeroed (but IIRC we already do that
> in write and truncate anyway).  But what is the point of zeroing
> any speculative preallocation beyond the last block that actually
> contains data?  Just truncating the preallocation and freeing
> the delalloc and unwritten blocks seems like it would be way
> more efficient.
> 

I've had the same idea before, I asked Dave and he explained that Linux
could leak data beyond EOF page for some cases, e.g. mmap() can write to
the EOF page beyond EOF without failing, and the data in that EOF page
could be non-zeroed by mmap(), so the zeroing is still needed now.

OTOH, if we free the delalloc and unwritten blocks beyond EOF blocks, he
said it could lead to some performance problems and make thinks
complicated to deal with the trimming of EOF block. Please see [1]
for details and maybe Dave could explain more.

[1] https://lore.kernel.org/linux-xfs/ZeERAob9Imwh01bG@dread.disaster.area/

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-26  7:18         ` Zhang Yi
@ 2024-04-27  6:59           ` Christoph Hellwig
  2024-04-28  3:26             ` Zhang Yi
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2024-04-27  6:59 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel, brauner, david, chandanbabu, tytso, jack, yi.zhang,
	chengzhihao1, yukuai3

On Fri, Apr 26, 2024 at 03:18:17PM +0800, Zhang Yi wrote:
> I've had the same idea before, I asked Dave and he explained that Linux
> could leak data beyond EOF page for some cases, e.g. mmap() can write to
> the EOF page beyond EOF without failing, and the data in that EOF page
> could be non-zeroed by mmap(), so the zeroing is still needed now.
> 
> OTOH, if we free the delalloc and unwritten blocks beyond EOF blocks, he
> said it could lead to some performance problems and make thinks
> complicated to deal with the trimming of EOF block. Please see [1]
> for details and maybe Dave could explain more.

Oh well.  Given that we're full in on the speculative allocations
we might as well deal with it.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-27  6:59           ` Christoph Hellwig
@ 2024-04-28  3:26             ` Zhang Yi
  2024-04-29  4:41               ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Zhang Yi @ 2024-04-28  3:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel, brauner,
	david, chandanbabu, tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/4/27 14:59, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 03:18:17PM +0800, Zhang Yi wrote:
>> I've had the same idea before, I asked Dave and he explained that Linux
>> could leak data beyond EOF page for some cases, e.g. mmap() can write to
>> the EOF page beyond EOF without failing, and the data in that EOF page
>> could be non-zeroed by mmap(), so the zeroing is still needed now.
>>
>> OTOH, if we free the delalloc and unwritten blocks beyond EOF blocks, he
>> said it could lead to some performance problems and make thinks
>> complicated to deal with the trimming of EOF block. Please see [1]
>> for details and maybe Dave could explain more.
> 
> Oh well.  Given that we're full in on the speculative allocations
> we might as well deal with it.
> 

Let me confirm, so you also think the preallocations in the COW fork that
overlaps the unreflinked range is useless, we should avoid allocating
this range, is that right? If so, I suppose we can do this improvement in
another patch(set), this one works fine now.

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-28  3:26             ` Zhang Yi
@ 2024-04-29  4:41               ` Christoph Hellwig
  2024-04-29  7:11                 ` Zhang Yi
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2024-04-29  4:41 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel, brauner, david, chandanbabu, tytso, jack, yi.zhang,
	chengzhihao1, yukuai3

On Sun, Apr 28, 2024 at 11:26:00AM +0800, Zhang Yi wrote:
> > 
> > Oh well.  Given that we're full in on the speculative allocations
> > we might as well deal with it.
> > 
> 
> Let me confirm, so you also think the preallocations in the COW fork that
> overlaps the unreflinked range is useless, we should avoid allocating
> this range, is that right? If so, I suppose we can do this improvement in
> another patch(set), this one works fine now.

Well, not stop allocating it, but not actually convert it to a real
allocation when we're just truncating it and replacing the blocks with
reflinked blocks.

But yes, this is a bigger project.

For now for this patch to go ahead:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-04-29  4:41               ` Christoph Hellwig
@ 2024-04-29  7:11                 ` Zhang Yi
  0 siblings, 0 replies; 18+ messages in thread
From: Zhang Yi @ 2024-04-29  7:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel, brauner,
	david, chandanbabu, tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/4/29 12:41, Christoph Hellwig wrote:
> On Sun, Apr 28, 2024 at 11:26:00AM +0800, Zhang Yi wrote:
>>>
>>> Oh well.  Given that we're full in on the speculative allocations
>>> we might as well deal with it.
>>>
>>
>> Let me confirm, so you also think the preallocations in the COW fork that
>> overlaps the unreflinked range is useless, we should avoid allocating
>> this range, is that right? If so, I suppose we can do this improvement in
>> another patch(set), this one works fine now.
> 
> Well, not stop allocating it, but not actually convert it to a real
> allocation when we're just truncating it and replacing the blocks with
> reflinked blocks.

OK, it looks fine, too.

Thanks,
Yi.

> 
> But yes, this is a bigger project.>
> For now for this patch to go ahead:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-04-29  7:11 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-25 13:13 [PATCH v5 0/9] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
2024-04-25 13:13 ` [PATCH v5 1/9] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
2024-04-25 13:13 ` [PATCH v5 2/9] xfs: make the seq argument to xfs_bmapi_convert_delalloc() optional Zhang Yi
2024-04-25 13:13 ` [PATCH v5 3/9] xfs: make xfs_bmapi_convert_delalloc() to allocate the target offset Zhang Yi
2024-04-25 13:13 ` [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks Zhang Yi
2024-04-25 18:29   ` Darrick J. Wong
2024-04-26  6:24     ` Zhang Yi
2024-04-26  6:33       ` Christoph Hellwig
2024-04-26  7:18         ` Zhang Yi
2024-04-27  6:59           ` Christoph Hellwig
2024-04-28  3:26             ` Zhang Yi
2024-04-29  4:41               ` Christoph Hellwig
2024-04-29  7:11                 ` Zhang Yi
2024-04-25 13:13 ` [PATCH v5 5/9] iomap: drop the write failure handles when unsharing and zeroing Zhang Yi
2024-04-25 13:13 ` [PATCH v5 6/9] iomap: don't increase i_size if it's not a write operation Zhang Yi
2024-04-25 13:13 ` [PATCH v5 7/9] iomap: use a new variable to handle the written bytes in iomap_write_iter() Zhang Yi
2024-04-25 13:13 ` [PATCH v5 8/9] iomap: make iomap_write_end() return a boolean Zhang Yi
2024-04-25 13:13 ` [PATCH v5 9/9] iomap: do some small logical cleanup in buffered write Zhang Yi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).