All of lore.kernel.org
* [PATCH 0/4] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof
@ 2024-03-11 12:22 Zhang Yi
  2024-03-11 12:22 ` [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Zhang Yi @ 2024-03-11 12:22 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, tytso, jack, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

This patch series fixes a problem on xfs where zeroed data can be
exposed because the clone operation is not atomic. I found this problem
while developing the ext4 buffered IO iomap conversion (ext4 relies on
this fix [1]); see [2] for the root cause and the discussion of the
solution. After fixing the problem, iomap_zero_range() no longer needs
to update i_size, so ext4 can use it to zero a partial block, e.g. when
truncating the EOF block [3].

[1] https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-ext4/9b0040ef-3d9d-6246-4bdd-82b9a8f55fa2@huaweicloud.com/
[3] https://lore.kernel.org/linux-ext4/9c9f1831-a772-299b-072b-1c8116c3fb35@huaweicloud.com/

Thanks,
Yi.

Zhang Yi (4):
  xfs: match lock mode in xfs_buffered_write_iomap_begin()
  xfs: convert delayed extents to unwritten when zeroing post eof blocks
  iomap: don't increase i_size if it's not a write operation
  iomap: cleanup iomap_write_iter()

 fs/iomap/buffered-io.c | 78 +++++++++++++++++++++---------------------
 fs/xfs/xfs_iomap.c     | 39 ++++++++++++++++++---
 2 files changed, 73 insertions(+), 44 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin()
  2024-03-11 12:22 [PATCH 0/4] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
@ 2024-03-11 12:22 ` Zhang Yi
  2024-03-11 15:34   ` Darrick J. Wong
  2024-03-12 12:16   ` Christoph Hellwig
  2024-03-11 12:22 ` [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks Zhang Yi
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 25+ messages in thread
From: Zhang Yi @ 2024-03-11 12:22 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, tytso, jack, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Commit 1aa91d9c9933 ("xfs: Add async buffered write support") replaced
xfs_ilock(XFS_ILOCK_EXCL) with xfs_ilock_for_iomap() when locking the
inode for writing, and a new variable, lockmode, is used to indicate
the lock mode. Although lockmode should always be XFS_ILOCK_EXCL, it's
still better to use this variable instead of using XFS_ILOCK_EXCL
directly when unlocking the inode.

Fixes: 1aa91d9c9933 ("xfs: Add async buffered write support")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_iomap.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..ccf83e72d8ca 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1149,13 +1149,13 @@ xfs_buffered_write_iomap_begin(
 	 * them out if the write happens to fail.
 	 */
 	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
 
 found_imap:
 	seq = xfs_iomap_inode_sequence(ip, 0);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 
 found_cow:
@@ -1165,17 +1165,17 @@ xfs_buffered_write_iomap_begin(
 		if (error)
 			goto out_unlock;
 		seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		xfs_iunlock(ip, lockmode);
 		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
 					 IOMAP_F_SHARED, seq);
 	}
 
 	xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
 
 out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	return error;
 }
 
-- 
2.39.2



* [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-11 12:22 [PATCH 0/4] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
  2024-03-11 12:22 ` [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
@ 2024-03-11 12:22 ` Zhang Yi
  2024-03-11 15:37   ` Darrick J. Wong
  2024-03-11 12:22 ` [PATCH 3/4] iomap: don't increase i_size if it's not a write operation Zhang Yi
  2024-03-11 12:22 ` [PATCH 4/4] iomap: cleanup iomap_write_iter() Zhang Yi
  3 siblings, 1 reply; 25+ messages in thread
From: Zhang Yi @ 2024-03-11 12:22 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, tytso, jack, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

The current clone operation could be non-atomic if the destination
range of a file is beyond EOF, so a user could get a file with
corrupted (zeroed) data after a crash.

The problem is caused by preallocations. Suppose you write some data
into a file at [A, B) (where the position letters increase in order),
and xfs preallocates some blocks, so we get a delayed extent [A, D).
Then the writeback path allocates blocks and converts only [A, C) of
this delayed extent because there are not enough contiguous physical
blocks, so the extent [C, D) is still delayed. After that, both the
in-memory and the on-disk file size are B. If we then clone a file
range into [E, F) from another file, xfs_reflink_zero_posteof() calls
iomap_zero_range() to zero out the range [B, E) beyond EOF and flush
it. Since [C, D) is still a delayed extent, it will be zeroed, and the
file's in-memory and on-disk size will be updated to D after flushing
and before the clone operation is done. This is wrong, because the user
can see the size change and read zeros in the middle of the clone
operation.

We need to keep the in-memory and on-disk size unchanged before the
clone operation starts, so instead of writing zeroes through the page
cache for delayed ranges beyond EOF, convert these ranges to unwritten
and invalidate any cached data over that range.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ccf83e72d8ca..2b2aace25355 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin(
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
+	xfs_fileoff_t		eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip));
 	struct xfs_bmbt_irec	imap, cmap;
 	struct xfs_iext_cursor	icur, ccur;
 	xfs_fsblock_t		prealloc_blocks = 0;
@@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin(
 	}
 
 	if (imap.br_startoff <= offset_fsb) {
+		/*
+		 * For zeroing out delayed allocation extent, we trim it if
+		 * it's partial beyonds EOF block, or convert it to unwritten
+		 * extent if it's all beyonds EOF block.
+		 */
+		if ((flags & IOMAP_ZERO) &&
+		    isnullstartblock(imap.br_startblock)) {
+			if (offset_fsb > eof_fsb)
+				goto convert_delay;
+			if (end_fsb > eof_fsb) {
+				end_fsb = eof_fsb + 1;
+				xfs_trim_extent(&imap, offset_fsb,
+						end_fsb - offset_fsb);
+			}
+		}
+
 		/*
 		 * For reflink files we may need a delalloc reservation when
 		 * overwriting shared extents.   This includes zeroing of
@@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin(
 	xfs_iunlock(ip, lockmode);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 
+convert_delay:
+	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
+	xfs_iunlock(ip, lockmode);
+	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
+	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
+				       flags, &imap, &seq);
+	if (error)
+		return error;
+
+	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
+
 found_cow:
 	seq = xfs_iomap_inode_sequence(ip, 0);
 	if (imap.br_startoff <= offset_fsb) {
-- 
2.39.2



* [PATCH 3/4] iomap: don't increase i_size if it's not a write operation
  2024-03-11 12:22 [PATCH 0/4] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
  2024-03-11 12:22 ` [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
  2024-03-11 12:22 ` [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks Zhang Yi
@ 2024-03-11 12:22 ` Zhang Yi
  2024-03-11 15:48   ` Darrick J. Wong
  2024-03-11 12:22 ` [PATCH 4/4] iomap: cleanup iomap_write_iter() Zhang Yi
  3 siblings, 1 reply; 25+ messages in thread
From: Zhang Yi @ 2024-03-11 12:22 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, tytso, jack, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Increasing i_size in iomap_zero_range() and iomap_unshare_iter() is not
needed; the caller should handle it. In particular, when truncating a
partial block, we must not increase i_size beyond the new EOF here.
This doesn't affect xfs and gfs2 now because they set the new file size
after zeroing out, so a transient increase in i_size does not matter,
but it will affect ext4 because ext4 sets the file size before
truncating. At the same time, iomap_write_failed() is not needed for
the above two cases either, so factor these calls out and move them to
iomap_write_iter() and iomap_zero_iter().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/iomap/buffered-io.c | 59 +++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 29 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 093c4515b22a..19f91324c690 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -786,7 +786,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 
 out_unlock:
 	__iomap_put_folio(iter, pos, 0, folio);
-	iomap_write_failed(iter->inode, pos, len);
 
 	return status;
 }
@@ -838,34 +837,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	loff_t old_size = iter->inode->i_size;
-	size_t ret;
-
-	if (srcmap->type == IOMAP_INLINE) {
-		ret = iomap_write_end_inline(iter, folio, pos, copied);
-	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
-		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
-				copied, &folio->page, NULL);
-	} else {
-		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
-	}
 
-	/*
-	 * Update the in-memory inode size after copying the data into the page
-	 * cache.  It's up to the file system to write the updated size to disk,
-	 * preferably after I/O completion so that no stale data is exposed.
-	 */
-	if (pos + ret > old_size) {
-		i_size_write(iter->inode, pos + ret);
-		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
-	}
-	__iomap_put_folio(iter, pos, ret, folio);
-
-	if (old_size < pos)
-		pagecache_isize_extended(iter->inode, old_size, pos);
-	if (ret < len)
-		iomap_write_failed(iter->inode, pos + ret, len - ret);
-	return ret;
+	if (srcmap->type == IOMAP_INLINE)
+		return iomap_write_end_inline(iter, folio, pos, copied);
+	if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
+		return block_write_end(NULL, iter->inode->i_mapping, pos, len,
+				       copied, &folio->page, NULL);
+	return __iomap_write_end(iter->inode, pos, len, copied, folio);
 }
 
 static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
@@ -880,6 +858,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 
 	do {
 		struct folio *folio;
+		loff_t old_size;
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
@@ -912,8 +891,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		}
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (unlikely(status))
+		if (unlikely(status)) {
+			iomap_write_failed(iter->inode, pos, bytes);
 			break;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -927,6 +908,24 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
 		status = iomap_write_end(iter, pos, bytes, copied, folio);
 
+		/*
+		 * Update the in-memory inode size after copying the data into
+		 * the page cache.  It's up to the file system to write the
+		 * updated size to disk, preferably after I/O completion so that
+		 * no stale data is exposed.
+		 */
+		old_size = iter->inode->i_size;
+		if (pos + status > old_size) {
+			i_size_write(iter->inode, pos + status);
+			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
+		}
+		__iomap_put_folio(iter, pos, status, folio);
+
+		if (old_size < pos)
+			pagecache_isize_extended(iter->inode, old_size, pos);
+		if (status < bytes)
+			iomap_write_failed(iter->inode, pos + status,
+					   bytes - status);
 		if (unlikely(copied != status))
 			iov_iter_revert(i, copied - status);
 
@@ -1296,6 +1295,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 			bytes = folio_size(folio) - offset;
 
 		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+		__iomap_put_folio(iter, pos, bytes, folio);
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
@@ -1360,6 +1360,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		folio_mark_accessed(folio);
 
 		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+		__iomap_put_folio(iter, pos, bytes, folio);
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
-- 
2.39.2



* [PATCH 4/4] iomap: cleanup iomap_write_iter()
  2024-03-11 12:22 [PATCH 0/4] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
                   ` (2 preceding siblings ...)
  2024-03-11 12:22 ` [PATCH 3/4] iomap: don't increase i_size if it's not a write operation Zhang Yi
@ 2024-03-11 12:22 ` Zhang Yi
  2024-03-11 16:07   ` Darrick J. Wong
  3 siblings, 1 reply; 25+ messages in thread
From: Zhang Yi @ 2024-03-11 12:22 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, tytso, jack, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

The status variable in iomap_write_iter() is confusing because
iomap_write_end() always returns 0 or the copied bytes, so replace it
with a new written variable to represent the bytes written in each
cycle, and do some other cleanup. No logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/iomap/buffered-io.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 19f91324c690..767af6e67ed4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -851,7 +851,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 	loff_t length = iomap_length(iter);
 	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
 	loff_t pos = iter->pos;
-	ssize_t written = 0;
+	ssize_t total_written = 0;
 	long status = 0;
 	struct address_space *mapping = iter->inode->i_mapping;
 	unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
@@ -862,6 +862,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
+		size_t written;		/* Bytes have been written */
 
 		bytes = iov_iter_count(i);
 retry:
@@ -906,7 +907,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			flush_dcache_folio(folio);
 
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
-		status = iomap_write_end(iter, pos, bytes, copied, folio);
+		written = iomap_write_end(iter, pos, bytes, copied, folio);
 
 		/*
 		 * Update the in-memory inode size after copying the data into
@@ -915,28 +916,26 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		 * no stale data is exposed.
 		 */
 		old_size = iter->inode->i_size;
-		if (pos + status > old_size) {
-			i_size_write(iter->inode, pos + status);
+		if (pos + written > old_size) {
+			i_size_write(iter->inode, pos + written);
 			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
 		}
-		__iomap_put_folio(iter, pos, status, folio);
+		__iomap_put_folio(iter, pos, written, folio);
 
 		if (old_size < pos)
 			pagecache_isize_extended(iter->inode, old_size, pos);
-		if (status < bytes)
-			iomap_write_failed(iter->inode, pos + status,
-					   bytes - status);
-		if (unlikely(copied != status))
-			iov_iter_revert(i, copied - status);
 
 		cond_resched();
-		if (unlikely(status == 0)) {
+		if (unlikely(written == 0)) {
 			/*
 			 * A short copy made iomap_write_end() reject the
 			 * thing entirely.  Might be memory poisoning
 			 * halfway through, might be a race with munmap,
 			 * might be severe memory pressure.
 			 */
+			iomap_write_failed(iter->inode, pos, bytes);
+			iov_iter_revert(i, copied);
+
 			if (chunk > PAGE_SIZE)
 				chunk /= 2;
 			if (copied) {
@@ -944,17 +943,17 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 				goto retry;
 			}
 		} else {
-			pos += status;
-			written += status;
-			length -= status;
+			pos += written;
+			total_written += written;
+			length -= written;
 		}
 	} while (iov_iter_count(i) && length);
 
 	if (status == -EAGAIN) {
-		iov_iter_revert(i, written);
+		iov_iter_revert(i, total_written);
 		return -EAGAIN;
 	}
-	return written ? written : status;
+	return total_written ? total_written : status;
 }
 
 ssize_t
-- 
2.39.2



* Re: [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin()
  2024-03-11 12:22 ` [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
@ 2024-03-11 15:34   ` Darrick J. Wong
  2024-03-12  8:18     ` Zhang Yi
  2024-03-12 12:16   ` Christoph Hellwig
  1 sibling, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-11 15:34 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Mon, Mar 11, 2024 at 08:22:52PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Commit 1aa91d9c9933 ("xfs: Add async buffered write support") replaced
> xfs_ilock(XFS_ILOCK_EXCL) with xfs_ilock_for_iomap() when locking the
> inode for writing, and a new variable, lockmode, is used to indicate
> the lock mode. Although lockmode should always be XFS_ILOCK_EXCL, it's
> still better to use this variable instead of using XFS_ILOCK_EXCL
> directly when unlocking the inode.
> 
> Fixes: 1aa91d9c9933 ("xfs: Add async buffered write support")

AFAICT, xfs_ilock_for_iomap can change lockmode from SHARED->EXCL, but
never changed away from EXCL, right?  And xfs_buffered_write_iomap_begin
sets it to EXCL (and never changes it), right?

This seems like more of a code cleanup/logic bomb removal than an
actual defect that someone could hit, correct?

If the answers are {yes, yes, yes} then:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/xfs/xfs_iomap.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 18c8f168b153..ccf83e72d8ca 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1149,13 +1149,13 @@ xfs_buffered_write_iomap_begin(
>  	 * them out if the write happens to fail.
>  	 */
>  	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	xfs_iunlock(ip, lockmode);
>  	trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
>  
>  found_imap:
>  	seq = xfs_iomap_inode_sequence(ip, 0);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	xfs_iunlock(ip, lockmode);
>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>  
>  found_cow:
> @@ -1165,17 +1165,17 @@ xfs_buffered_write_iomap_begin(
>  		if (error)
>  			goto out_unlock;
>  		seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +		xfs_iunlock(ip, lockmode);
>  		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
>  					 IOMAP_F_SHARED, seq);
>  	}
>  
>  	xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	xfs_iunlock(ip, lockmode);
>  	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
>  
>  out_unlock:
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	xfs_iunlock(ip, lockmode);
>  	return error;
>  }
>  
> -- 
> 2.39.2
> 
> 


* Re: [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-11 12:22 ` [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks Zhang Yi
@ 2024-03-11 15:37   ` Darrick J. Wong
  2024-03-12 12:21     ` Christoph Hellwig
  2024-03-12 12:31     ` Zhang Yi
  0 siblings, 2 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-11 15:37 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The current clone operation could be non-atomic if the destination
> range of a file is beyond EOF, so a user could get a file with
> corrupted (zeroed) data after a crash.
> 
> The problem is caused by preallocations. Suppose you write some data
> into a file at [A, B) (where the position letters increase in order),
> and xfs preallocates some blocks, so we get a delayed extent [A, D).
> Then the writeback path allocates blocks and converts only [A, C) of
> this delayed extent because there are not enough contiguous physical
> blocks, so the extent [C, D) is still delayed. After that, both the
> in-memory and the on-disk file size are B. If we then clone a file
> range into [E, F) from another file, xfs_reflink_zero_posteof() calls
> iomap_zero_range() to zero out the range [B, E) beyond EOF and flush
> it. Since [C, D) is still a delayed extent, it will be zeroed, and the
> file's in-memory and on-disk size will be updated to D after flushing
> and before the clone operation is done. This is wrong, because the
> user can see the size change and read zeros in the middle of the clone
> operation.
> 
> We need to keep the in-memory and on-disk size unchanged before the
> clone operation starts, so instead of writing zeroes through the page
> cache for delayed ranges beyond EOF, convert these ranges to unwritten
> and invalidate any cached data over that range.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ccf83e72d8ca..2b2aace25355 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin(
>  	struct xfs_mount	*mp = ip->i_mount;
>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
> +	xfs_fileoff_t		eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip));
>  	struct xfs_bmbt_irec	imap, cmap;
>  	struct xfs_iext_cursor	icur, ccur;
>  	xfs_fsblock_t		prealloc_blocks = 0;
> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin(
>  	}
>  
>  	if (imap.br_startoff <= offset_fsb) {
> +		/*
> +		 * For zeroing out delayed allocation extent, we trim it if
> +		 * it's partial beyonds EOF block, or convert it to unwritten
> +		 * extent if it's all beyonds EOF block.
> +		 */
> +		if ((flags & IOMAP_ZERO) &&
> +		    isnullstartblock(imap.br_startblock)) {
> +			if (offset_fsb > eof_fsb)
> +				goto convert_delay;
> +			if (end_fsb > eof_fsb) {
> +				end_fsb = eof_fsb + 1;
> +				xfs_trim_extent(&imap, offset_fsb,
> +						end_fsb - offset_fsb);
> +			}
> +		}
> +
>  		/*
>  		 * For reflink files we may need a delalloc reservation when
>  		 * overwriting shared extents.   This includes zeroing of
> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin(
>  	xfs_iunlock(ip, lockmode);
>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>  
> +convert_delay:
> +	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
> +	xfs_iunlock(ip, lockmode);
> +	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
> +	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
> +				       flags, &imap, &seq);

I expected this to be a direct call to xfs_bmapi_convert_delalloc.
What was the reason not for using that?

--D

> +	if (error)
> +		return error;
> +
> +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
> +	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
> +
>  found_cow:
>  	seq = xfs_iomap_inode_sequence(ip, 0);
>  	if (imap.br_startoff <= offset_fsb) {
> -- 
> 2.39.2
> 
> 


* Re: [PATCH 3/4] iomap: don't increase i_size if it's not a write operation
  2024-03-11 12:22 ` [PATCH 3/4] iomap: don't increase i_size if it's not a write operation Zhang Yi
@ 2024-03-11 15:48   ` Darrick J. Wong
  2024-03-12 12:22     ` Christoph Hellwig
  2024-03-12 12:59     ` Zhang Yi
  0 siblings, 2 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-11 15:48 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Mon, Mar 11, 2024 at 08:22:54PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Increasing i_size in iomap_zero_range() and iomap_unshare_iter() is
> not needed; the caller should handle it. In particular, when
> truncating a partial block, we must not increase i_size beyond the new
> EOF here. This doesn't affect xfs and gfs2 now because they set the
> new file size after zeroing out, so a transient increase in i_size
> does not matter, but it will affect ext4 because ext4 sets the file
> size before truncating.

> At the same time, iomap_write_failed() is not needed for the above two
> cases either, so factor these calls out and move them to
> iomap_write_iter() and iomap_zero_iter().

This change should be a separate patch with its own justification.
Which is, AFAICT, something along the lines of:

"Unsharing and zeroing can only happen within EOF, so there is never a
need to perform posteof pagecache truncation if write begin fails."

> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Doesn't this patch fix a bug in ext4?

> ---
>  fs/iomap/buffered-io.c | 59 +++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 093c4515b22a..19f91324c690 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -786,7 +786,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>  
>  out_unlock:
>  	__iomap_put_folio(iter, pos, 0, folio);
> -	iomap_write_failed(iter->inode, pos, len);
>  
>  	return status;
>  }
> @@ -838,34 +837,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
>  		size_t copied, struct folio *folio)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> -	loff_t old_size = iter->inode->i_size;
> -	size_t ret;
> -
> -	if (srcmap->type == IOMAP_INLINE) {
> -		ret = iomap_write_end_inline(iter, folio, pos, copied);
> -	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> -		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
> -				copied, &folio->page, NULL);
> -	} else {
> -		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
> -	}
>  
> -	/*
> -	 * Update the in-memory inode size after copying the data into the page
> -	 * cache.  It's up to the file system to write the updated size to disk,
> -	 * preferably after I/O completion so that no stale data is exposed.
> -	 */
> -	if (pos + ret > old_size) {
> -		i_size_write(iter->inode, pos + ret);
> -		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> -	}
> -	__iomap_put_folio(iter, pos, ret, folio);
> -
> -	if (old_size < pos)
> -		pagecache_isize_extended(iter->inode, old_size, pos);
> -	if (ret < len)
> -		iomap_write_failed(iter->inode, pos + ret, len - ret);
> -	return ret;
> +	if (srcmap->type == IOMAP_INLINE)
> +		return iomap_write_end_inline(iter, folio, pos, copied);
> +	if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
> +		return block_write_end(NULL, iter->inode->i_mapping, pos, len,
> +				       copied, &folio->page, NULL);
> +	return __iomap_write_end(iter->inode, pos, len, copied, folio);
>  }
>  
>  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> @@ -880,6 +858,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  
>  	do {
>  		struct folio *folio;
> +		loff_t old_size;
>  		size_t offset;		/* Offset into folio */
>  		size_t bytes;		/* Bytes to write to folio */
>  		size_t copied;		/* Bytes copied from user */
> @@ -912,8 +891,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		}
>  
>  		status = iomap_write_begin(iter, pos, bytes, &folio);
> -		if (unlikely(status))
> +		if (unlikely(status)) {
> +			iomap_write_failed(iter->inode, pos, bytes);
>  			break;
> +		}
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> @@ -927,6 +908,24 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
>  		status = iomap_write_end(iter, pos, bytes, copied, folio);
>  
> +		/*
> +		 * Update the in-memory inode size after copying the data into
> +		 * the page cache.  It's up to the file system to write the
> +		 * updated size to disk, preferably after I/O completion so that
> +		 * no stale data is exposed.
> +		 */
> +		old_size = iter->inode->i_size;
> +		if (pos + status > old_size) {
> +			i_size_write(iter->inode, pos + status);
> +			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> +		}
> +		__iomap_put_folio(iter, pos, status, folio);

Why is it necessary to hoist the __iomap_put_folio calls from
iomap_write_end into iomap_write_iter, iomap_unshare_iter, and
iomap_zero_iter?  None of those functions seem to use it, and it makes
more sense to me that iomap_write_end releases the folio that
iomap_write_begin returned.

--D

> +
> +		if (old_size < pos)
> +			pagecache_isize_extended(iter->inode, old_size, pos);
> +		if (status < bytes)
> +			iomap_write_failed(iter->inode, pos + status,
> +					   bytes - status);
>  		if (unlikely(copied != status))
>  			iov_iter_revert(i, copied - status);
>  
> @@ -1296,6 +1295,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
>  			bytes = folio_size(folio) - offset;
>  
>  		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
> +		__iomap_put_folio(iter, pos, bytes, folio);
>  		if (WARN_ON_ONCE(bytes == 0))
>  			return -EIO;
>  
> @@ -1360,6 +1360,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		folio_mark_accessed(folio);
>  
>  		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
> +		__iomap_put_folio(iter, pos, bytes, folio);
>  		if (WARN_ON_ONCE(bytes == 0))
>  			return -EIO;
>  
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/4] iomap: cleanup iomap_write_iter()
  2024-03-11 12:22 ` [PATCH 4/4] iomap: cleanup iomap_write_iter() Zhang Yi
@ 2024-03-11 16:07   ` Darrick J. Wong
  2024-03-12 12:24     ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-11 16:07 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Mon, Mar 11, 2024 at 08:22:55PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The status variable in iomap_write_iter() is confusing, and
> iomap_write_end() always returns 0 or the copied byte count, so replace
> it with a new 'written' variable to represent the bytes written in each
> cycle; also do some cleanup. No logic changes.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/iomap/buffered-io.c | 31 +++++++++++++++----------------
>  1 file changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 19f91324c690..767af6e67ed4 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -851,7 +851,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  	loff_t length = iomap_length(iter);
>  	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
>  	loff_t pos = iter->pos;
> -	ssize_t written = 0;
> +	ssize_t total_written = 0;
>  	long status = 0;
>  	struct address_space *mapping = iter->inode->i_mapping;
>  	unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
> @@ -862,6 +862,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		size_t offset;		/* Offset into folio */
>  		size_t bytes;		/* Bytes to write to folio */
>  		size_t copied;		/* Bytes copied from user */
> +		size_t written;		/* Bytes have been written */
>  
>  		bytes = iov_iter_count(i);
>  retry:
> @@ -906,7 +907,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  			flush_dcache_folio(folio);
>  
>  		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> -		status = iomap_write_end(iter, pos, bytes, copied, folio);
> +		written = iomap_write_end(iter, pos, bytes, copied, folio);
>  
>  		/*
>  		 * Update the in-memory inode size after copying the data into
> @@ -915,28 +916,26 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		 * no stale data is exposed.
>  		 */
>  		old_size = iter->inode->i_size;
> -		if (pos + status > old_size) {
> -			i_size_write(iter->inode, pos + status);
> +		if (pos + written > old_size) {
> +			i_size_write(iter->inode, pos + written);
>  			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
>  		}
> -		__iomap_put_folio(iter, pos, status, folio);
> +		__iomap_put_folio(iter, pos, written, folio);
>  
>  		if (old_size < pos)
>  			pagecache_isize_extended(iter->inode, old_size, pos);
> -		if (status < bytes)
> -			iomap_write_failed(iter->inode, pos + status,
> -					   bytes - status);
> -		if (unlikely(copied != status))
> -			iov_iter_revert(i, copied - status);

I wish you'd made the variable renaming and the function reorganization
separate patches.  The renaming looks correct to me, but moving these
calls adds a logic bomb.

If at some point iomap_write_end actually starts returning partial write
completions (e.g. you wrote 250 bytes, but for some reason the pagecache
only acknowledges 100 bytes were written) then this code no longer
reverts the iter or truncates posteof pagecache correctly...

>  
>  		cond_resched();
> -		if (unlikely(status == 0)) {
> +		if (unlikely(written == 0)) {
>  			/*
>  			 * A short copy made iomap_write_end() reject the
>  			 * thing entirely.  Might be memory poisoning
>  			 * halfway through, might be a race with munmap,
>  			 * might be severe memory pressure.
>  			 */
> +			iomap_write_failed(iter->inode, pos, bytes);
> +			iov_iter_revert(i, copied);

...because now we only do that if the pagecache refuses to acknowledge
any bytes written at all.  I think it actually works correctly with
today's kernel since __iomap_write_end only returns copied or 0, but the
size_t return type implies that a short acknowledgement is theoretically
possible.

IOWs, doesn't this add a logic bomb?

--D

> +
>  			if (chunk > PAGE_SIZE)
>  				chunk /= 2;
>  			if (copied) {
> @@ -944,17 +943,17 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  				goto retry;
>  			}
>  		} else {
> -			pos += status;
> -			written += status;
> -			length -= status;
> +			pos += written;
> +			total_written += written;
> +			length -= written;
>  		}
>  	} while (iov_iter_count(i) && length);
>  
>  	if (status == -EAGAIN) {
> -		iov_iter_revert(i, written);
> +		iov_iter_revert(i, total_written);
>  		return -EAGAIN;
>  	}
> -	return written ? written : status;
> +	return total_written ? total_written : status;
>  }
>  
>  ssize_t
> -- 
> 2.39.2
> 
> 


* Re: [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin()
  2024-03-11 15:34   ` Darrick J. Wong
@ 2024-03-12  8:18     ` Zhang Yi
  0 siblings, 0 replies; 25+ messages in thread
From: Zhang Yi @ 2024-03-12  8:18 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/3/11 23:34, Darrick J. Wong wrote:
> On Mon, Mar 11, 2024 at 08:22:52PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Commit 1aa91d9c9933 ("xfs: Add async buffered write support") replaced
>> xfs_ilock(XFS_ILOCK_EXCL) with xfs_ilock_for_iomap() when locking the
>> writing inode, and a new variable lockmode is used to indicate the lock
>> mode. Although the lockmode should always be XFS_ILOCK_EXCL, it's still
>> better to use this variable instead of using XFS_ILOCK_EXCL directly
>> when unlocking the inode.
>>
>> Fixes: 1aa91d9c9933 ("xfs: Add async buffered write support")
> 
> AFAICT, xfs_ilock_for_iomap can change lockmode from SHARED->EXCL, but
> never changed away from EXCL, right?  

Yes.

> And xfs_buffered_write_iomap_begin
> sets it to EXCL (and never changes it), right?

Yes.

> 
> This seems like more of a code cleanup/logic bomb removal than an actual
> defect that someone could actually hit, correct?
> 

Yes, it's not a real problem.

> If the answers are {yes, yes, yes} then:
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
> 
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/xfs/xfs_iomap.c | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..ccf83e72d8ca 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1149,13 +1149,13 @@ xfs_buffered_write_iomap_begin(
>>  	 * them out if the write happens to fail.
>>  	 */
>>  	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
>> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +	xfs_iunlock(ip, lockmode);
>>  	trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
>>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
>>  
>>  found_imap:
>>  	seq = xfs_iomap_inode_sequence(ip, 0);
>> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +	xfs_iunlock(ip, lockmode);
>>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>>  
>>  found_cow:
>> @@ -1165,17 +1165,17 @@ xfs_buffered_write_iomap_begin(
>>  		if (error)
>>  			goto out_unlock;
>>  		seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
>> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +		xfs_iunlock(ip, lockmode);
>>  		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
>>  					 IOMAP_F_SHARED, seq);
>>  	}
>>  
>>  	xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
>> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +	xfs_iunlock(ip, lockmode);
>>  	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
>>  
>>  out_unlock:
>> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +	xfs_iunlock(ip, lockmode);
>>  	return error;
>>  }
>>  
>> -- 
>> 2.39.2
>>
>>



* Re: [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin()
  2024-03-11 12:22 ` [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
  2024-03-11 15:34   ` Darrick J. Wong
@ 2024-03-12 12:16   ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2024-03-12 12:16 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, tytso, jack, yi.zhang, chengzhihao1, yukuai3

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-11 15:37   ` Darrick J. Wong
@ 2024-03-12 12:21     ` Christoph Hellwig
  2024-03-12 12:44       ` Zhang Yi
  2024-03-12 12:31     ` Zhang Yi
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2024-03-12 12:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, hch, brauner,
	david, tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Mon, Mar 11, 2024 at 08:37:37AM -0700, Darrick J. Wong wrote:
> > +convert_delay:
> > +	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
> > +	xfs_iunlock(ip, lockmode);
> > +	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
> > +	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
> > +				       flags, &imap, &seq);
> 
> I expected this to be a direct call to xfs_bmapi_convert_delalloc.
> What was the reason not for using that?

Same here.  The fact that we even convert delalloc reservations in
xfs_iomap_write_direct is something that doesn't make much sense
given that we're punching out delalloc reservations before starting
direct I/O.



* Re: [PATCH 3/4] iomap: don't increase i_size if it's not a write operation
  2024-03-11 15:48   ` Darrick J. Wong
@ 2024-03-12 12:22     ` Christoph Hellwig
  2024-03-12 12:59     ` Zhang Yi
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2024-03-12 12:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, hch, brauner,
	david, tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Mon, Mar 11, 2024 at 08:48:29AM -0700, Darrick J. Wong wrote:
> This change should be a separate patch with its own justification.
> Which is, AFAICT, something along the lines of:
> 
> "Unsharing and zeroing can only happen within EOF, so there is never a
> need to perform posteof pagecache truncation if write begin fails."

Agreed.

> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Doesn't this patch fix a bug in ext4?

Only with the ext4 conversion to iomap for buffered I/O, which is not
merged yet.



* Re: [PATCH 4/4] iomap: cleanup iomap_write_iter()
  2024-03-11 16:07   ` Darrick J. Wong
@ 2024-03-12 12:24     ` Christoph Hellwig
  2024-03-12 16:27       ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2024-03-12 12:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, hch, brauner,
	david, tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Mon, Mar 11, 2024 at 09:07:39AM -0700, Darrick J. Wong wrote:
> If at some point iomap_write_end actually starts returning partial write
> completions (e.g. you wrote 250 bytes, but for some reason the pagecache
> only acknowledges 100 bytes were written) then this code no longer
> reverts the iter or truncates posteof pagecache correctly...

I don't think it makes sense to return a partial write from
iomap_write_end.  But to make that clear it really should not return
a byte count but a boolean.  I've been wanting to make that cleanup
for a while, but it would reach all the way into buffer.c.



* Re: [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-11 15:37   ` Darrick J. Wong
  2024-03-12 12:21     ` Christoph Hellwig
@ 2024-03-12 12:31     ` Zhang Yi
  2024-03-12 16:21       ` Darrick J. Wong
  1 sibling, 1 reply; 25+ messages in thread
From: Zhang Yi @ 2024-03-12 12:31 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/3/11 23:37, Darrick J. Wong wrote:
> On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The current clone operation could be non-atomic if the destination of
>> a file is beyond EOF; the user could get a file with corrupted
>> (zeroed) data on crash.
>>
>> The problem is about pre-allocations. If you write some data into a
>> file [A, B) (the position letters increase one by one), xfs could
>> pre-allocate some blocks, and we get a delayed extent [A, D). Then the
>> writeback path allocates blocks and converts part of this delayed
>> extent, [A, C), for lack of enough contiguous physical blocks, so the
>> extent [C, D) is still delayed. After that, both the in-memory and the
>> on-disk file size are B. If we clone a file range into [E, F) from
>> another file, xfs_reflink_zero_posteof() would call iomap_zero_range()
>> to zero out the range [B, E) beyond EOF and flush the range. Since
>> [C, D) is still a delayed extent, it will be zeroed and the file's
>> in-memory && on-disk size will be updated to D after flushing and
>> before doing the clone operation. This is wrong, because the user can
>> see the size change and read zeros in the middle of the clone
>> operation.
>>
>> We need to keep the in-memory and on-disk size unchanged before the
>> clone operation starts, so instead of writing zeroes through the page
>> cache for delayed ranges beyond EOF, we convert these ranges to
>> unwritten and invalidate any cached data over that range beyond EOF.
>>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>>  1 file changed, 29 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index ccf83e72d8ca..2b2aace25355 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin(
>>  	struct xfs_mount	*mp = ip->i_mount;
>>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
>> +	xfs_fileoff_t		eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip));
>>  	struct xfs_bmbt_irec	imap, cmap;
>>  	struct xfs_iext_cursor	icur, ccur;
>>  	xfs_fsblock_t		prealloc_blocks = 0;
>> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin(
>>  	}
>>  
>>  	if (imap.br_startoff <= offset_fsb) {
>> +		/*
>> +		 * For zeroing out a delayed allocation extent, we trim it if
>> +		 * it's partially beyond the EOF block, or convert it to an
>> +		 * unwritten extent if it's entirely beyond the EOF block.
>> +		 */
>> +		if ((flags & IOMAP_ZERO) &&
>> +		    isnullstartblock(imap.br_startblock)) {
>> +			if (offset_fsb > eof_fsb)
>> +				goto convert_delay;
>> +			if (end_fsb > eof_fsb) {
>> +				end_fsb = eof_fsb + 1;
>> +				xfs_trim_extent(&imap, offset_fsb,
>> +						end_fsb - offset_fsb);
>> +			}
>> +		}
>> +
>>  		/*
>>  		 * For reflink files we may need a delalloc reservation when
>>  		 * overwriting shared extents.   This includes zeroing of
>> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin(
>>  	xfs_iunlock(ip, lockmode);
>>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>>  
>> +convert_delay:
>> +	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
>> +	xfs_iunlock(ip, lockmode);
>> +	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
>> +	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
>> +				       flags, &imap, &seq);
> 
> I expected this to be a direct call to xfs_bmapi_convert_delalloc.
> What was the reason not for using that?
> 

It's because xfs_bmapi_convert_delalloc() isn't guaranteed to convert
enough blocks in one call; it may convert fewer blocks than requested
when there aren't enough contiguous free physical blocks. If we are
going to use it, I suppose we need to introduce a new helper, something
like xfs_convert_blocks(), that loops to do the conversion.

xfs_iomap_write_direct() already does all the work of converting, but
the name of this function is less obvious than
xfs_bmapi_convert_delalloc(). I can change to use it if you think
xfs_bmapi_convert_delalloc() is better. :)

Thanks,
Yi.

> --D
> 
>> +	if (error)
>> +		return error;
>> +
>> +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
>> +	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
>> +
>>  found_cow:
>>  	seq = xfs_iomap_inode_sequence(ip, 0);
>>  	if (imap.br_startoff <= offset_fsb) {
>> -- 
>> 2.39.2
>>
>>



* Re: [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-12 12:21     ` Christoph Hellwig
@ 2024-03-12 12:44       ` Zhang Yi
  0 siblings, 0 replies; 25+ messages in thread
From: Zhang Yi @ 2024-03-12 12:44 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, brauner, david, tytso,
	jack, yi.zhang, chengzhihao1, yukuai3

On 2024/3/12 20:21, Christoph Hellwig wrote:
> On Mon, Mar 11, 2024 at 08:37:37AM -0700, Darrick J. Wong wrote:
>>> +convert_delay:
>>> +	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
>>> +	xfs_iunlock(ip, lockmode);
>>> +	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
>>> +	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
>>> +				       flags, &imap, &seq);
>>
>> I expected this to be a direct call to xfs_bmapi_convert_delalloc.
>> What was the reason not for using that?
> 
> Same here.  The fact that we even convert delalloc reservations in
> xfs_iomap_write_direct is something that doesn't make much sense
> given that we're punching out delalloc reservations before starting
> direct I/O.
> 

OK, sure, I will use xfs_bmapi_convert_delalloc() in my next iteration.

Thanks,
Yi.



* Re: [PATCH 3/4] iomap: don't increase i_size if it's not a write operation
  2024-03-11 15:48   ` Darrick J. Wong
  2024-03-12 12:22     ` Christoph Hellwig
@ 2024-03-12 12:59     ` Zhang Yi
  2024-03-12 16:24       ` Darrick J. Wong
  1 sibling, 1 reply; 25+ messages in thread
From: Zhang Yi @ 2024-03-12 12:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/3/11 23:48, Darrick J. Wong wrote:
> On Mon, Mar 11, 2024 at 08:22:54PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Increasing i_size in iomap_zero_range() and iomap_unshare_iter() is
>> not needed; the caller should handle it. In particular, when
>> truncating a partial block, we must not increase i_size beyond the
>> new EOF here. This doesn't affect xfs and gfs2 now because they set
>> the new file size after zeroing out, so a transient increase in
>> i_size doesn't matter, but it will affect ext4 because it sets the
>> file size before truncating.
> 
>>                                                       At the same time,
>> iomap_write_failed() is also not needed for the above two cases, so
>> factor them out and move them to iomap_write_iter() and
>> iomap_zero_iter().
> 
> This change should be a separate patch with its own justification.
> Which is, AFAICT, something along the lines of:
> 
> "Unsharing and zeroing can only happen within EOF, so there is never a
> need to perform posteof pagecache truncation if write begin fails."

Sure.

> 
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Doesn't this patch fix a bug in ext4?

Yeah, the same as Christoph answered.

> 
>> ---
>>  fs/iomap/buffered-io.c | 59 +++++++++++++++++++++---------------------
>>  1 file changed, 30 insertions(+), 29 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 093c4515b22a..19f91324c690 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -786,7 +786,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>>  
>>  out_unlock:
>>  	__iomap_put_folio(iter, pos, 0, folio);
>> -	iomap_write_failed(iter->inode, pos, len);
>>  
>>  	return status;
>>  }
>> @@ -838,34 +837,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
>>  		size_t copied, struct folio *folio)
>>  {
>>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
>> -	loff_t old_size = iter->inode->i_size;
>> -	size_t ret;
>> -
>> -	if (srcmap->type == IOMAP_INLINE) {
>> -		ret = iomap_write_end_inline(iter, folio, pos, copied);
>> -	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
>> -		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
>> -				copied, &folio->page, NULL);
>> -	} else {
>> -		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
>> -	}
>>  
>> -	/*
>> -	 * Update the in-memory inode size after copying the data into the page
>> -	 * cache.  It's up to the file system to write the updated size to disk,
>> -	 * preferably after I/O completion so that no stale data is exposed.
>> -	 */
>> -	if (pos + ret > old_size) {
>> -		i_size_write(iter->inode, pos + ret);
>> -		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
>> -	}
>> -	__iomap_put_folio(iter, pos, ret, folio);
>> -
>> -	if (old_size < pos)
>> -		pagecache_isize_extended(iter->inode, old_size, pos);
>> -	if (ret < len)
>> -		iomap_write_failed(iter->inode, pos + ret, len - ret);
>> -	return ret;
>> +	if (srcmap->type == IOMAP_INLINE)
>> +		return iomap_write_end_inline(iter, folio, pos, copied);
>> +	if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
>> +		return block_write_end(NULL, iter->inode->i_mapping, pos, len,
>> +				       copied, &folio->page, NULL);
>> +	return __iomap_write_end(iter->inode, pos, len, copied, folio);
>>  }
>>  
>>  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>> @@ -880,6 +858,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>>  
>>  	do {
>>  		struct folio *folio;
>> +		loff_t old_size;
>>  		size_t offset;		/* Offset into folio */
>>  		size_t bytes;		/* Bytes to write to folio */
>>  		size_t copied;		/* Bytes copied from user */
>> @@ -912,8 +891,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>>  		}
>>  
>>  		status = iomap_write_begin(iter, pos, bytes, &folio);
>> -		if (unlikely(status))
>> +		if (unlikely(status)) {
>> +			iomap_write_failed(iter->inode, pos, bytes);
>>  			break;
>> +		}
>>  		if (iter->iomap.flags & IOMAP_F_STALE)
>>  			break;
>>  
>> @@ -927,6 +908,24 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>>  		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
>>  		status = iomap_write_end(iter, pos, bytes, copied, folio);
>>  
>> +		/*
>> +		 * Update the in-memory inode size after copying the data into
>> +		 * the page cache.  It's up to the file system to write the
>> +		 * updated size to disk, preferably after I/O completion so that
>> +		 * no stale data is exposed.
>> +		 */
>> +		old_size = iter->inode->i_size;
>> +		if (pos + status > old_size) {
>> +			i_size_write(iter->inode, pos + status);
>> +			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
>> +		}
>> +		__iomap_put_folio(iter, pos, status, folio);
> 
> Why is it necessary to hoist the __iomap_put_folio calls from
> iomap_write_end into iomap_write_iter, iomap_unshare_iter, and
> iomap_zero_iter?  None of those functions seem to use it, and it makes
> more sense to me that iomap_write_end releases the folio that
> iomap_write_begin returned.
> 

Because we have to update i_size before __iomap_put_folio() in
iomap_write_iter(). If we don't, then once we unlock the folio, we
could race with background writeback, which could call
folio_zero_segment() (please see iomap_writepage_handle_eof()) to
zero out valid data beyond the not-yet-updated i_size. So we have
to move __iomap_put_folio() out together with the i_size update.

Thanks,
Yi.



* Re: [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-12 12:31     ` Zhang Yi
@ 2024-03-12 16:21       ` Darrick J. Wong
  2024-03-13  7:07         ` Zhang Yi
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-12 16:21 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Tue, Mar 12, 2024 at 08:31:58PM +0800, Zhang Yi wrote:
> On 2024/3/11 23:37, Darrick J. Wong wrote:
> > On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Current clone operation could be non-atomic if the destination of a file
> >> is beyond EOF, user could get a file with corrupted (zeroed) data on
> >> crash.
> >>
> >> The problem is about pre-allocations. If you write some data into a
> >> file [A, B) (the position letters are increased one by one), and xfs
> >> could pre-allocate some blocks, then we get a delayed extent [A, D).
> >> Then the writeback path allocate blocks and convert this delayed extent
> >> [A, C) since lack of enough contiguous physical blocks, so the extent
> >> [C, D) is still delayed. After that, both the in-memory and the on-disk
> >> file size are B. If we clone file range into [E, F) from another file,
> >> xfs_reflink_zero_posteof() would call iomap_zero_range() to zero out the
> >> range [B, E) beyond EOF and flush range. Since [C, D) is still a delayed
> >> extent, it will be zeroed and the file's in-memory && on-disk size will
> >> be updated to D after flushing and before doing the clone operation.
> >> This is wrong, because the user can see the size change and read
> >> zeros in the middle of the clone operation.
> >>
> >> We need to keep the in-memory and on-disk size before the clone
> >> operation starts, so instead of writing zeroes through the page cache
> >> for delayed ranges beyond EOF, we convert these ranges to unwritten and
> >> invalidate any cached data over that range beyond EOF.
> >>
> >> Suggested-by: Dave Chinner <david@fromorbit.com>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
> >>  1 file changed, 29 insertions(+)
> >>
> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >> index ccf83e72d8ca..2b2aace25355 100644
> >> --- a/fs/xfs/xfs_iomap.c
> >> +++ b/fs/xfs/xfs_iomap.c
> >> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin(
> >>  	struct xfs_mount	*mp = ip->i_mount;
> >>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> >>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
> >> +	xfs_fileoff_t		eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip));
> >>  	struct xfs_bmbt_irec	imap, cmap;
> >>  	struct xfs_iext_cursor	icur, ccur;
> >>  	xfs_fsblock_t		prealloc_blocks = 0;
> >> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin(
> >>  	}
> >>  
> >>  	if (imap.br_startoff <= offset_fsb) {
> >> +		/*
> >> +		 * For zeroing out a delayed allocation extent, we trim it if
> >> +		 * it's partially beyond the EOF block, or convert it to an
> >> +		 * unwritten extent if it's entirely beyond the EOF block.
> >> +		 */
> >> +		if ((flags & IOMAP_ZERO) &&
> >> +		    isnullstartblock(imap.br_startblock)) {
> >> +			if (offset_fsb > eof_fsb)
> >> +				goto convert_delay;
> >> +			if (end_fsb > eof_fsb) {
> >> +				end_fsb = eof_fsb + 1;
> >> +				xfs_trim_extent(&imap, offset_fsb,
> >> +						end_fsb - offset_fsb);
> >> +			}
> >> +		}
> >> +
> >>  		/*
> >>  		 * For reflink files we may need a delalloc reservation when
> >>  		 * overwriting shared extents.   This includes zeroing of
> >> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin(
> >>  	xfs_iunlock(ip, lockmode);
> >>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
> >>  
> >> +convert_delay:
> >> +	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
> >> +	xfs_iunlock(ip, lockmode);
> >> +	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
> >> +	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
> >> +				       flags, &imap, &seq);
> > 
> > I expected this to be a direct call to xfs_bmapi_convert_delalloc.
> > What was the reason not for using that?
> > 
> 
> It's because xfs_bmapi_convert_delalloc() isn't guaranteed to convert
> enough blocks in one call; it may convert fewer blocks than requested
> when there aren't enough contiguous free physical blocks. If we are
> going to use it, I suppose we need to introduce a new helper, something
> like xfs_convert_blocks(), that loops to do the conversion.

I thought xfs_bmapi_convert_delalloc passes out (via @iomap) the extent
that xfs_bmapi_allocate (or anyone else) allocated (bma.got).  If that
mapping is shorter, won't xfs_buffered_write_iomap_begin pass the
shortened mapping out to the iomap machinery?  In which case that
iomap_iter loop will call ->iomap_begin on the unfinished delalloc
conversion work?

> xfs_iomap_write_direct() already does all the work of converting, but
> the name of this function is less obvious than
> xfs_bmapi_convert_delalloc(). I can change to use it if you think
> xfs_bmapi_convert_delalloc() is better. :)

Yes.

--D

> Thanks,
> Yi.
> 
> > --D
> > 
> >> +	if (error)
> >> +		return error;
> >> +
> >> +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
> >> +	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
> >> +
> >>  found_cow:
> >>  	seq = xfs_iomap_inode_sequence(ip, 0);
> >>  	if (imap.br_startoff <= offset_fsb) {
> >> -- 
> >> 2.39.2
> >>
> >>
> 
> 


* Re: [PATCH 3/4] iomap: don't increase i_size if it's not a write operation
  2024-03-12 12:59     ` Zhang Yi
@ 2024-03-12 16:24       ` Darrick J. Wong
  2024-03-13  7:09         ` Zhang Yi
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-12 16:24 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Tue, Mar 12, 2024 at 08:59:15PM +0800, Zhang Yi wrote:
> On 2024/3/11 23:48, Darrick J. Wong wrote:
> > On Mon, Mar 11, 2024 at 08:22:54PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Increasing i_size in iomap_zero_range() and iomap_unshare_iter() is
> >> not needed; the caller should handle it. In particular, when
> >> truncating a partial block, we must not increase i_size beyond the
> >> new EOF here. This doesn't affect xfs and gfs2 now because they set
> >> the new file size after zeroing out, so a transient increase in
> >> i_size doesn't matter, but it will affect ext4 because it sets the
> >> file size before truncating.
> > 
> >>                                                       At the same time,
> >> iomap_write_failed() is also not needed for the above two cases, so
> >> factor those calls out and move them to iomap_write_iter() and
> >> iomap_zero_iter().
> > 
> > This change should be a separate patch with its own justification.
> > Which is, AFAICT, something along the lines of:
> > 
> > "Unsharing and zeroing can only happen within EOF, so there is never a
> > need to perform posteof pagecache truncation if write begin fails."
> 
> Sure.
> 
> > 
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > 
> > Doesn't this patch fix a bug in ext4?
> 
> Yeah, the same as Christoph answered.
> 
> > 
> >> ---
> >>  fs/iomap/buffered-io.c | 59 +++++++++++++++++++++---------------------
> >>  1 file changed, 30 insertions(+), 29 deletions(-)
> >>
> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> index 093c4515b22a..19f91324c690 100644
> >> --- a/fs/iomap/buffered-io.c
> >> +++ b/fs/iomap/buffered-io.c
> >> @@ -786,7 +786,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> >>  
> >>  out_unlock:
> >>  	__iomap_put_folio(iter, pos, 0, folio);
> >> -	iomap_write_failed(iter->inode, pos, len);
> >>  
> >>  	return status;
> >>  }
> >> @@ -838,34 +837,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
> >>  		size_t copied, struct folio *folio)
> >>  {
> >>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> >> -	loff_t old_size = iter->inode->i_size;
> >> -	size_t ret;
> >> -
> >> -	if (srcmap->type == IOMAP_INLINE) {
> >> -		ret = iomap_write_end_inline(iter, folio, pos, copied);
> >> -	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> >> -		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
> >> -				copied, &folio->page, NULL);
> >> -	} else {
> >> -		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
> >> -	}
> >>  
> >> -	/*
> >> -	 * Update the in-memory inode size after copying the data into the page
> >> -	 * cache.  It's up to the file system to write the updated size to disk,
> >> -	 * preferably after I/O completion so that no stale data is exposed.
> >> -	 */
> >> -	if (pos + ret > old_size) {
> >> -		i_size_write(iter->inode, pos + ret);
> >> -		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> >> -	}
> >> -	__iomap_put_folio(iter, pos, ret, folio);
> >> -
> >> -	if (old_size < pos)
> >> -		pagecache_isize_extended(iter->inode, old_size, pos);
> >> -	if (ret < len)
> >> -		iomap_write_failed(iter->inode, pos + ret, len - ret);
> >> -	return ret;
> >> +	if (srcmap->type == IOMAP_INLINE)
> >> +		return iomap_write_end_inline(iter, folio, pos, copied);
> >> +	if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
> >> +		return block_write_end(NULL, iter->inode->i_mapping, pos, len,
> >> +				       copied, &folio->page, NULL);
> >> +	return __iomap_write_end(iter->inode, pos, len, copied, folio);
> >>  }
> >>  
> >>  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> >> @@ -880,6 +858,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> >>  
> >>  	do {
> >>  		struct folio *folio;
> >> +		loff_t old_size;
> >>  		size_t offset;		/* Offset into folio */
> >>  		size_t bytes;		/* Bytes to write to folio */
> >>  		size_t copied;		/* Bytes copied from user */
> >> @@ -912,8 +891,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> >>  		}
> >>  
> >>  		status = iomap_write_begin(iter, pos, bytes, &folio);
> >> -		if (unlikely(status))
> >> +		if (unlikely(status)) {
> >> +			iomap_write_failed(iter->inode, pos, bytes);
> >>  			break;
> >> +		}
> >>  		if (iter->iomap.flags & IOMAP_F_STALE)
> >>  			break;
> >>  
> >> @@ -927,6 +908,24 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> >>  		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> >>  		status = iomap_write_end(iter, pos, bytes, copied, folio);
> >>  
> >> +		/*
> >> +		 * Update the in-memory inode size after copying the data into
> >> +		 * the page cache.  It's up to the file system to write the
> >> +		 * updated size to disk, preferably after I/O completion so that
> >> +		 * no stale data is exposed.
> >> +		 */
> >> +		old_size = iter->inode->i_size;
> >> +		if (pos + status > old_size) {
> >> +			i_size_write(iter->inode, pos + status);
> >> +			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> >> +		}
> >> +		__iomap_put_folio(iter, pos, status, folio);
> > 
> > Why is it necessary to hoist the __iomap_put_folio calls from
> > iomap_write_end into iomap_write_iter, iomap_unshare_iter, and
> > iomap_zero_iter?  None of those functions seem to use it, and it makes
> > more sense to me that iomap_write_end releases the folio that
> > iomap_write_begin returned.
> > 
> 
> Because we have to update i_size before __iomap_put_folio() in
> iomap_write_iter(). If we don't, then once we unlock the folio it can
> be raced by background writeback, which may start writing back and call
> folio_zero_segment() (please see iomap_writepage_handle_eof()) to
> zero out the valid data beyond the not-yet-updated i_size. So we
> have to hoist __iomap_put_folio() out together with the i_size
> update.

Ahah.  Please make a note of that in the comment for dunces like me.

	/*
	 * Update the in-memory inode size after copying the data into
	 * the page cache.  It's up to the file system to write the
	 * updated size to disk, preferably after I/O completion so that
	 * no stale data is exposed.  Only once that's done can we
	 * unlock and release the folio.
	 */

--D

> Thanks,
> Yi.
> 
> 


* Re: [PATCH 4/4] iomap: cleanup iomap_write_iter()
  2024-03-12 12:24     ` Christoph Hellwig
@ 2024-03-12 16:27       ` Darrick J. Wong
  2024-03-13  9:23         ` Zhang Yi
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-12 16:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On Tue, Mar 12, 2024 at 05:24:00AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 11, 2024 at 09:07:39AM -0700, Darrick J. Wong wrote:
> > If at some point iomap_write_end actually starts returning partial write
> > completions (e.g. you wrote 250 bytes, but for some reason the pagecache
> > only acknowledges 100 bytes were written) then this code no longer
> > reverts the iter or truncates posteof pagecache correctly...
> 
> I don't think it makes sense to return a partial write from
> iomap_write_end.  But to make that clear it really should not return
> a byte count but a boolean.  I've been wanting to make that cleanup
> for a while, but it would reach all the way into buffer.c.

For now, can we change the return types of iomap_write_end_inline and
__iomap_write_end?  Then iomap can WARN_ON if the block_write_end return
value isn't 0 or copied:

	bool ret;

	if (srcmap->type == IOMAP_INLINE) {
		ret = iomap_write_end_inline(iter, folio, pos, copied);
	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
		size_t bh_written;

		bh_written = block_write_end(NULL, iter->inode->i_mapping,
				pos, len, copied, &folio->page, NULL);

		WARN_ON(bh_written != copied && bh_written != 0);
		ret = bh_written == copied;
	} else {
		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
	}

	...

	return ret;

Some day later we can circle back to bufferheads, or maybe they'll die
before we get to it. ;)

--D


* Re: [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-12 16:21       ` Darrick J. Wong
@ 2024-03-13  7:07         ` Zhang Yi
  2024-03-13 13:25           ` Zhang Yi
  0 siblings, 1 reply; 25+ messages in thread
From: Zhang Yi @ 2024-03-13  7:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/3/13 0:21, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 08:31:58PM +0800, Zhang Yi wrote:
>> On 2024/3/11 23:37, Darrick J. Wong wrote:
>>> On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> The current clone operation can be non-atomic if the destination of a
>>>> file is beyond EOF; on crash, users can get a file with corrupted
>>>> (zeroed) data.
>>>>
>>>> The problem is caused by pre-allocations. If you write some data into a
>>>> file [A, B) (the position letters increase one by one), xfs may
>>>> pre-allocate some blocks, so we get a delayed extent [A, D). The
>>>> writeback path then allocates blocks and converts only part of this
>>>> delayed extent, [A, C), for lack of enough contiguous physical blocks,
>>>> so the extent [C, D) is still delayed. After that, both the in-memory
>>>> and the on-disk file sizes are B. If we clone a file range into [E, F)
>>>> from another file, xfs_reflink_zero_posteof() calls iomap_zero_range()
>>>> to zero out the range [B, E) beyond EOF and flush it. Since [C, D) is
>>>> still a delayed extent, it will be zeroed, and the file's in-memory and
>>>> on-disk sizes will be updated to D after flushing, before the clone
>>>> operation is done. This is wrong, because users can see the size change
>>>> and read zeros in the middle of the clone operation.
>>>>
>>>> We need to keep the in-memory and on-disk sizes as they were before the
>>>> clone operation starts, so instead of writing zeroes through the page
>>>> cache for delayed ranges beyond EOF, we convert these ranges to
>>>> unwritten and invalidate any cached data over them.
>>>>
>>>> Suggested-by: Dave Chinner <david@fromorbit.com>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>>>>  1 file changed, 29 insertions(+)
>>>>
>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>> index ccf83e72d8ca..2b2aace25355 100644
>>>> --- a/fs/xfs/xfs_iomap.c
>>>> +++ b/fs/xfs/xfs_iomap.c
>>>> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin(
>>>>  	struct xfs_mount	*mp = ip->i_mount;
>>>>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>>>>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
>>>> +	xfs_fileoff_t		eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip));
>>>>  	struct xfs_bmbt_irec	imap, cmap;
>>>>  	struct xfs_iext_cursor	icur, ccur;
>>>>  	xfs_fsblock_t		prealloc_blocks = 0;
>>>> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin(
>>>>  	}
>>>>  
>>>>  	if (imap.br_startoff <= offset_fsb) {
>>>> +		/*
>>>> +		 * For zeroing out a delayed allocation extent, we trim it
>>>> +		 * if it partially extends beyond the EOF block, or convert
>>>> +		 * it to an unwritten extent if it's entirely beyond it.
>>>> +		 */
>>>> +		if ((flags & IOMAP_ZERO) &&
>>>> +		    isnullstartblock(imap.br_startblock)) {
>>>> +			if (offset_fsb > eof_fsb)
>>>> +				goto convert_delay;
>>>> +			if (end_fsb > eof_fsb) {
>>>> +				end_fsb = eof_fsb + 1;
>>>> +				xfs_trim_extent(&imap, offset_fsb,
>>>> +						end_fsb - offset_fsb);
>>>> +			}
>>>> +		}
>>>> +
>>>>  		/*
>>>>  		 * For reflink files we may need a delalloc reservation when
>>>>  		 * overwriting shared extents.   This includes zeroing of
>>>> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin(
>>>>  	xfs_iunlock(ip, lockmode);
>>>>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>>>>  
>>>> +convert_delay:
>>>> +	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
>>>> +	xfs_iunlock(ip, lockmode);
>>>> +	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
>>>> +	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
>>>> +				       flags, &imap, &seq);
>>>
>>> I expected this to be a direct call to xfs_bmapi_convert_delalloc.
>>> What was the reason not for using that?
>>>
>>
>> It's because xfs_bmapi_convert_delalloc() isn't guaranteed to convert
>> enough blocks in one call; it may convert too few blocks when there
>> aren't enough contiguous free physical blocks. If we are going to use
>> it, I suppose we need to introduce a new helper, something like
>> xfs_convert_blocks(), that loops to do the conversion.
> 
> I thought xfs_bmapi_convert_delalloc passes out (via @iomap) the extent
> that xfs_bmapi_allocate (or anyone else) allocated (bma.got).  If that
> mapping is shorter, won't xfs_buffered_write_iomap_begin pass the
> shortened mapping out to the iomap machinery?  In which case that
> iomap_iter loop will call ->iomap_begin on the unfinished delalloc
> conversion work?

Yeah, that makes sense and it works; I forgot about this loop in iomap_iter().

Thanks,
Yi.



* Re: [PATCH 3/4] iomap: don't increase i_size if it's not a write operation
  2024-03-12 16:24       ` Darrick J. Wong
@ 2024-03-13  7:09         ` Zhang Yi
  0 siblings, 0 replies; 25+ messages in thread
From: Zhang Yi @ 2024-03-13  7:09 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/3/13 0:24, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 08:59:15PM +0800, Zhang Yi wrote:
>> On 2024/3/11 23:48, Darrick J. Wong wrote:
>>> On Mon, Mar 11, 2024 at 08:22:54PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
[...]
>>>> @@ -927,6 +908,24 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>>>>  		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
>>>>  		status = iomap_write_end(iter, pos, bytes, copied, folio);
>>>>  
>>>> +		/*
>>>> +		 * Update the in-memory inode size after copying the data into
>>>> +		 * the page cache.  It's up to the file system to write the
>>>> +		 * updated size to disk, preferably after I/O completion so that
>>>> +		 * no stale data is exposed.
>>>> +		 */
>>>> +		old_size = iter->inode->i_size;
>>>> +		if (pos + status > old_size) {
>>>> +			i_size_write(iter->inode, pos + status);
>>>> +			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
>>>> +		}
>>>> +		__iomap_put_folio(iter, pos, status, folio);
>>>
>>> Why is it necessary to hoist the __iomap_put_folio calls from
>>> iomap_write_end into iomap_write_iter, iomap_unshare_iter, and
>>> iomap_zero_iter?  None of those functions seem to use it, and it makes
>>> more sense to me that iomap_write_end releases the folio that
>>> iomap_write_begin returned.
>>>
>>
>> Because we have to update i_size before __iomap_put_folio() in
>> iomap_write_iter(). If we don't, then once we unlock the folio it can
>> be raced by background writeback, which may start writing back and call
>> folio_zero_segment() (please see iomap_writepage_handle_eof()) to
>> zero out the valid data beyond the not-yet-updated i_size. So we
>> have to hoist __iomap_put_folio() out together with the i_size
>> update.
> 
> Ahah.  Please make a note of that in the comment for dunces like me.
> 
> 	/*
> 	 * Update the in-memory inode size after copying the data into
> 	 * the page cache.  It's up to the file system to write the
> 	 * updated size to disk, preferably after I/O completion so that
> 	 * no stale data is exposed.  Only once that's done can we
> 	 * unlock and release the folio.
> 	 */
> 

Sure.




* Re: [PATCH 4/4] iomap: cleanup iomap_write_iter()
  2024-03-12 16:27       ` Darrick J. Wong
@ 2024-03-13  9:23         ` Zhang Yi
  0 siblings, 0 replies; 25+ messages in thread
From: Zhang Yi @ 2024-03-13  9:23 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-kernel, brauner, david, tytso,
	jack, yi.zhang, chengzhihao1, yukuai3

On 2024/3/13 0:27, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 05:24:00AM -0700, Christoph Hellwig wrote:
>> On Mon, Mar 11, 2024 at 09:07:39AM -0700, Darrick J. Wong wrote:
>>> If at some point iomap_write_end actually starts returning partial write
>>> completions (e.g. you wrote 250 bytes, but for some reason the pagecache
>>> only acknowledges 100 bytes were written) then this code no longer
>>> reverts the iter or truncates posteof pagecache correctly...
>>
>> I don't think it makes sense to return a partial write from
>> iomap_write_end.  But to make that clear it really should not return
>> a byte count but a boolean.  I've been wanting to make that cleanup
>> for a while, but it would reach all the way into buffer.c.
> 
> For now, can we change the return types of iomap_write_end_inline and
> __iomap_write_end?  Then iomap can WARN_ON if the block_write_end return
> value isn't 0 or copied:
> 
> 	bool ret;
> 
> 	if (srcmap->type == IOMAP_INLINE) {
> 		ret = iomap_write_end_inline(iter, folio, pos, copied);
> 	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> 		size_t bh_written;
> 
> 		bh_written = block_write_end(NULL, iter->inode->i_mapping,
> 				pos, len, copied, &folio->page, NULL);
> 
> 		WARN_ON(bh_written != copied && bh_written != 0);
> 		ret = bh_written == copied;
> 	} else {
> 		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
> 	}
> 
> 	...
> 
> 	return ret;
> 
> Some day later we can circle back to bufferheads, or maybe they'll die
> before we get to it. ;)
> 

It looks great to me for now, we can revise iomap first.

Thanks,
Yi.




* Re: [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-13  7:07         ` Zhang Yi
@ 2024-03-13 13:25           ` Zhang Yi
  2024-03-13 20:05             ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Zhang Yi @ 2024-03-13 13:25 UTC (permalink / raw)
  To: Zhang Yi, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, chengzhihao1, yukuai3, yi.zhang

On 2024/3/13 15:07, Zhang Yi wrote:
> On 2024/3/13 0:21, Darrick J. Wong wrote:
>> On Tue, Mar 12, 2024 at 08:31:58PM +0800, Zhang Yi wrote:
>>> On 2024/3/11 23:37, Darrick J. Wong wrote:
>>>> On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> The current clone operation can be non-atomic if the destination of a
>>>>> file is beyond EOF; on crash, users can get a file with corrupted
>>>>> (zeroed) data.
>>>>>
>>>>> The problem is caused by pre-allocations. If you write some data into a
>>>>> file [A, B) (the position letters increase one by one), xfs may
>>>>> pre-allocate some blocks, so we get a delayed extent [A, D). The
>>>>> writeback path then allocates blocks and converts only part of this
>>>>> delayed extent, [A, C), for lack of enough contiguous physical blocks,
>>>>> so the extent [C, D) is still delayed. After that, both the in-memory
>>>>> and the on-disk file sizes are B. If we clone a file range into [E, F)
>>>>> from another file, xfs_reflink_zero_posteof() calls iomap_zero_range()
>>>>> to zero out the range [B, E) beyond EOF and flush it. Since [C, D) is
>>>>> still a delayed extent, it will be zeroed, and the file's in-memory and
>>>>> on-disk sizes will be updated to D after flushing, before the clone
>>>>> operation is done. This is wrong, because users can see the size change
>>>>> and read zeros in the middle of the clone operation.
>>>>>
>>>>> We need to keep the in-memory and on-disk sizes as they were before the
>>>>> clone operation starts, so instead of writing zeroes through the page
>>>>> cache for delayed ranges beyond EOF, we convert these ranges to
>>>>> unwritten and invalidate any cached data over them.
>>>>>
>>>>> Suggested-by: Dave Chinner <david@fromorbit.com>
>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>> ---
>>>>>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>>>>>  1 file changed, 29 insertions(+)
>>>>>
>>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>>> index ccf83e72d8ca..2b2aace25355 100644
>>>>> --- a/fs/xfs/xfs_iomap.c
>>>>> +++ b/fs/xfs/xfs_iomap.c
>>>>> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin(
>>>>>  	struct xfs_mount	*mp = ip->i_mount;
>>>>>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>>>>>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
>>>>> +	xfs_fileoff_t		eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip));
>>>>>  	struct xfs_bmbt_irec	imap, cmap;
>>>>>  	struct xfs_iext_cursor	icur, ccur;
>>>>>  	xfs_fsblock_t		prealloc_blocks = 0;
>>>>> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin(
>>>>>  	}
>>>>>  
>>>>>  	if (imap.br_startoff <= offset_fsb) {
>>>>> +		/*
>>>>> +		 * For zeroing out a delayed allocation extent, we trim it
>>>>> +		 * if it partially extends beyond the EOF block, or convert
>>>>> +		 * it to an unwritten extent if it's entirely beyond it.
>>>>> +		 */
>>>>> +		if ((flags & IOMAP_ZERO) &&
>>>>> +		    isnullstartblock(imap.br_startblock)) {
>>>>> +			if (offset_fsb > eof_fsb)
>>>>> +				goto convert_delay;
>>>>> +			if (end_fsb > eof_fsb) {
>>>>> +				end_fsb = eof_fsb + 1;
>>>>> +				xfs_trim_extent(&imap, offset_fsb,
>>>>> +						end_fsb - offset_fsb);
>>>>> +			}
>>>>> +		}
>>>>> +
>>>>>  		/*
>>>>>  		 * For reflink files we may need a delalloc reservation when
>>>>>  		 * overwriting shared extents.   This includes zeroing of
>>>>> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin(
>>>>>  	xfs_iunlock(ip, lockmode);
>>>>>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>>>>>  
>>>>> +convert_delay:
>>>>> +	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
>>>>> +	xfs_iunlock(ip, lockmode);
>>>>> +	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
>>>>> +	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
>>>>> +				       flags, &imap, &seq);
>>>>
>>>> I expected this to be a direct call to xfs_bmapi_convert_delalloc.
>>>> What was the reason not for using that?
>>>>
>>>
>>> It's because xfs_bmapi_convert_delalloc() isn't guaranteed to convert
>>> enough blocks in one call; it may convert too few blocks when there
>>> aren't enough contiguous free physical blocks. If we are going to use
>>> it, I suppose we need to introduce a new helper, something like
>>> xfs_convert_blocks(), that loops to do the conversion.
>>
>> I thought xfs_bmapi_convert_delalloc passes out (via @iomap) the extent
>> that xfs_bmapi_allocate (or anyone else) allocated (bma.got).  If that
>> mapping is shorter, won't xfs_buffered_write_iomap_begin pass the
>> shortened mapping out to the iomap machinery?  In which case that
>> iomap_iter loop will call ->iomap_begin on the unfinished delalloc
>> conversion work?
> 
> Yeah, that makes sense and it works; I forgot about this loop in iomap_iter().

Sorry, I've found that it doesn't always work. Consider a special case:
suppose we have a file like the one below:

	A          B           C                    D
	+wwwwwwwwww+DDDDDDDDDDD+dddddddddddddddddddd+
	          EOF         EOF
               (on disk)  (in memory)

where 'd' is a delalloc block with no data and 'D' is a delalloc
block with dirty folios over it.

xfs_bmapi_convert_delalloc() might only convert some blocks from B to B',

	A          B   B'       C                    D
	+wwwwwwwwww+UUU+DDDDDDD+dddddddddddddddddddd+
	          EOF         EOF
               (on disk)  (in memory)

After that, it will trigger the warning below in iomap_iter_done():

 WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);

So I guess the loop is still needed; I plan to revise this and use
xfs_convert_blocks() here.

Yi.



* Re: [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks
  2024-03-13 13:25           ` Zhang Yi
@ 2024-03-13 20:05             ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-13 20:05 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	tytso, jack, chengzhihao1, yukuai3, yi.zhang

On Wed, Mar 13, 2024 at 09:25:49PM +0800, Zhang Yi wrote:
> On 2024/3/13 15:07, Zhang Yi wrote:
> > On 2024/3/13 0:21, Darrick J. Wong wrote:
> >> On Tue, Mar 12, 2024 at 08:31:58PM +0800, Zhang Yi wrote:
> >>> On 2024/3/11 23:37, Darrick J. Wong wrote:
> >>>> On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote:
> >>>>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>>>
> >>>>> The current clone operation can be non-atomic if the destination of a
> >>>>> file is beyond EOF; on crash, users can get a file with corrupted
> >>>>> (zeroed) data.
> >>>>>
> >>>>> The problem is caused by pre-allocations. If you write some data into a
> >>>>> file [A, B) (the position letters increase one by one), xfs may
> >>>>> pre-allocate some blocks, so we get a delayed extent [A, D). The
> >>>>> writeback path then allocates blocks and converts only part of this
> >>>>> delayed extent, [A, C), for lack of enough contiguous physical blocks,
> >>>>> so the extent [C, D) is still delayed. After that, both the in-memory
> >>>>> and the on-disk file sizes are B. If we clone a file range into [E, F)
> >>>>> from another file, xfs_reflink_zero_posteof() calls iomap_zero_range()
> >>>>> to zero out the range [B, E) beyond EOF and flush it. Since [C, D) is
> >>>>> still a delayed extent, it will be zeroed, and the file's in-memory and
> >>>>> on-disk sizes will be updated to D after flushing, before the clone
> >>>>> operation is done. This is wrong, because users can see the size change
> >>>>> and read zeros in the middle of the clone operation.
> >>>>>
> >>>>> We need to keep the in-memory and on-disk sizes as they were before the
> >>>>> clone operation starts, so instead of writing zeroes through the page
> >>>>> cache for delayed ranges beyond EOF, we convert these ranges to
> >>>>> unwritten and invalidate any cached data over them.
> >>>>>
> >>>>> Suggested-by: Dave Chinner <david@fromorbit.com>
> >>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>>>> ---
> >>>>>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
> >>>>>  1 file changed, 29 insertions(+)
> >>>>>
> >>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >>>>> index ccf83e72d8ca..2b2aace25355 100644
> >>>>> --- a/fs/xfs/xfs_iomap.c
> >>>>> +++ b/fs/xfs/xfs_iomap.c
> >>>>> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin(
> >>>>>  	struct xfs_mount	*mp = ip->i_mount;
> >>>>>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> >>>>>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, count);
> >>>>> +	xfs_fileoff_t		eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip));
> >>>>>  	struct xfs_bmbt_irec	imap, cmap;
> >>>>>  	struct xfs_iext_cursor	icur, ccur;
> >>>>>  	xfs_fsblock_t		prealloc_blocks = 0;
> >>>>> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin(
> >>>>>  	}
> >>>>>  
> >>>>>  	if (imap.br_startoff <= offset_fsb) {
> >>>>> +		/*
> >>>>> +		 * For zeroing out a delayed allocation extent, we trim it
> >>>>> +		 * if it partially extends beyond the EOF block, or convert
> >>>>> +		 * it to an unwritten extent if it's entirely beyond it.
> >>>>> +		 */
> >>>>> +		if ((flags & IOMAP_ZERO) &&
> >>>>> +		    isnullstartblock(imap.br_startblock)) {
> >>>>> +			if (offset_fsb > eof_fsb)
> >>>>> +				goto convert_delay;
> >>>>> +			if (end_fsb > eof_fsb) {
> >>>>> +				end_fsb = eof_fsb + 1;
> >>>>> +				xfs_trim_extent(&imap, offset_fsb,
> >>>>> +						end_fsb - offset_fsb);
> >>>>> +			}
> >>>>> +		}
> >>>>> +
> >>>>>  		/*
> >>>>>  		 * For reflink files we may need a delalloc reservation when
> >>>>>  		 * overwriting shared extents.   This includes zeroing of
> >>>>> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin(
> >>>>>  	xfs_iunlock(ip, lockmode);
> >>>>>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
> >>>>>  
> >>>>> +convert_delay:
> >>>>> +	end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
> >>>>> +	xfs_iunlock(ip, lockmode);
> >>>>> +	truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb));
> >>>>> +	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
> >>>>> +				       flags, &imap, &seq);
> >>>>
> >>>> I expected this to be a direct call to xfs_bmapi_convert_delalloc.
> >>>> What was the reason not for using that?
> >>>>
> >>>
> >>> It's because xfs_bmapi_convert_delalloc() isn't guaranteed to convert
> >>> enough blocks in one call; it may convert too few blocks when there
> >>> aren't enough contiguous free physical blocks. If we are going to use
> >>> it, I suppose we need to introduce a new helper, something like
> >>> xfs_convert_blocks(), that loops to do the conversion.
> >>
> >> I thought xfs_bmapi_convert_delalloc passes out (via @iomap) the extent
> >> that xfs_bmapi_allocate (or anyone else) allocated (bma.got).  If that
> >> mapping is shorter, won't xfs_buffered_write_iomap_begin pass the
> >> shortened mapping out to the iomap machinery?  In which case that
> >> iomap_iter loop will call ->iomap_begin on the unfinished delalloc
> >> conversion work?
> > 
> > Yeah, that makes sense and it works; I forgot about this loop in iomap_iter().
> 
> Sorry, I've found that it doesn't always work. Consider a special case:
> suppose we have a file like the one below:
> 
> 	A          B           C                    D
> 	+wwwwwwwwww+DDDDDDDDDDD+dddddddddddddddddddd+
> 	          EOF         EOF
>                (on disk)  (in memory)
> 
> where 'd' is a delalloc block with no data and 'D' is a delalloc
> block with dirty folios over it.
> 
> xfs_bmapi_convert_delalloc() might only convert some blocks from B to B',
> 
> 	A          B   B'       C                    D
> 	+wwwwwwwwww+UUU+DDDDDDD+dddddddddddddddddddd+
> 	          EOF         EOF
>                (on disk)  (in memory)
> 
> After that, it will trigger the warning below in iomap_iter_done():
> 
>  WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
> 
> So I guess the loop is still needed; I plan to revise this and use
> xfs_convert_blocks() here.

Ah, sounds good to me.  Though, I wouldn't work too hard to hammer that
writeback helper into a write helper.

--D

> Yi.
> 
> 


end of thread, other threads:[~2024-03-13 20:05 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-11 12:22 [PATCH 0/4] xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof Zhang Yi
2024-03-11 12:22 ` [PATCH 1/4] xfs: match lock mode in xfs_buffered_write_iomap_begin() Zhang Yi
2024-03-11 15:34   ` Darrick J. Wong
2024-03-12  8:18     ` Zhang Yi
2024-03-12 12:16   ` Christoph Hellwig
2024-03-11 12:22 ` [PATCH 2/4] xfs: convert delayed extents to unwritten when zeroing post eof blocks Zhang Yi
2024-03-11 15:37   ` Darrick J. Wong
2024-03-12 12:21     ` Christoph Hellwig
2024-03-12 12:44       ` Zhang Yi
2024-03-12 12:31     ` Zhang Yi
2024-03-12 16:21       ` Darrick J. Wong
2024-03-13  7:07         ` Zhang Yi
2024-03-13 13:25           ` Zhang Yi
2024-03-13 20:05             ` Darrick J. Wong
2024-03-11 12:22 ` [PATCH 3/4] iomap: don't increase i_size if it's not a write operation Zhang Yi
2024-03-11 15:48   ` Darrick J. Wong
2024-03-12 12:22     ` Christoph Hellwig
2024-03-12 12:59     ` Zhang Yi
2024-03-12 16:24       ` Darrick J. Wong
2024-03-13  7:09         ` Zhang Yi
2024-03-11 12:22 ` [PATCH 4/4] iomap: cleanup iomap_write_iter() Zhang Yi
2024-03-11 16:07   ` Darrick J. Wong
2024-03-12 12:24     ` Christoph Hellwig
2024-03-12 16:27       ` Darrick J. Wong
2024-03-13  9:23         ` Zhang Yi
