linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zhang Yi <yi.zhang@huaweicloud.com>
To: Christoph Hellwig <hch@infradead.org>,
	djwong@kernel.org, Dave Chinner <david@fromorbit.com>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com,
	willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com,
	chengzhihao1@huawei.com, yukuai3@huawei.com,
	wangkefeng.wang@huawei.com
Subject: Re: [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation
Date: Wed, 28 Feb 2024 16:53:32 +0800	[thread overview]
Message-ID: <9b0040ef-3d9d-6246-4bdd-82b9a8f55fa2@huaweicloud.com> (raw)
In-Reply-To: <ZcsCP4h-ExNOcdD6@infradead.org>

On 2024/2/13 13:46, Christoph Hellwig wrote:
> Wouldn't it make more sense to just move the size manipulation to the
> write-only code?  An untested version of that is below.  With this
> the naming of the status variable becomes even more confusing than
> it already is, maybe we need to do a cleanup of the *_write_end
> calling conventions as it always returns the passed in copied value
> or 0.
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3dab060aed6d7b..8401a9ca702fc0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -876,34 +876,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
>  		size_t copied, struct folio *folio)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> -	loff_t old_size = iter->inode->i_size;
> -	size_t ret;
> -
> -	if (srcmap->type == IOMAP_INLINE) {
> -		ret = iomap_write_end_inline(iter, folio, pos, copied);
> -	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> -		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
> -				copied, &folio->page, NULL);
> -	} else {
> -		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
> -	}
> -
> -	/*
> -	 * Update the in-memory inode size after copying the data into the page
> -	 * cache.  It's up to the file system to write the updated size to disk,
> -	 * preferably after I/O completion so that no stale data is exposed.
> -	 */
> -	if (pos + ret > old_size) {
> -		i_size_write(iter->inode, pos + ret);
> -		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> -	}

I've recently discovered that if we don't increase i_size in
iomap_zero_iter(), it would break fstests generic/476 on xfs. xfs
depends on iomap_zero_iter() to increase i_size in some cases.

 generic/476 75s ... _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
 (see /home/zhangyi/xfstests-dev/results//xfs/generic/476.full for details)

 _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
 *** xfs_repair -n output ***
 Phase 1 - find and verify superblock...
 Phase 2 - using internal log
         - zero log...
         - scan filesystem freespace and inode maps...
 sb_fdblocks 10916, counted 10923
         - found root inode chunk
 ...

After debugging and analysis, I found the root cause of the problem is
related to the pre-allocations of xfs. xfs pre-allocates some blocks to
reduce fragmentation during buffer append writing, then if we write new
data or do file copy(reflink) after the end of the pre-allocating range,
xfs would zero-out and write back the pre-allocate space(e.g.
xfs_file_write_checks() -> xfs_zero_range()), so we have to update
i_size before writing back in iomap_zero_iter(), otherwise, it will
result in stale delayed extent.

For more details, let's think about this case,
1. Buffered write from range [A, B) of an empty file foo, and
   xfs_buffered_write_iomap_begin() prealloc blocks for it, then create
   a delayed extent from [A, D).
2. Write back process map blocks but only convert above delayed extent
   from [A, C) since the lack of a contiguous physical blocks, now we
   have a left over delayed extent from [C, D), and the file size is B.
3. Copy range from another file to range [E, F), then
   xfs_reflink_zero_posteof() would zero-out post eof range [B, E), it
   writes zero, dirty and write back [C, E).
4. Since we don't update i_size in iomap_zero_iter(),the writeback
   doesn't write anything back, also doesn't convert the delayed extent.
   After copy range, the file size will update to F.
5. Finally, the delayed extent becomes stale, and the free blocks count
   becomes incorrect.

So, we have to handle above case for xfs. I suppose we could keep
increasing i_size if the zeroed folio is entirely outside of i_size,
make sure we could write back and allocate blocks for the
zeroed & delayed extent, something like below, any suggestions ?


@@ -1390,6 +1390,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)

 	do {
 		struct folio *folio;
+		loff_t old_size;
 		int status;
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
@@ -1408,6 +1409,28 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		folio_mark_accessed(folio);

 		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+
+		/*
+		 * If folio is entirely outside of i_size, update the
+		 * in-memory inode size after zeroing the data in the page
+		 * cache to prevent the write-back process from not writing
+		 * back zeroed pages.
+		 */
+		old_size = iter->inode->i_size;
+		if (pos + bytes > old_size) {
+			size_t offset = offset_in_folio(folio, old_size);
+			pgoff_t end_index = old_size >> PAGE_SHIFT;
+
+			if (folio->index > end_index ||
+			    (folio->index == end_index && offset == 0)) {
+				i_size_write(iter->inode, pos + bytes);
+				iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
+			}
+		}
+		__iomap_put_folio(iter, pos, bytes, folio);
+		if (old_size < pos)
+			pagecache_isize_extended(iter->inode, old_size, pos);
+
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;

Thansk,
Yi.


  parent reply	other threads:[~2024-02-28  8:53 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-27  1:57 [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large foilo Zhang Yi
2024-01-27  1:58 ` [PATCH v3 01/26] ext4: refactor ext4_da_map_blocks() Zhang Yi
2024-02-03 17:56   ` Theodore Ts'o
2024-01-27  1:58 ` [PATCH v3 02/26] ext4: convert to exclusive lock while inserting delalloc extents Zhang Yi
2024-02-03 17:56   ` Theodore Ts'o
2024-01-27  1:58 ` [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks() Zhang Yi
2024-02-03 17:56   ` Theodore Ts'o
2024-05-09 15:16   ` Luis Henriques
2024-05-09 16:39     ` Theodore Ts'o
2024-05-09 17:23       ` Luis Henriques
2024-05-10  3:39         ` Zhang Yi
2024-05-10  9:41           ` Luis Henriques
2024-05-10 11:40             ` Zhang Yi
2024-01-27  1:58 ` [PATCH v3 04/26] ext4: add a hole extent entry in cache after punch Zhang Yi
2024-02-03 17:56   ` Theodore Ts'o
2024-01-27  1:58 ` [PATCH v3 05/26] ext4: make ext4_map_blocks() distinguish delalloc only extent Zhang Yi
2024-02-03 17:57   ` Theodore Ts'o
2024-01-27  1:58 ` [PATCH v3 06/26] ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type Zhang Yi
2024-02-03 17:57   ` Theodore Ts'o
2024-01-27  1:58 ` [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation Zhang Yi
2024-02-13  5:46   ` Christoph Hellwig
2024-02-17  8:55     ` Zhang Yi
2024-02-18 23:30       ` Dave Chinner
2024-02-19  1:14         ` Zhang Yi
2024-02-28  8:53     ` Zhang Yi [this message]
2024-02-28 22:13       ` Christoph Hellwig
2024-02-29  9:20         ` Zhang Yi
2024-02-28 22:25       ` Dave Chinner
2024-02-29  8:59         ` Zhang Yi
2024-02-29 23:19           ` Dave Chinner
2024-02-29 23:29             ` Darrick J. Wong
2024-03-01  3:26             ` Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 08/26] iomap: add pos and dirty_len into trace_iomap_writepage_map Zhang Yi
2024-02-12  6:02   ` Christoph Hellwig
2024-02-19  1:27     ` Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 09/26] ext4: allow inserting delalloc extents with multi-blocks Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 10/26] ext4: correct delalloc extent length Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 11/26] ext4: also mark extent as delalloc if it's been unwritten Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 12/26] ext4: factor out bh handles to ext4_da_get_block_prep() Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 13/26] ext4: use reserved metadata blocks when splitting extent in endio Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 14/26] ext4: factor out ext4_map_{create|query}_blocks() Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 15/26] ext4: introduce seq counter for extent entry Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 16/26] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 17/26] ext4: implement buffered read iomap path Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 18/26] ext4: implement buffered write " Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 19/26] ext4: implement writeback " Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 20/26] ext4: implement mmap " Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 21/26] ext4: implement zero_range " Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 22/26] ext4: writeback partial blocks before zero range Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 23/26] ext4: fall back to buffer_head path for defrag Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 24/26] ext4: partially enable iomap for regular file's buffered IO path Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 25/26] filemap: support disable large folios on active inode Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 26/26] ext4: enable large folio for regular file with iomap buffered IO path Zhang Yi
2024-02-12  6:18 ` [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large foilo Darrick J. Wong
2024-02-12  9:16   ` Ritesh Harjani
2024-02-12 10:24     ` Matthew Wilcox
2024-02-17  9:31   ` Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9b0040ef-3d9d-6246-4bdd-82b9a8f55fa2@huaweicloud.com \
    --to=yi.zhang@huaweicloud.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=chengzhihao1@huawei.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=wangkefeng.wang@huawei.com \
    --cc=willy@infradead.org \
    --cc=yi.zhang@huawei.com \
    --cc=yukuai3@huawei.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).