linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zhang Yi <yi.zhang@huaweicloud.com>
To: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com,
	hch@infradead.org, djwong@kernel.org, david@fromorbit.com,
	willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com,
	chengzhihao1@huawei.com, yukuai3@huawei.com,
	wangkefeng.wang@huawei.com
Subject: Re: [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio
Date: Thu, 11 Apr 2024 09:12:04 +0800	[thread overview]
Message-ID: <88985780-ef76-74e6-468b-7ebb0998fb6b@huaweicloud.com> (raw)
In-Reply-To: <20240410142948.2817554-1-yi.zhang@huaweicloud.com>

On 2024/4/10 22:29, Zhang Yi wrote:
> Hello!
> 
> This is the fourth version of RFC patch series that convert ext4 regular
> file's buffered IO path to iomap and enable large folio. I've rebased it
> on 6.9-rc3, it also **depends on my xfs/iomap fix series** which has
> been reviewed but not merged yet[1]. Compared to the third vesion, this
> iteration fixes an issue discovered in current ext4 code, and contains
> another two main changes, 1) add bigalloc support and 2) simplify the
> updating logic of reserved delalloc data block, both changes could be
> sent out as preliminary patch series, besides these, others are some
> small code cleanups, performance optimize and commit log improvements.
> Please take a look at this series and any comments are welcome.
> 

I've uploaded this series and the dependency xfs/iomap changes to my
github repository, feel free to check it out.

https://github.com/zhangyi089/linux/commits/ext4_buffered_iomap_v4/

Thanks,
Yi.

> This series supports ext4 with the default features and mount
> options(bigalloc is also supported), doesn't support non-extent(ext3),
> inline_data, dax, fs_verity, fs_crypt and data=journal mode, ext4 would
> fall back to buffer_head path automatically if you enabled those
> features or options. Although it has many limitations now, it can satisfy
> the requirements of most common cases and bring a significant performance
> benefit for large IOs.
> 
> The iomap path would be simpler than the buffer_head path to some extent,
> please note that there are 4 major differences:
> 1. Always allocate unwritten extent for new blocks, it means that it's
>    not controlled by dioread_nolock mount option.
> 2. Since 1, there is no risk of exposing stale data during the append
>    write, so we don't need to write back data before metadata, it's time
>    to drop 'data = ordered' mode automatically.
> 3. Since 2, we don't need to reserve journal credits and use reserved
>    handle for the extent status conversion during writeback.
> 4. We could postpone updating the i_disksize to the endio, it could
>    avoid exposing zero data during append write and instantaneous power
>    failure.
> 
> Series details:
> Patch 1-9: this is the part 2 preparation series, it fix a problem
> first, and makes ext4_insert_delayed_block() call path support inserting
> multiple delalloc blocks (also support bigalloc), finally make
> ext4_da_map_blocks() buffer_head unaware, I've send it out separately[2]
> and hope this could be merged first.
> 
> Patch 10-19: this is the part 3 prepartory changes(picked out from my
> metadata reservation series[3], these are not a strong dependency
> patches, but I'd suggested these could be merged before the iomap
> conversion). These patches moves ext4_da_update_reserve_space() to
> ext4_es_insert_extent(), and always set EXT4_GET_BLOCKS_DELALLOC_RESERVE
> when allocating delalloc blocks, no matter it's from delayed allocate or
> non-delayed allocate (fallocate) path, it makes delalloc extents always
> delonly. These can make delalloc reservation simpler and cleaner than
> before.
> 
> Patch 20-34: These patches are the main implements of the buffered IO
> iomap conversion, It first introduce a sequence counter for extent
> status tree, then add a new iomap aops for read, write, mmap, replace
> current buffered_head path. Finally, enable iomap path besides inline
> data, non-extent, dax, fs_verity, fs_crypt, defrag and data=journal
> mode, if user specify "buffered_iomap" mount option, also enable large
> folio. Please look at the following patch for details.
> 
> About Tests:
>  - Pass kvm-xfstests in auto mode, and the keep running stress tests and
>    fault injection tests.
>  - A performance tests below (tested on my version 3 series,
>    theoretically there won't be much difference in this version).
> 
>    Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU
>    with 400GB system ram, 200GB ramdisk and 1TB nvme ssd disk.
> 
>    == buffer read ==
> 
>                   buffer head        iomap + large folio
>    type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
>    ----------------------------------------------------
>    hole     4K    565k    2206       811k    3167
>    hole     64K   45.1k   2820       78.1k   4879
>    hole     1M    2744    2744       4890    4891
>    ramdisk  4K    436k    1703       554k    2163
>    ramdisk  64K   29.6k   1848       44.0k   2747
>    ramdisk  1M    1994    1995       2809    2809
>    nvme     4K    306k    1196       324k    1267
>    nvme     64K   19.3k   1208       24.3k   1517
>    nvme     1M    1694    1694       2256    2256
> 
>    == buffer write ==
> 
>                                         buffer head  iomap + large folio
>    type   Overwrite Sync Writeback bs   IOPS   BW    IOPS   BW
>    ------------------------------------------------------------
>    cache    N       N    N         4K   395k   1544  415k   1621
>    cache    N       N    N         64K  30.8k  1928  80.1k  5005
>    cache    N       N    N         1M   1963   1963  5641   5642
>    cache    Y       N    N         4K   423k   1652  443k   1730
>    cache    Y       N    N         64K  33.0k  2063  80.8k  5051
>    cache    Y       N    N         1M   2103   2103  5588   5589
>    ramdisk  N       N    Y         4K   362k   1416  307k   1198
>    ramdisk  N       N    Y         64K  22.4k  1399  64.8k  4050
>    ramdisk  N       N    Y         1M   1670   1670  4559   4560
>    ramdisk  N       Y    N         4K   9830   38.4  13.5k  52.8
>    ramdisk  N       Y    N         64K  5834   365   10.1k  629
>    ramdisk  N       Y    N         1M   1011   1011  2064   2064
>    ramdisk  Y       N    Y         4K   397k   1550  409k   1598
>    ramdisk  Y       N    Y         64K  29.2k  1827  73.6k  4597
>    ramdisk  Y       N    Y         1M   1837   1837  4985   4985
>    ramdisk  Y       Y    N         4K   173k   675   182k   710
>    ramdisk  Y       Y    N         64K  17.7k  1109  33.7k  2105
>    ramdisk  Y       Y    N         1M   1128   1129  1790   1791
>    nvme     N       N    Y         4K   298k   1164  290k   1134
>    nvme     N       N    Y         64K  21.5k  1343  57.4k  3590
>    nvme     N       N    Y         1M   1308   1308  3664   3664
>    nvme     N       Y    N         4K   10.7k  41.8  12.0k  46.9
>    nvme     N       Y    N         64K  5962   373   8598   537
>    nvme     N       Y    N         1M   676    677   1417   1418
>    nvme     Y       N    Y         4K   366k   1430  373k   1456
>    nvme     Y       N    Y         64K  26.7k  1670  56.8k  3547
>    nvme     Y       N    Y         1M   1745   1746  3586   3586
>    nvme     Y       Y    N         4K   59.0k  230   61.2k  239
>    nvme     Y       Y    N         64K  13.0k  813   21.0k  1311
>    nvme     Y       Y    N         1M   683    683   1368   1369
>  
> TODO
>  - Keep on doing stress tests and fixing.
>  - Reserve enough space for delalloc metadata blocks and try to drop
>    ext4_nonda_switch().
>  - First support defrag and then support other more unsupported features
>    and mount options.
> 
> Changes since v3:
>  - Drop the part 1 prepartory patches which have been merged [4].
>  - Drop the two iomap patches since I've submitted separately [1].
>  - Fix an incorrect reserved delalloc blocks count and incorrect extent
>    status cache issue found on current ext4 code.
>  - Pick out part 2 prepartory patch series [2], it make
>    ext4_insert_delayed_block() call path support inserting multiple
>    delalloc blocks (also support bigalloc )and make ext4_da_map_blocks()
>    buffer_head unaware.
>  - Adjust and simplify the reserved delalloc blocks updating logic,
>    preparing for reserving meta data blocks for delalloc.
>  - Drop datasync dirty check in ext4_set_iomap() for buffered
>    read/write, improves the concurrent performance on small I/Os.
>  - Prevent always hold invalid_lock in page_cache_ra_order(), add
>    lockless check.
>  - Disable iomap path by default since it's experimental new, add a
>    mount option "buffered_iomap" to enable it.
>  - Some other minor fixes and change log improvements.
> Changes since v2:
>  - Update patch 1-6 to v3.
>  - iomap_zero and iomap_unshare don't need to update i_size and call
>    iomap_write_failed(), introduce a new helper iomap_write_end_simple()
>    to avoid doing that.
>  - Factor out ext4_[ext|ind]_map_blocks() parts from ext4_map_blocks(),
>    introduce a new helper ext4_iomap_map_one_extent() to allocate
>    delalloc blocks in writeback, which is always under i_data_sem in
>    write mode. This is done to prevent the writing back delalloc
>    extents become stale if it raced by truncate.
>  - Add a lock detection in mapping_clear_large_folios().
> Changes since v1:
>  - Introduce seq count for iomap buffered write and writeback to protect
>    races from extents changes, e.g. truncate, mwrite.
>  - Always allocate unwritten extents for new blocks, drop dioread_lock
>    mode, and make no distinctions between dioread_lock and
>    dioread_nolock.
>  - Don't add ditry data range to jinode, drop data=ordered mode, and
>    make no distinctions between data=ordered and data=writeback mode.
>  - Postpone updating i_disksize to endio.
>  - Allow splitting extents and use reserved space in endio.
>  - Instead of reimplement a new delayed mapping helper
>    ext4_iomap_da_map_blocks() for buffer write, try to reuse
>    ext4_da_map_blocks().
>  - Add support for disabling large folio on active inodes.
>  - Support online defragmentation, make file fall back to buffer_head
>    and disable large folio in ext4_move_extents().
>  - Move ext4_nonda_switch() in advance to prevent deadlock in mwrite.
>  - Add dirty_len and pos trace info to trace_iomap_writepage_map().
>  - Update patch 1-6 to v2.
> 
> [1] https://lore.kernel.org/linux-xfs/20240320110548.2200662-1-yi.zhang@huaweicloud.com/
> [2] https://lore.kernel.org/linux-ext4/20240410034203.2188357-1-yi.zhang@huaweicloud.com/
> [3] https://lore.kernel.org/linux-ext4/20230824092619.1327976-1-yi.zhang@huaweicloud.com/
> [4] https://lore.kernel.org/linux-ext4/20240105033018.1665752-1-yi.zhang@huaweicloud.com/
> 
> Thanks,
> Yi.
> 
> ---
> v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
> v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
> v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
> 
> Zhang Yi (34):
>   ext4: factor out a common helper to query extent map
>   ext4: check the extent status again before inserting delalloc block
>   ext4: trim delalloc extent
>   ext4: drop iblock parameter
>   ext4: make ext4_es_insert_delayed_block() insert multi-blocks
>   ext4: make ext4_da_reserve_space() reserve multi-clusters
>   ext4: factor out check for whether a cluster is allocated
>   ext4: make ext4_insert_delayed_block() insert multi-blocks
>   ext4: make ext4_da_map_blocks() buffer_head unaware
>   ext4: factor out ext4_map_create_blocks() to allocate new blocks
>   ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set
>   ext4: don't set EXTENT_STATUS_DELAYED on allocated blocks
>   ext4: let __revise_pending() return newly inserted pendings
>   ext4: count removed reserved blocks for delalloc only extent entry
>   ext4: update delalloc data reserve spcae in ext4_es_insert_extent()
>   ext4: drop ext4_es_delayed_clu()
>   ext4: use ext4_map_query_blocks() in ext4_map_blocks()
>   ext4: drop ext4_es_is_delonly()
>   ext4: drop all delonly descriptions
>   ext4: use reserved metadata blocks when splitting extent on endio
>   ext4: introduce seq counter for the extent status entry
>   ext4: add a new iomap aops for regular file's buffered IO path
>   ext4: implement buffered read iomap path
>   ext4: implement buffered write iomap path
>   ext4: implement writeback iomap path
>   ext4: implement mmap iomap path
>   ext4: implement zero_range iomap path
>   ext4: writeback partial blocks before zeroing out range
>   ext4: fall back to buffer_head path for defrag
>   ext4: partial enable iomap for regular file's buffered IO path
>   filemap: support disable large folios on active inode
>   ext4: enable large folio for regular file with iomap buffered IO path
>   ext4: don't mark IOMAP_F_DIRTY for buffer write
>   ext4: add mount option for buffered IO iomap path
> 


  parent reply	other threads:[~2024-04-11  1:12 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-10 14:29 [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
2024-04-10 14:29 ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
2024-04-26 11:55   ` Ritesh Harjani
2024-04-10 14:29 ` [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block Zhang Yi
2024-04-26 12:31   ` Ritesh Harjani
2024-04-26 12:57     ` Ritesh Harjani
2024-04-26 13:19       ` Zhang Yi
2024-04-26 16:39         ` Ritesh Harjani
2024-04-28  3:00           ` Zhang Yi
2024-04-29 14:59             ` Ritesh Harjani
2024-05-07  3:15               ` Zhang Yi
2024-05-01  7:47           ` Dave Chinner
2024-05-01  6:51   ` Dave Chinner
2024-05-01 12:19     ` Ritesh Harjani
2024-05-01 22:49       ` Dave Chinner
2024-05-02  4:11         ` Ritesh Harjani
2024-05-06  3:49           ` Zhang Yi
2024-04-10 14:29 ` [PATCH v4 03/34] ext4: trim delalloc extent Zhang Yi
2024-05-01 14:31   ` Ritesh Harjani
2024-05-06  6:15     ` Zhang Yi
2024-04-10 14:29 ` [PATCH v4 04/34] ext4: drop iblock parameter Zhang Yi
2024-05-01 14:41   ` Ritesh Harjani
2024-04-10 14:29 ` [PATCH v4 05/34] ext4: make ext4_es_insert_delayed_block() insert multi-blocks Zhang Yi
2024-04-10 14:29 ` [PATCH v4 06/34] ext4: make ext4_da_reserve_space() reserve multi-clusters Zhang Yi
2024-04-10 14:29 ` [PATCH v4 07/34] ext4: factor out check for whether a cluster is allocated Zhang Yi
2024-04-10 14:29 ` [PATCH v4 08/34] ext4: make ext4_insert_delayed_block() insert multi-blocks Zhang Yi
2024-04-10 14:29 ` [PATCH v4 09/34] ext4: make ext4_da_map_blocks() buffer_head unaware Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 10/34] ext4: factor out ext4_map_create_blocks() to allocate new blocks Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 11/34] ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 12/34] ext4: don't set EXTENT_STATUS_DELAYED on allocated blocks Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 13/34] ext4: let __revise_pending() return newly inserted pendings Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 14/34] ext4: count removed reserved blocks for delalloc only extent entry Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 15/34] ext4: update delalloc data reserve spcae in ext4_es_insert_extent() Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 16/34] ext4: drop ext4_es_delayed_clu() Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks() Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 18/34] ext4: drop ext4_es_is_delonly() Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 19/34] ext4: drop all delonly descriptions Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 20/34] ext4: use reserved metadata blocks when splitting extent on endio Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 21/34] ext4: introduce seq counter for the extent status entry Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 22/34] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 23/34] ext4: implement buffered read iomap path Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 24/34] ext4: implement buffered write " Zhang Yi
2024-05-01  8:11   ` Dave Chinner
2024-05-01  8:33     ` Dave Chinner
2024-05-06 11:44       ` Zhang Yi
2024-05-06 23:19         ` Dave Chinner
2024-05-07  5:10           ` Zhang Yi
2024-05-06 11:21     ` Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 25/34] ext4: implement writeback " Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 26/34] ext4: implement mmap " Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 27/34] ext4: implement zero_range " Zhang Yi
2024-05-01  9:40   ` Dave Chinner
2024-05-06 12:33     ` Zhang Yi
2024-04-10 14:29 ` [RFC PATCH v4 28/34] ext4: writeback partial blocks before zeroing out range Zhang Yi
2024-04-10 15:03 ` [RFC PATCH v4 29/34] ext4: fall back to buffer_head path for defrag Zhang Yi
2024-05-01  9:32   ` Dave Chinner
2024-05-06 13:05     ` Zhang Yi
2024-04-10 15:03 ` [RFC PATCH v4 30/34] ext4: partial enable iomap for regular file's buffered IO path Zhang Yi
2024-04-10 15:03 ` [RFC PATCH v4 31/34] filemap: support disable large folios on active inode Zhang Yi
2024-04-10 15:03 ` [RFC PATCH v4 32/34] ext4: enable large folio for regular file with iomap buffered IO path Zhang Yi
2024-04-10 15:03 ` [RFC PATCH v4 33/34] ext4: don't mark IOMAP_F_DIRTY for buffer write Zhang Yi
2024-05-01  9:27   ` Dave Chinner
2024-05-06 14:02     ` Zhang Yi
2024-04-10 15:03 ` [RFC PATCH v4 34/34] ext4: add mount option for buffered IO iomap path Zhang Yi
2024-04-11  1:12 ` Zhang Yi [this message]
2024-04-24  8:12 ` [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=88985780-ef76-74e6-468b-7ebb0998fb6b@huaweicloud.com \
    --to=yi.zhang@huaweicloud.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=chengzhihao1@huawei.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=wangkefeng.wang@huawei.com \
    --cc=willy@infradead.org \
    --cc=yi.zhang@huawei.com \
    --cc=yukuai3@huawei.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).