From: "Theodore Y. Ts'o" <firstname.lastname@example.org>
To: Shyam Prasad N <email@example.com>
Cc: David Howells <firstname.lastname@example.org>,
    Steve French <email@example.com>,
    firstname.lastname@example.org
Subject: Re: Regarding ext4 extent allocation strategy
Date: Tue, 13 Jul 2021 07:39:16 -0400
Message-ID: <YO17ZNOcq+9PajfQ@mit.edu>
In-Reply-To: <CANT5p=o3i4kWQuMFF5zKQp04JnWEQnYuo+cvyH8asGMvTVBBkw@mail.gmail.com>

On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
>
> Our team in Microsoft, which works on the Linux SMB3 client kernel
> filesystem has recently been exploring the use of fscache on top of
> ext4 for caching the network filesystem data for some customer
> workloads.
>
> However, the maintainer of fscache (David Howells) recently warned us
> that a few other extent based filesystem developers pointed out a
> theoretical bug in the current implementation of fscache/cachefiles.
> It currently does not maintain a separate metadata for the cached data
> it holds, but instead uses the sparseness of the underlying filesystem
> to track the ranges of the data that is being cached.
> The bug that has been pointed out with this is that the underlying
> filesystems could bridge holes between data ranges with zeroes or
> punch hole in data ranges that contain zeroes. (@David please add if I
> missed something).
>
> David has already begun working on the fix to this by maintaining the
> metadata of the cached ranges in fscache itself.
> However, since it could take some time for this fix to be approved and
> then backported by various distros, I'd like to understand if there is
> a potential problem in using fscache on top of ext4 without the fix.
> If ext4 doesn't do any such optimizations on the data ranges, or has a
> way to disable such optimizations, I think we'll be okay to use the
> older versions of fscache even without the fix mentioned above.

Yes, the tuning knob you are looking for is:

What:           /sys/fs/ext4/<disk>/extent_max_zeroout_kb
Date:           August 2012
Contact:        "Theodore Ts'o" <email@example.com>
Description:
                The maximum number of kilobytes which will be zeroed
                out in preference to creating a new uninitialized
                extent when manipulating an inode's extent tree.  Note
                that using a larger value will increase the
                variability of time necessary to complete a random
                write operation (since a 4k random write might turn
                into a much larger write due to the zeroout
                operation).

(From Documentation/ABI/testing/sysfs-fs-ext4)

The basic idea here is that with a random workload, with HDDs, the
cost of writing a 16k random write is not much more than the time to
write a 4k random write; that is, the cost of HDD seeks dominates.
There is also a cost in having many additional entries in the extent
tree.  So if we have a fallocated region, e.g.:

    +-------------+---+---+---+----------+---+---+---------+ ...
    + Uninit (U)  | W | U | W |  Uninit  | W | U | Written | ...
    +-------------+---+---+---+----------+---+---+---------+

It's more efficient to have the extent tree look like this:

    +-------------+-----------+----------+---+---+---------+ ...
    + Uninit (U)  |  Written  |  Uninit  | W | U | Written | ...
    +-------------+-----------+----------+---+---+---------+

And just simply write zeros to the first "U" in the above figure.

The default value of extent_max_zeroout_kb is 32k.  This optimization
can be disabled by setting extent_max_zeroout_kb to 0.  The downside
of this is a potential degradation of a random write workload (using,
for example, the fio benchmark program) on that file system.

Cheers,

                                        - Ted
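
The knob described above is an ordinary sysfs file, so it can be read
and set from a short script.  What follows is a minimal sketch, not
anything taken from the kernel tree or from this thread: the helper
names (zeroout_path, read_zeroout_kb, write_zeroout_kb,
disable_zeroout) and the device name "sda1" are hypothetical, and the
only detail taken from the message is the path
/sys/fs/ext4/<disk>/extent_max_zeroout_kb and the fact that writing 0
disables the optimization.  Writing to the real file needs root.

```python
from pathlib import Path

# Location of the per-device ext4 knobs, as given in the ABI text above.
SYSFS_EXT4 = Path("/sys/fs/ext4")


def zeroout_path(disk: str, root: Path = SYSFS_EXT4) -> Path:
    """Path of extent_max_zeroout_kb for one device (hypothetical helper).

    `disk` is the <disk> component from the ABI entry, e.g. "sda1".
    `root` is a parameter only so the helpers can be tried on a scratch
    directory instead of the live sysfs tree.
    """
    return root / disk / "extent_max_zeroout_kb"


def read_zeroout_kb(disk: str, root: Path = SYSFS_EXT4) -> int:
    """Return the current zeroout limit in kilobytes."""
    return int(zeroout_path(disk, root).read_text().strip())


def write_zeroout_kb(disk: str, kb: int, root: Path = SYSFS_EXT4) -> None:
    """Set the zeroout limit in kilobytes; 0 disables the optimization."""
    if kb < 0:
        raise ValueError("extent_max_zeroout_kb must be >= 0")
    zeroout_path(disk, root).write_text(f"{kb}\n")


def disable_zeroout(disk: str, root: Path = SYSFS_EXT4) -> int:
    """Write 0 to the knob and return the previous value for later restore."""
    previous = read_zeroout_kb(disk, root)
    write_zeroout_kb(disk, 0, root)
    return previous
```

A caller would keep the value returned by disable_zeroout("sda1") and
pass it back to write_zeroout_kb() to restore the old setting.  The
value is written through sysfs at run time, so a script like this
would presumably need to be run again after a reboot or remount; that
is an assumption, not something stated in the thread.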