From: "Theodore Y. Ts'o" <>
To: Shyam Prasad N <>
Cc: David Howells <>,
	Steve French <>,
Subject: Re: Regarding ext4 extent allocation strategy
Date: Tue, 13 Jul 2021 07:39:16 -0400
Message-ID: <>
In-Reply-To: <>

On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> Our team in Microsoft, which works on the Linux SMB3 client kernel
> filesystem has recently been exploring the use of fscache on top of
> ext4 for caching the network filesystem data for some customer
> workloads.
> However, the maintainer of fscache (David Howells) recently warned us
> that a few other extent based filesystem developers pointed out a
> theoretical bug in the current implementation of fscache/cachefiles.
> It currently does not maintain a separate metadata for the cached data
> it holds, but instead uses the sparseness of the underlying filesystem
> to track the ranges of the data that is being cached.
> The bug that has been pointed out with this is that the underlying
> filesystems could bridge holes between data ranges with zeroes or
> punch hole in data ranges that contain zeroes. (@David please add if I
> missed something).
> David has already begun working on the fix to this by maintaining the
> metadata of the cached ranges in fscache itself.
> However, since it could take some time for this fix to be approved and
> then backported by various distros, I'd like to understand if there is
> a potential problem in using fscache on top of ext4 without the fix.
> If ext4 doesn't do any such optimizations on the data ranges, or has a
> way to disable such optimizations, I think we'll be okay to use the
> older versions of fscache even without the fix mentioned above.

Yes, the tuning knob you are looking for is:

What:		/sys/fs/ext4/<disk>/extent_max_zeroout_kb
Date:		August 2012
Contact:	"Theodore Ts'o" <>
Description:
		The maximum number of kilobytes which will be zeroed
		out in preference to creating a new uninitialized
		extent when manipulating an inode's extent tree.  Note
		that using a larger value will increase the
		variability of time necessary to complete a random
		write operation (since a 4k random write might turn
		into a much larger write due to the zeroout
		operation).

(From Documentation/ABI/testing/sysfs-fs-ext4)

The basic idea here is that for a random write workload on HDDs, the
cost of a 16k random write is not much more than the cost of a 4k
random write; that is, the cost of HDD seeks dominates.  There is also
a cost to having many additional entries in the extent tree.  So if we
have a fallocated region, e.g.:

... + Uninit (U)  | W | U | W |   Uninit | W | U | Written | ...

It's more efficient to have the extent tree look like this:

... + Uninit (U)  |  Written  |   Uninit | W | U | Written | ...

and simply write zeros to the first short "U" in the first figure
(the one sitting between the two "W" extents).
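
As a rough illustration of the two figures, here is a toy model in
Python.  It is not ext4's code: the kernel makes this decision per
extent at write time, while this sketch simply scans a list.  The
function name, the (state, length_kb) tuples, and all of the lengths
are made up for the example.

```python
from typing import List, Tuple

# (state, length_kb); state is "U" (uninitialized) or "W" (written).
Extent = Tuple[str, int]


def zeroout_merge(extents: List[Extent], max_zeroout_kb: int) -> List[Extent]:
    """Toy model of the zeroout trade-off, not the kernel algorithm.

    An uninitialized extent that sits between two written extents and is
    no longer than max_zeroout_kb is treated as zero-filled ("W").
    Adjacent extents with the same state are then coalesced.
    """
    converted: List[Extent] = []
    for i, (state, length) in enumerate(extents):
        between_written = (
            0 < i < len(extents) - 1
            and extents[i - 1][0] == "W"
            and extents[i + 1][0] == "W"
        )
        if state == "U" and between_written and 0 < length <= max_zeroout_kb:
            state = "W"  # zeros are written here instead of keeping the gap
        converted.append((state, length))

    merged: List[Extent] = []
    for state, length in converted:
        if merged and merged[-1][0] == state:
            merged[-1] = (state, merged[-1][1] + length)
        else:
            merged.append((state, length))
    return merged


# Hypothetical layout shaped like the first figure (lengths are invented).
layout = [("U", 64), ("W", 4), ("U", 8), ("W", 4),
          ("U", 128), ("W", 4), ("U", 64), ("W", 100)]

print(zeroout_merge(layout, 32))  # the 8k gap is absorbed: 8 extents become 6
print(zeroout_merge(layout, 0))   # threshold 0: layout is left unchanged
```

Note that in the first call the 8k range now reads back as written
data rather than as a hole, which is exactly the effect that matters
to anything tracking cached ranges through sparseness.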

The default value of extent_max_zeroout_kb is 32 (that is, 32k).  This
optimization can be disabled by setting extent_max_zeroout_kb to 0.
The downside of this is a potential degradation in the performance of
a random write workload (as measured using, for example, the fio
benchmark program) on that file system.


						- Ted


Thread overview: 7+ messages
2021-07-13  6:52 Shyam Prasad N
2021-07-13 11:39 ` Theodore Y. Ts'o [this message]
2021-07-13 12:57   ` Shyam Prasad N
2021-07-13 20:18     ` Theodore Y. Ts'o
2021-07-14  0:37       ` Shyam Prasad N
2022-02-18  3:18   ` Gao Xiang
2022-02-22  2:48     ` Gao Xiang
