From: "Theodore Y. Ts'o" <>
To: Shyam Prasad N <>
Cc: David Howells <>,
	Steve French <>,
Subject: Re: Regarding ext4 extent allocation strategy
Date: Tue, 13 Jul 2021 07:39:16 -0400	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> Our team in Microsoft, which works on the Linux SMB3 client kernel
> filesystem has recently been exploring the use of fscache on top of
> ext4 for caching the network filesystem data for some customer
> workloads.
> However, the maintainer of fscache (David Howells) recently warned us
> that a few other extent based filesystem developers pointed out a
> theoretical bug in the current implementation of fscache/cachefiles.
> It currently does not maintain separate metadata for the cached data
> it holds, but instead uses the sparseness of the underlying filesystem
> to track the ranges of data that are being cached.
> The bug that has been pointed out with this is that the underlying
> filesystems could bridge holes between data ranges with zeroes or
> punch holes in data ranges that contain zeroes. (@David please add if I
> missed something).
> David has already begun working on the fix to this by maintaining the
> metadata of the cached ranges in fscache itself.
> However, since it could take some time for this fix to be approved and
> then backported by various distros, I'd like to understand if there is
> a potential problem in using fscache on top of ext4 without the fix.
> If ext4 doesn't do any such optimizations on the data ranges, or has a
> way to disable such optimizations, I think we'll be okay to use the
> older versions of fscache even without the fix mentioned above.

Yes, the tuning knob you are looking for is:

What:		/sys/fs/ext4/<disk>/extent_max_zeroout_kb
Date:		August 2012
Contact:	"Theodore Ts'o" <>
		The maximum number of kilobytes which will be zeroed
		out in preference to creating a new uninitialized
		extent when manipulating an inode's extent tree.  Note
		that using a larger value will increase the
		variability of time necessary to complete a random
		write operation (since a 4k random write might turn
		into a much larger write due to the zeroout operation).

(From Documentation/ABI/testing/sysfs-fs-ext4)

The basic idea here is that with a random workload on HDDs, the
cost of a 16k random write is not much more than the cost of a
4k random write; that is, the cost of HDD seeks dominates.
There is also a cost in having many additional entries in the extent
tree.  So if we have a fallocated region, e.g.:

... + Uninit (U)  | W | U | W |   Uninit | W | U | Written | ...

It's more efficient to have the extent tree look like this:

... + Uninit (U)  |  Written  |   Uninit | W | U | Written | ...

And just write zeros to the first standalone "U" (the one between the
first two "W"s) in the first figure.
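
To make the trade-off concrete, here is a toy model of that decision.
It is only a sketch and not ext4's actual code: the Extent tuple, the
function names, and the rule "zero out the whole uninit extent when it
is no longer than the limit" are simplifying assumptions made for
illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (state, length in KiB).
# "U" = uninitialized (fallocated, unwritten), "W" = written.
Extent = Tuple[str, int]


def merge_adjacent(extents: List[Extent]) -> List[Extent]:
    """Coalesce neighbouring extents that share the same state."""
    merged: List[Extent] = []
    for state, length in extents:
        if merged and merged[-1][0] == state:
            merged[-1] = (state, merged[-1][1] + length)
        else:
            merged.append((state, length))
    return merged


def write_4k(extents: List[Extent], index: int, offset_kb: int,
             max_zeroout_kb: int) -> List[Extent]:
    """Model a 4 KiB write at offset_kb inside the uninit extent extents[index]."""
    state, length = extents[index]
    assert state == "U" and 0 <= offset_kb <= length - 4
    if length <= max_zeroout_kb:
        # Small enough: zero the remainder and mark the whole extent written.
        pieces = [("W", length)]
    else:
        # Too large: split into uninit head, written 4k, uninit tail.
        pieces = [("U", offset_kb), ("W", 4), ("U", length - offset_kb - 4)]
        pieces = [p for p in pieces if p[1] > 0]
    return merge_adjacent(extents[:index] + pieces + extents[index + 1:])


layout = [("U", 64), ("W", 4), ("U", 8), ("W", 4), ("U", 64)]
with_zeroout = write_4k(layout, 2, 0, max_zeroout_kb=32)
without_zeroout = write_4k(layout, 2, 0, max_zeroout_kb=0)
print(len(with_zeroout), with_zeroout)
print(len(without_zeroout), without_zeroout)
```

In this model the 32k limit collapses five extents into three, at the
price of writing 8k instead of 4k; with the limit at 0 the tree keeps
five entries and the unwritten 4k stays a hole-like uninit range, which
is what a sparseness-based cache needs.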

The default value of extent_max_zeroout_kb is 32k.  This optimization
can be disabled by setting extent_max_zeroout_kb to 0.  The downside
of this is a potential degradation of a random write workload (as
measured, for example, with the fio benchmark program) on that file
system.
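
For scripting this, a minimal sketch of reading and changing the knob
could look like the following.  The helper names and the "root"
parameter are assumptions made for illustration (the parameter exists
only so the code can be pointed at a test directory); the sysfs path
itself is the one documented above.  Writing to the file normally
requires root.

```python
from pathlib import Path

# Documented location: /sys/fs/ext4/<disk>/extent_max_zeroout_kb
SYSFS_EXT4 = Path("/sys/fs/ext4")


def knob_path(disk: str, root: Path = SYSFS_EXT4) -> Path:
    """Return the path of the tunable for one mounted ext4 device."""
    return Path(root) / disk / "extent_max_zeroout_kb"


def get_max_zeroout_kb(disk: str, root: Path = SYSFS_EXT4) -> int:
    """Read the current limit in kilobytes."""
    return int(knob_path(disk, root).read_text().strip())


def set_max_zeroout_kb(disk: str, value_kb: int, root: Path = SYSFS_EXT4) -> None:
    """Write a new limit; 0 disables the zeroout optimization."""
    if value_kb < 0:
        raise ValueError("extent_max_zeroout_kb must be non-negative")
    knob_path(disk, root).write_text(f"{value_kb}\n")


# Example (hypothetical device name, run as root):
#   print(get_max_zeroout_kb("sdb1"))
#   set_max_zeroout_kb("sdb1", 0)
```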


						- Ted
