* Regarding ext4 extent allocation strategy
@ 2021-07-13  6:52 Shyam Prasad N
  2021-07-13 11:39 ` Theodore Y. Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Shyam Prasad N @ 2021-07-13  6:52 UTC (permalink / raw)
  To: tytso, David Howells, Steve French, linux-ext4

Hi,

Our team at Microsoft, which works on the Linux SMB3 client kernel
filesystem, has recently been exploring the use of fscache on top of
ext4 to cache network filesystem data for some customer workloads.

However, the maintainer of fscache (David Howells) recently warned us
that a few other extent-based filesystem developers have pointed out a
theoretical bug in the current implementation of fscache/cachefiles.
It does not maintain separate metadata for the cached data it holds;
instead, it uses the sparseness of the underlying filesystem to track
the ranges of data that are being cached.
The bug that has been pointed out is that the underlying filesystem
could bridge holes between data ranges with zeroes, or punch holes in
data ranges that contain zeroes. (@David please add if I missed
something).
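
To illustrate the dependency, here is a simplified userspace sketch of
how presence can be inferred from sparseness alone (the actual
cachefiles logic lives in-kernel and differs in detail; the file path
below is purely illustrative):

/*
 * If the backing filesystem bridges a hole with real zeroed-out
 * blocks, this probe reports the range as cached even though fscache
 * never wrote it -- which is the bug described above.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int range_looks_cached(int fd, off_t start, off_t len)
{
	off_t data = lseek(fd, start, SEEK_DATA);

	if (data == (off_t)-1)		/* ENXIO: no data at or beyond start */
		return 0;
	return data < start + len;	/* some data block inside the range */
}

int main(void)
{
	int fd = open("/var/cache/fscache/objfile", O_RDONLY); /* illustrative */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	printf("0..64k cached? %d\n", range_looks_cached(fd, 0, 65536));
	close(fd);
	return 0;
}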

David has already begun working on a fix, which maintains the metadata
of the cached ranges in fscache itself.
However, since it could take some time for this fix to be approved and
then backported by the various distros, I'd like to understand whether
there is a potential problem in using fscache on top of ext4 without
the fix. If ext4 doesn't do any such optimizations on the data ranges,
or has a way to disable them, I think we'll be okay to use the older
versions of fscache even without the fix mentioned above.

Opinions?

-- 
Regards,
Shyam


* Re: Regarding ext4 extent allocation strategy
  2021-07-13  6:52 Regarding ext4 extent allocation strategy Shyam Prasad N
@ 2021-07-13 11:39 ` Theodore Y. Ts'o
  2021-07-13 12:57   ` Shyam Prasad N
  2022-02-18  3:18   ` Gao Xiang
  0 siblings, 2 replies; 7+ messages in thread
From: Theodore Y. Ts'o @ 2021-07-13 11:39 UTC (permalink / raw)
  To: Shyam Prasad N; +Cc: David Howells, Steve French, linux-ext4

On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> 
> Our team at Microsoft, which works on the Linux SMB3 client kernel
> filesystem, has recently been exploring the use of fscache on top of
> ext4 to cache network filesystem data for some customer workloads.
> 
> However, the maintainer of fscache (David Howells) recently warned us
> that a few other extent-based filesystem developers have pointed out a
> theoretical bug in the current implementation of fscache/cachefiles.
> It does not maintain separate metadata for the cached data it holds;
> instead, it uses the sparseness of the underlying filesystem to track
> the ranges of data that are being cached.
> The bug that has been pointed out is that the underlying filesystem
> could bridge holes between data ranges with zeroes, or punch holes in
> data ranges that contain zeroes. (@David please add if I missed
> something).
> 
> David has already begun working on a fix, which maintains the metadata
> of the cached ranges in fscache itself.
> However, since it could take some time for this fix to be approved and
> then backported by the various distros, I'd like to understand whether
> there is a potential problem in using fscache on top of ext4 without
> the fix. If ext4 doesn't do any such optimizations on the data ranges,
> or has a way to disable them, I think we'll be okay to use the older
> versions of fscache even without the fix mentioned above.

Yes, the tuning knob you are looking for is:

What:		/sys/fs/ext4/<disk>/extent_max_zeroout_kb
Date:		August 2012
Contact:	"Theodore Ts'o" <tytso@mit.edu>
Description:
		The maximum number of kilobytes which will be zeroed
		out in preference to creating a new uninitialized
		extent when manipulating an inode's extent tree.  Note
		that using a larger value will increase the
		variability of time necessary to complete a random
		write operation (since a 4k random write might turn
		into a much larger write due to the zeroout
		operation).

(From Documentation/ABI/testing/sysfs-fs-ext4)

The basic idea here is that with a random write workload on HDDs, the
cost of a 16k random write is not much more than the cost of a 4k
random write; that is, the cost of HDD seeks dominates.  There is also
a cost in having many additional entries in the extent tree.  So if we
have a fallocated region, e.g.:

    +-------------+---+---+---+----------+---+---+---------+
... + Uninit (U)  | W | U | W |   Uninit | W | U | Written | ...
    +-------------+---+---+---+----------+---+---+---------+

It's more efficient to have the extent tree look like this:

    +-------------+-----------+----------+---+---+---------+
... + Uninit (U)  |  Written  |   Uninit | W | U | Written | ...
    +-------------+-----------+----------+---+---+---------+

and simply write zeros over the first small "U" (the one between the
first two "W" blocks) in the first figure.

The default value of extent_max_zeroout_kb is 32 (i.e., 32 KiB).  This
optimization can be disabled by setting extent_max_zeroout_kb to 0.
The downside of this is a potential degradation of random write
workloads (as measured using, for example, the fio benchmark program)
on that file system.
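
For instance, here is a minimal sketch of flipping the knob
programmatically (assuming the cache filesystem lives on sdb1 --
substitute the block device backing your cache; a shell
"echo 0 > /sys/fs/ext4/sdb1/extent_max_zeroout_kb" works just as well):

#include <stdio.h>

int main(void)
{
	/* "sdb1" is an illustrative device name. */
	const char *knob = "/sys/fs/ext4/sdb1/extent_max_zeroout_kb";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	fprintf(f, "0\n");	/* 0 == never zero out; always split extents */
	return fclose(f) ? 1 : 0;
}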

Cheers,

						- Ted


* Re: Regarding ext4 extent allocation strategy
  2021-07-13 11:39 ` Theodore Y. Ts'o
@ 2021-07-13 12:57   ` Shyam Prasad N
  2021-07-13 20:18     ` Theodore Y. Ts'o
  2022-02-18  3:18   ` Gao Xiang
  1 sibling, 1 reply; 7+ messages in thread
From: Shyam Prasad N @ 2021-07-13 12:57 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: David Howells, Steve French, linux-ext4

On Tue, Jul 13, 2021 at 5:09 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> >
> > Our team at Microsoft, which works on the Linux SMB3 client kernel
> > filesystem, has recently been exploring the use of fscache on top of
> > ext4 to cache network filesystem data for some customer workloads.
> >
> > However, the maintainer of fscache (David Howells) recently warned us
> > that a few other extent-based filesystem developers have pointed out a
> > theoretical bug in the current implementation of fscache/cachefiles.
> > It does not maintain separate metadata for the cached data it holds;
> > instead, it uses the sparseness of the underlying filesystem to track
> > the ranges of data that are being cached.
> > The bug that has been pointed out is that the underlying filesystem
> > could bridge holes between data ranges with zeroes, or punch holes in
> > data ranges that contain zeroes. (@David please add if I missed
> > something).
> >
> > David has already begun working on a fix, which maintains the metadata
> > of the cached ranges in fscache itself.
> > However, since it could take some time for this fix to be approved and
> > then backported by the various distros, I'd like to understand whether
> > there is a potential problem in using fscache on top of ext4 without
> > the fix. If ext4 doesn't do any such optimizations on the data ranges,
> > or has a way to disable them, I think we'll be okay to use the older
> > versions of fscache even without the fix mentioned above.
>
> Yes, the tuning knob you are looking for is:
>
> What:           /sys/fs/ext4/<disk>/extent_max_zeroout_kb
> Date:           August 2012
> Contact:        "Theodore Ts'o" <tytso@mit.edu>
> Description:
>                 The maximum number of kilobytes which will be zeroed
>                 out in preference to creating a new uninitialized
>                 extent when manipulating an inode's extent tree.  Note
>                 that using a larger value will increase the
>                 variability of time necessary to complete a random
>                 write operation (since a 4k random write might turn
>                 into a much larger write due to the zeroout
>                 operation).
>
> (From Documentation/ABI/testing/sysfs-fs-ext4)
>
> The basic idea here is that with a random write workload on HDDs, the
> cost of a 16k random write is not much more than the cost of a 4k
> random write; that is, the cost of HDD seeks dominates.  There is also
> a cost in having many additional entries in the extent tree.  So if we
> have a fallocated region, e.g.:
>
>     +-------------+---+---+---+----------+---+---+---------+
> ... + Uninit (U)  | W | U | W |   Uninit | W | U | Written | ...
>     +-------------+---+---+---+----------+---+---+---------+
>
> It's more efficient to have the extent tree look like this:
>
>     +-------------+-----------+----------+---+---+---------+
> ... + Uninit (U)  |  Written  |   Uninit | W | U | Written | ...
>     +-------------+-----------+----------+---+---+---------+
>
> and simply write zeros over the first small "U" (the one between the
> first two "W" blocks) in the first figure.
>
> The default value of extent_max_zeroout_kb is 32 (i.e., 32 KiB).  This
> optimization can be disabled by setting extent_max_zeroout_kb to 0.
> The downside of this is a potential degradation of random write
> workloads (as measured using, for example, the fio benchmark program)
> on that file system.
>
> Cheers,
>
>                                                 - Ted

Hi Ted,

Thanks for pointing this out. We'll look into the use of this option.

Also, is this parameter respected when a hole is punched in the middle
of an allocated data extent? I.e., is there still a possibility that a
punched hole does not translate to splitting the data extent, even
when extent_max_zeroout_kb is set to 0?

-- 
Regards,
Shyam


* Re: Regarding ext4 extent allocation strategy
  2021-07-13 12:57   ` Shyam Prasad N
@ 2021-07-13 20:18     ` Theodore Y. Ts'o
  2021-07-14  0:37       ` Shyam Prasad N
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Y. Ts'o @ 2021-07-13 20:18 UTC (permalink / raw)
  To: Shyam Prasad N; +Cc: David Howells, Steve French, linux-ext4

On Tue, Jul 13, 2021 at 06:27:37PM +0530, Shyam Prasad N wrote:
> 
> Also, is this parameter respected when a hole is punched in the middle
> of an allocated data extent? I.e., is there still a possibility that a
> punched hole does not translate to splitting the data extent, even
> when extent_max_zeroout_kb is set to 0?

Ext4 doesn't ever try to zero blocks as part of a punch operation.
It's true that a file system is allowed to do so, but I would guess
most wouldn't, since the presumption is that userspace is actually
trying to free up disk space, and so you would want to release the
disk blocks in the punch-hole case.

The more interesting one is the FALLOC_FL_ZERO_RANGE operation, which
*should* work by transitioning the extent to be uninitialized, but
there might be cases where writing a few zero blocks is faster.  That
should use the same code path, which respects the
extent_max_zeroout_kb configuration parameter for ext4.
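
For reference, here is what the two operations look like from
userspace (a minimal sketch with most error handling trimmed; the file
name and offsets are arbitrary):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Release blocks 4k..8k and leave a real hole (punching
	 * requires FALLOC_FL_KEEP_SIZE): */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4096, 4096))
		perror("punch hole");

	/* Zero blocks 8k..12k, normally by marking the extents
	 * uninitialized; small ranges may be zeroed out instead, which
	 * is the path governed by extent_max_zeroout_kb: */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 8192, 4096))
		perror("zero range");

	close(fd);
	return 0;
}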

Cheers,

					- Ted


* Re: Regarding ext4 extent allocation strategy
  2021-07-13 20:18     ` Theodore Y. Ts'o
@ 2021-07-14  0:37       ` Shyam Prasad N
  0 siblings, 0 replies; 7+ messages in thread
From: Shyam Prasad N @ 2021-07-14  0:37 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: David Howells, Steve French, linux-ext4

On Wed, Jul 14, 2021 at 1:48 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Jul 13, 2021 at 06:27:37PM +0530, Shyam Prasad N wrote:
> >
> > Also, is this parameter respected when a hole is punched in the middle
> > of an allocated data extent? I.e., is there still a possibility that a
> > punched hole does not translate to splitting the data extent, even
> > when extent_max_zeroout_kb is set to 0?
>
> Ext4 doesn't ever try to zero blocks as part of a punch operation.
> It's true that a file system is allowed to do so, but I would guess
> most wouldn't, since the presumption is that userspace is actually
> trying to free up disk space, and so you would want to release the
> disk blocks in the punch-hole case.
>
> The more interesting one is the FALLOC_FL_ZERO_RANGE operation, which
> *should* work by transitioning the extent to be uninitialized, but
> there might be cases where writing a few zero blocks is faster.  That
> should use the same code path, which respects the
> extent_max_zeroout_kb configuration parameter for ext4.
>
> Cheers,
>
>                                         - Ted

Thanks a lot for your replies, Ted. This was useful.

-- 
Regards,
Shyam


* Re: Regarding ext4 extent allocation strategy
  2021-07-13 11:39 ` Theodore Y. Ts'o
  2021-07-13 12:57   ` Shyam Prasad N
@ 2022-02-18  3:18   ` Gao Xiang
  2022-02-22  2:48     ` Gao Xiang
  1 sibling, 1 reply; 7+ messages in thread
From: Gao Xiang @ 2022-02-18  3:18 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Shyam Prasad N, David Howells, Steve French, linux-ext4,
	Jeffle Xu, bo.liu, tao.peng

Hi Ted and David,

On Tue, Jul 13, 2021 at 07:39:16AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> > 
> > Our team at Microsoft, which works on the Linux SMB3 client kernel
> > filesystem, has recently been exploring the use of fscache on top of
> > ext4 to cache network filesystem data for some customer workloads.
> > 
> > However, the maintainer of fscache (David Howells) recently warned us
> > that a few other extent-based filesystem developers have pointed out a
> > theoretical bug in the current implementation of fscache/cachefiles.
> > It does not maintain separate metadata for the cached data it holds;
> > instead, it uses the sparseness of the underlying filesystem to track
> > the ranges of data that are being cached.
> > The bug that has been pointed out is that the underlying filesystem
> > could bridge holes between data ranges with zeroes, or punch holes in
> > data ranges that contain zeroes. (@David please add if I missed
> > something).
> > 
> > David has already begun working on a fix, which maintains the metadata
> > of the cached ranges in fscache itself.
> > However, since it could take some time for this fix to be approved and
> > then backported by the various distros, I'd like to understand whether
> > there is a potential problem in using fscache on top of ext4 without
> > the fix. If ext4 doesn't do any such optimizations on the data ranges,
> > or has a way to disable them, I think we'll be okay to use the older
> > versions of fscache even without the fix mentioned above.
> 
> Yes, the tuning knob you are looking for is:
> 
> What:		/sys/fs/ext4/<disk>/extent_max_zeroout_kb
> Date:		August 2012
> Contact:	"Theodore Ts'o" <tytso@mit.edu>
> Description:
> 		The maximum number of kilobytes which will be zeroed
> 		out in preference to creating a new uninitialized
> 		extent when manipulating an inode's extent tree.  Note
> 		that using a larger value will increase the
> 		variability of time necessary to complete a random
> 		write operation (since a 4k random write might turn
> 		into a much larger write due to the zeroout
> 		operation).
> 
> (From Documentation/ABI/testing/sysfs-fs-ext4)
> 
> The basic idea here is that with a random write workload on HDDs, the
> cost of a 16k random write is not much more than the cost of a 4k
> random write; that is, the cost of HDD seeks dominates.  There is also
> a cost in having many additional entries in the extent tree.  So if we
> have a fallocated region, e.g.:
> 
>     +-------------+---+---+---+----------+---+---+---------+
> ... + Uninit (U)  | W | U | W |   Uninit | W | U | Written | ...
>     +-------------+---+---+---+----------+---+---+---------+
> 
> It's more efficient to have the extent tree look like this:
> 
>     +-------------+-----------+----------+---+---+---------+
> ... + Uninit (U)  |  Written  |   Uninit | W | U | Written | ...
>     +-------------+-----------+----------+---+---+---------+
> 
> and simply write zeros over the first small "U" (the one between the
> first two "W" blocks) in the first figure.
> 
> The default value of extent_max_zeroout_kb is 32 (i.e., 32 KiB).  This
> optimization can be disabled by setting extent_max_zeroout_kb to 0.
> The downside of this is a potential degradation of random write
> workloads (as measured using, for example, the fio benchmark program)
> on that file system.
> 

As far as I understand what cachefiles does, it just truncates a
sparse file with a big hole, and only ever does direct I/O to fill in
the holes.
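
In other words, something like the following (a simplified userspace
sketch of my understanding of the write side; the real code is
in-kernel, and the path, sizes and offsets here are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd = open("cachefile", O_RDWR | O_CREAT | O_DIRECT, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	ftruncate(fd, 1ULL << 30);	/* sparse 1 GiB file, all hole */

	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	/* ... fill buf with one page fetched from the network ... */

	/* The direct write fills exactly this page; the rest of the
	 * file remains a hole. */
	if (pwrite(fd, buf, 4096, 65536) != 4096)
		perror("pwrite");

	free(buf);
	close(fd);
	return 0;
}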

But the description above is all about (un)written extents, which
already have physical blocks allocated, just without data
initialization, so the middle extent can be zeroed out and the
extents merged into one bigger written extent.

However, IMO, that is not the case for the current cachefiles
behavior... I think it would be rare for a local fs to allocate
blocks for real holes during direct I/O, and then zero out and merge
the extents, since at the very least that would affect disk quota.

David pointed me to this message yesterday, since we're implementing
an on-demand read feature using cachefiles as well. But I still fail
to understand why the current cachefiles behavior is wrong.

Could you kindly leave more hints about this? Many thanks!

Thanks,
Gao Xiang

> Cheers,
> 
> 						- Ted


* Re: Regarding ext4 extent allocation strategy
  2022-02-18  3:18   ` Gao Xiang
@ 2022-02-22  2:48     ` Gao Xiang
  0 siblings, 0 replies; 7+ messages in thread
From: Gao Xiang @ 2022-02-22  2:48 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Shyam Prasad N, David Howells, Steve French, linux-ext4,
	Jeffle Xu, bo.liu, tao.peng

Hi Ted,

Sorry for pinging so quickly, but this is quite important for the
container on-demand loading use cases (and perhaps other on-demand
distribution use cases as well). We still prefer the cachefiles
approach, since its data plane doesn't cross the kernel-userspace
boundary when the data is ready (which is the common case once the
data has been fetched from the network).

Many thanks again!
Gao Xiang

On Fri, Feb 18, 2022 at 11:18:14AM +0800, Gao Xiang wrote:
> Hi Ted and David,
> 
> On Tue, Jul 13, 2021 at 07:39:16AM -0400, Theodore Y. Ts'o wrote:
> > On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> > > 
> > > Our team at Microsoft, which works on the Linux SMB3 client kernel
> > > filesystem, has recently been exploring the use of fscache on top of
> > > ext4 to cache network filesystem data for some customer workloads.
> > > 
> > > However, the maintainer of fscache (David Howells) recently warned us
> > > that a few other extent-based filesystem developers have pointed out a
> > > theoretical bug in the current implementation of fscache/cachefiles.
> > > It does not maintain separate metadata for the cached data it holds;
> > > instead, it uses the sparseness of the underlying filesystem to track
> > > the ranges of data that are being cached.
> > > The bug that has been pointed out is that the underlying filesystem
> > > could bridge holes between data ranges with zeroes, or punch holes in
> > > data ranges that contain zeroes. (@David please add if I missed
> > > something).
> > > 
> > > David has already begun working on a fix, which maintains the metadata
> > > of the cached ranges in fscache itself.
> > > However, since it could take some time for this fix to be approved and
> > > then backported by the various distros, I'd like to understand whether
> > > there is a potential problem in using fscache on top of ext4 without
> > > the fix. If ext4 doesn't do any such optimizations on the data ranges,
> > > or has a way to disable them, I think we'll be okay to use the older
> > > versions of fscache even without the fix mentioned above.
> > 
> > Yes, the tuning knob you are looking for is:
> > 
> > What:		/sys/fs/ext4/<disk>/extent_max_zeroout_kb
> > Date:		August 2012
> > Contact:	"Theodore Ts'o" <tytso@mit.edu>
> > Description:
> > 		The maximum number of kilobytes which will be zeroed
> > 		out in preference to creating a new uninitialized
> > 		extent when manipulating an inode's extent tree.  Note
> > 		that using a larger value will increase the
> > 		variability of time necessary to complete a random
> > 		write operation (since a 4k random write might turn
> > 		into a much larger write due to the zeroout
> > 		operation).
> > 
> > (From Documentation/ABI/testing/sysfs-fs-ext4)
> > 
> > The basic idea here is that with a random write workload on HDDs, the
> > cost of a 16k random write is not much more than the cost of a 4k
> > random write; that is, the cost of HDD seeks dominates.  There is also
> > a cost in having many additional entries in the extent tree.  So if we
> > have a fallocated region, e.g.:
> > 
> >     +-------------+---+---+---+----------+---+---+---------+
> > ... + Uninit (U)  | W | U | W |   Uninit | W | U | Written | ...
> >     +-------------+---+---+---+----------+---+---+---------+
> > 
> > It's more efficient to have the extent tree look like this:
> > 
> >     +-------------+-----------+----------+---+---+---------+
> > ... + Uninit (U)  |  Written  |   Uninit | W | U | Written | ...
> >     +-------------+-----------+----------+---+---+---------+
> > 
> > and simply write zeros over the first small "U" (the one between the
> > first two "W" blocks) in the first figure.
> > 
> > The default value of extent_max_zeroout_kb is 32 (i.e., 32 KiB).  This
> > optimization can be disabled by setting extent_max_zeroout_kb to 0.
> > The downside of this is a potential degradation of random write
> > workloads (as measured using, for example, the fio benchmark program)
> > on that file system.
> > 
> 
> As far as I understand what cachefiles does, it just truncates a
> sparse file with a big hole, and only ever does direct I/O to fill in
> the holes.
> 
> But the description above is all about (un)written extents, which
> already have physical blocks allocated, just without data
> initialization, so the middle extent can be zeroed out and the
> extents merged into one bigger written extent.
> 
> However, IMO, that is not the case for the current cachefiles
> behavior... I think it would be rare for a local fs to allocate
> blocks for real holes during direct I/O, and then zero out and merge
> the extents, since at the very least that would affect disk quota.
> 
> David pointed me to this message yesterday, since we're implementing
> an on-demand read feature using cachefiles as well. But I still fail
> to understand why the current cachefiles behavior is wrong.
> 
> Could you kindly leave more hints about this? Many thanks!
> 
> Thanks,
> Gao Xiang
> 
> > Cheers,
> > 
> > 						- Ted
