* [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
@ 2023-02-06 13:41 ` Hans Holmberg
  2023-02-06 14:24   ` Johannes Thumshirn
                     ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Hans Holmberg @ 2023-02-06 13:41 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jaegeuk, josef, Matias Bjørling, Damien Le Moal,
	Dennis Maisenbacher, Naohiro Aota, Johannes Thumshirn,
	Aravind Ramesh, Jørgen Hansen, mcgrof, javier, hch,
	a.manzanares, guokuankuan, viacheslav.dubeyko, j.granados

Write amplification induced by garbage collection negatively impacts
both the performance and the lifetime of storage devices.

With zoned storage now standardized for SMR hard drives
and flash (both NVMe and UFS), we have an interface that allows
us to reduce this overhead by adapting file systems to do
better data placement.

Background
----------

Zoned block devices enable the host to reduce the cost of
reclaim/garbage collection/cleaning by exposing the media erase
units as zones.

By filling up zones with data from files that will
have roughly the same life span, garbage collection I/O
can be minimized, reducing write amplification: less disk
I/O per user write.

Reducing the amount of garbage collection I/O improves the maximum
user read and write throughput as well as tail latencies, see [1].

Migrating out still-valid data to erase and reclaim unused
capacity in e.g. NAND blocks has a significant performance
cost. Unnecessarily moving data around also means that there
will be more erase cycles per user write, reducing the life
time of the media.

Current state
-------------

To enable the performance benefits of zoned block devices
a file system needs to:

1) Comply with the write restrictions associated with the
zoned device model (zones must be written sequentially at the
write pointer; see the sketch below).

2) Make active choices when allocating file data into zones
to minimize GC.
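
As a point of reference, the per-zone write pointers that impose
restriction 1) can be inspected from user space with the BLKREPORTZONE
ioctl. A minimal sketch in C (the device path is just an example and
most error handling is trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(void)
{
	/* Example device path; any zoned block device would do. */
	int fd = open("/dev/nullb0", O_RDONLY);
	unsigned int nr = 8, i;
	struct blk_zone_report *rep;

	if (fd < 0)
		return 1;

	rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
	rep->sector = 0;	/* report from the first zone onwards */
	rep->nr_zones = nr;

	if (ioctl(fd, BLKREPORTZONE, rep))
		return 1;

	/* The kernel fills in the actual number of zones reported. */
	for (i = 0; i < rep->nr_zones; i++)
		printf("zone %u: start %llu len %llu wp %llu cond %u\n", i,
		       (unsigned long long)rep->zones[i].start,
		       (unsigned long long)rep->zones[i].len,
		       (unsigned long long)rep->zones[i].wp,
		       (unsigned int)rep->zones[i].cond);

	free(rep);
	close(fd);
	return 0;
}

Each sequential-write-required zone has to be written at its write
pointer and reset as a whole, which is what drives the allocation
choices discussed below.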

Out of the upstream file systems, btrfs and f2fs support
the zoned block device model. F2fs supports active data placement
by separating cold from hot data, which helps in reducing GC,
but there is room for improvement.


There is still work to be done
------------------------------

I've spent a fair amount of time benchmarking btrfs and f2fs
on top of zoned block devices, along with xfs, ext4 and other
file systems using the conventional block interface, and at
least for modern applications doing log-structured,
flash-friendly writes, much can be improved.

A good example of a flash-friendly workload is RocksDB [6],
which both does append-only writes and has a good prediction model
for the lifetime of its files (due to its lsm-tree based data structures).

For RocksDB workloads, the cost of garbage collection can be reduced
by 3x if near-optimal data placement is done (at 80% capacity usage).
This is based on comparing ZenFS[2], a zoned storage file system plugin
for RocksDB, with f2fs, xfs, ext4 and btrfs.

I see no good reason why Linux kernel file systems (at least f2fs & btrfs)
could not play as nicely with these workloads as ZenFS does, simply by
allocating file data blocks in a better way.

In addition to ZenFS we also have flex-alloc [5].
There are probably more data placement schemes for zoned storage out there.

I think we need to implement a scheme that is general-purpose enough
for in-kernel file systems to cover a wide range of use cases and workloads.

I brought this up at LPC last year[4], but we did not have much time
for discussions.

What is missing
---------------

Data needs to be allocated to zones in a way that minimizes the need for
reclaim. Best-effort placement decision making could be implemented to place
files with similar lifetimes into the same zones.

To do this, file systems would have to utilize some sort of hint to
separate data into different lifetime buckets and map those to
different zones.

There is a user ABI for hints available - the write-life-time hint interface
that was introduced for streams [3]. F2FS is the only user of this currently.
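
For reference, the hint is attached per inode with fcntl(); the four
relative values are RWH_WRITE_LIFE_SHORT, MEDIUM, LONG and EXTREME.
A minimal sketch (the file path is just a made-up example of a
short-lived LSM-tree output file):

#define _GNU_SOURCE
#include <fcntl.h>	/* F_SET_RW_HINT, RWH_WRITE_LIFE_*; <linux/fcntl.h> on older glibc */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical short-lived file, e.g. an L0 SST that will soon
	 * be compacted away. */
	int fd = open("/mnt/db/000042.sst", O_CREAT | O_WRONLY, 0644);
	uint64_t hint = RWH_WRITE_LIFE_SHORT;

	if (fd < 0)
		return 1;

	/* Tell the file system that data written to this inode is
	 * expected to have a short lifetime. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");

	/* ... regular writes follow ... */
	close(fd);
	return 0;
}

F2FS, for example, maps these hints onto its hot/warm/cold data logs;
mapping them onto separate open zones is the kind of thing being
discussed here.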

BTRFS and other file systems with zoned support could make use of it too,
but it is limited to four relative lifetime values, which I'm afraid would be
too limiting when multiple users share a disk.

Maybe the lifetime hints could be combined with the process id to separate
different workloads better; maybe we need something else. F2FS supports
cold/hot data separation based on file extension, which is another solution.

This is the first thing I'd like to discuss.

The second thing I'd like to discuss is testing and benchmarking, which
is probably even more important and something that should be put into
place first.

Testing/benchmarking
--------------------

I think any improvements must be measurable, preferably without having to
run live production application workloads.

Benchmarking and testing is generally hard to get right, and particularly hard
when it comes to testing and benchmarking reclaim/garbage collection,
so it would make sense to share the effort.

We should be able to use fio to model a bunch of application workloads
that would benefit from data placement (lsm-tree based key-value stores
such as RocksDB and TerarkDB, stream processing apps like Apache Kafka).

Once we have a set of benchmarks that we collectively care about, I think we
can work towards generic data placement methods with some level of
confidence that they will actually work in practice.

Creating a repository with a bunch of reclaim/gc stress tests and benchmarks
would be beneficial not only for kernel file systems but also for user-space
and distributed file systems such as ceph.

Thanks,
Hans

[1] https://www.usenix.org/system/files/atc21-bjorling.pdf
[2] https://github.com/westerndigitalcorporation/zenfs
[3] https://lwn.net/Articles/726477/
[4] https://lpc.events/event/16/contributions/1231/
[5] https://github.com/OpenMPDK/FlexAlloc
[6] https://github.com/facebook/rocksdb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-06 13:41 ` [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices Hans Holmberg
@ 2023-02-06 14:24   ` Johannes Thumshirn
  2023-02-07 12:31     ` Hans Holmberg
  2023-02-07 19:53   ` [External] " Viacheslav A.Dubeyko
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Johannes Thumshirn @ 2023-02-06 14:24 UTC (permalink / raw)
  To: Hans Holmberg, linux-fsdevel
  Cc: jaegeuk, josef, Matias Bjørling, Damien Le Moal,
	Dennis Maisenbacher, Naohiro Aota, Aravind Ramesh,
	Jørgen Hansen, mcgrof, javier, hch, a.manzanares,
	guokuankuan, viacheslav.dubeyko, j.granados, Boris Burkov

On 06.02.23 14:41, Hans Holmberg wrote:
> Out of the upstream file systems, btrfs and f2fs supports
> the zoned block device model. F2fs supports active data placement
> by separating cold from hot data which helps in reducing gc,
> but there is room for improvement.

FYI, there's a patchset [1] from Boris for btrfs which uses different
size classes to further parallelize placement. As of now it leaves out
ZNS drives, as this can clash with the MOZ/MAZ limits, but once active
zone tracking is fully bug free^TM we should look into using these
allocator hints for ZNS as well.

The hot/cold data can be a second placement hint, of course, not just the
different size classes of an extent.

[1] https://lore.kernel.org/linux-btrfs/cover.1668795992.git.boris@bur.io

Byte,
	Johannes

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-06 14:24   ` Johannes Thumshirn
@ 2023-02-07 12:31     ` Hans Holmberg
  2023-02-07 17:46       ` Boris Burkov
  0 siblings, 1 reply; 12+ messages in thread
From: Hans Holmberg @ 2023-02-07 12:31 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Hans Holmberg, linux-fsdevel, jaegeuk, josef,
	Matias Bjørling, Damien Le Moal, Dennis Maisenbacher,
	Naohiro Aota, Aravind Ramesh, Jørgen Hansen, mcgrof, javier,
	hch, a.manzanares, guokuankuan, viacheslav.dubeyko, j.granados,
	Boris Burkov

On Mon, Feb 6, 2023 at 3:24 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 06.02.23 14:41, Hans Holmberg wrote:
> > Out of the upstream file systems, btrfs and f2fs supports
> > the zoned block device model. F2fs supports active data placement
> > by separating cold from hot data which helps in reducing gc,
> > but there is room for improvement.
>
> FYI, there's a patchset [1] from Boris for btrfs which uses different
> size classes to further parallelize placement. As of now it leaves out
> ZNS drives, as this can clash with the MOZ/MAZ limits but once active
> zone tracking is fully bug free^TM we should look into using these
> allocator hints for ZNS as well.
>

That looks like a great start!

Via that patch series I also found Josef's fsperf repo [1], which is
exactly what I have been looking for: a set of common tests for file
system performance. I hope that it can be extended with longer-running
tests doing several disk overwrites with application-like workloads.

> The hot/cold data can be a 2nd placement hint, of cause, not just the
> different size classes of an extent.

Yes. I'll dig into the patches and see if I can figure out how that
could be done.

Cheers,
Hans

[1] https://github.com/josefbacik/fsperf

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-07 12:31     ` Hans Holmberg
@ 2023-02-07 17:46       ` Boris Burkov
  2023-02-08  9:16         ` Hans Holmberg
  0 siblings, 1 reply; 12+ messages in thread
From: Boris Burkov @ 2023-02-07 17:46 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: Johannes Thumshirn, Hans Holmberg, linux-fsdevel, jaegeuk, josef,
	Matias Bjørling, Damien Le Moal, Dennis Maisenbacher,
	Naohiro Aota, Aravind Ramesh, Jørgen Hansen, mcgrof, javier,
	hch, a.manzanares, guokuankuan, viacheslav.dubeyko, j.granados

On Tue, Feb 07, 2023 at 01:31:44PM +0100, Hans Holmberg wrote:
> On Mon, Feb 6, 2023 at 3:24 PM Johannes Thumshirn
> <Johannes.Thumshirn@wdc.com> wrote:
> >
> > On 06.02.23 14:41, Hans Holmberg wrote:
> > > Out of the upstream file systems, btrfs and f2fs supports
> > > the zoned block device model. F2fs supports active data placement
> > > by separating cold from hot data which helps in reducing gc,
> > > but there is room for improvement.
> >
> > FYI, there's a patchset [1] from Boris for btrfs which uses different
> > size classes to further parallelize placement. As of now it leaves out
> > ZNS drives, as this can clash with the MOZ/MAZ limits but once active
> > zone tracking is fully bug free^TM we should look into using these
> > allocator hints for ZNS as well.
> >
> 
> That looks like a great start!
> 
> Via that patch series I also found Josef's fsperf repo [1], which is
> exactly what I have
> been looking for: a set of common tests for file system performance. I hope that
> it can be extended with longer-running tests doing several disk overwrites with
> application-like workloads.

It should be relatively straightforward to add more tests to fsperf and
we are happy to take new workloads! Also, feel free to shoot me any
questions you run into while working on it and I'm happy to help.

> 
> > The hot/cold data can be a 2nd placement hint, of cause, not just the
> > different size classes of an extent.
> 
> Yes. I'll dig into the patches and see if I can figure out how that
> could be done.

FWIW, I was working on reducing fragmentation/streamlining reclaim for
non-zoned btrfs. I have another patch set that I am still working on
which attempts to use a working-set concept to make placement
lifetime/lifecycle a bigger part of the btrfs allocator.

That patch set tries to make btrfs write faster in parallel, which may
run counter to what you are going for; I'm not sure. Also, I didn't
take advantage of the lifetime hints because I wanted it to help for the
general case, but that could be an interesting direction too!

If you're curious about that work, the current state of the patches is
in this branch:
https://github.com/kdave/btrfs-devel/compare/misc-next...boryas:linux:bg-ws
(Johannes, those are the patches I worked on after you noticed the
allocator being slow with many disks.)

Boris

> 
> Cheers,
> Hans
> 
> [1] https://github.com/josefbacik/fsperf

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [External] [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-06 13:41 ` [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices Hans Holmberg
  2023-02-06 14:24   ` Johannes Thumshirn
@ 2023-02-07 19:53   ` Viacheslav A.Dubeyko
  2023-02-08 17:13   ` Adam Manzanares
  2023-02-14 21:08   ` Joel Granados
  3 siblings, 0 replies; 12+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-02-07 19:53 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-fsdevel, jaegeuk, josef, Matias Bjørling,
	Damien Le Moal, Dennis Maisenbacher, Naohiro Aota,
	Johannes Thumshirn, Aravind Ramesh, Jørgen Hansen, mcgrof,
	javier, hch, a.manzanares, guokuankuan, j.granados



> On Feb 6, 2023, at 5:41 AM, Hans Holmberg <hans.holmberg@wdc.com> wrote:
> 
> Write amplification induced by garbage collection negatively impacts
> both the performance and the life time for storage devices.
> 
> With zoned storage now standardized for SMR hard drives
> and flash(both NVME and UFS) we have an interface that allows
> us to reduce this overhead by adapting file systems to do
> better data placement.
> 

I would love to join this discussion. I agree it's a very important topic and there is
room for significant improvement here.

> Background
> ----------
> 
> Zoned block devices enables the host to reduce the cost of
> reclaim/garbage collection/cleaning by exposing the media erase
> units as zones.
> 
> By filling up zones with data from files that will
> have roughly the same life span, garbage collection I/O
> can be minimized, reducing write amplification.
> Less disk I/O per user write.
> 
> Reduced amounts of garbage collection I/O improves
> user max read and write throughput and tail latencies, see [1].
> 
> Migrating out still-valid data to erase and reclaim unused
> capacity in e.g. NAND blocks has a significant performance
> cost. Unnecessarily moving data around also means that there
> will be more erase cycles per user write, reducing the life
> time of the media.
> 

Yes, it's true. This is why I am trying to eliminate GC activity in the SSDFS file system. :)

> Current state
> -------------
> 
> To enable the performance benefits of zoned block devices
> a file system needs to:
> 
> 1) Comply with the write restrictions associated to the
> zoned device model. 
> 
> 2) Make active choices when allocating file data into zones
> to minimize GC.
> 
> Out of the upstream file systems, btrfs and f2fs supports
> the zoned block device model. F2fs supports active data placement
> by separating cold from hot data which helps in reducing gc,
> but there is room for improvement.
> 

Yeah, but F2FS requires a conventional zone anyway because of its in-place update area.
I am not sure that F2FS can switch to a pure append-only mode.

> 
> There is still work to be done
> ------------------------------
> 

It's definitely a true statement. :)

> I've spent a fair amount of time benchmarking btrfs and f2fs
> on top of zoned block devices along with xfs, ext4 and other
> file systems using the conventional block interface
> and at least for modern applicationsm, doing log-structured
> flash-friendly writes, much can be improved. 
> 
> A good example of a flash-friendly workload is RocksDB [6]
> which both does append-only writes and has a good prediction model
> for the life time of its files (due to its lsm-tree based data structures)
> 
> For RocksDB workloads, the cost of garbage collection can be reduced
> by 3x if near-optimal data placement is done (at 80% capacity usage).
> This is based on comparing ZenFS[2], a zoned storage file system plugin
> for RocksDB, with f2fs, xfs, ext4 and btrfs.
> 
> I see no good reason why linux kernel file systems (at least f2fs & btrfs)
> could not play as nice with these workload as ZenFS does, by just allocating
> file data blocks in a better way.
> 

I think it's not an easy point. It could require painful on-disk layout modifications.

> In addition to ZenFS we also have flex-alloc [5].
> There are probably more data placement schemes for zoned storage out there.
> 
> I think wee need to implement a scheme that is general-purpose enough
> for in-kernel file systems to cover a wide range of use cases and workloads.
> 

Yeah, it's a great idea, but it could be really tough to implement, especially because
every file system has its own on-disk layout and architectural philosophy. So, having
a general-purpose scheme sounds very exciting, but it could be really tough to find a "global"
optimum that serves all file systems perfectly. Still, it could be worth a try. :)

> I brought this up at LPC last year[4], but we did not have much time
> for discussions.
> 
> What is missing
> ---------------
> 
> Data needs to be allocated to zones in a way that minimizes the need for
> reclaim. Best-effort placement decision making could be implemented to place
> files of similar life times into the same zones.
> 
> To do this, file systems would have to utilize some sort of hint to
> separate data into different life-time-buckets and map those to
> different zones.
> 
> There is a user ABI for hints available - the write-life-time hint interface
> that was introduced for streams [3]. F2FS is the only user of this currently.
> 
> BTRFS and other file systems with zoned support could make use of it too,
> but it is limited to four, relative, life time values which I'm afraid would be too limiting when multiple users share a disk.
> 
> Maybe the life time hints could be combined with process id to separate
> different workloads better, maybe we need something else. F2FS supports
> cold/hot data separation based on file extension, which is another solution.
> 

It's tricky, I assume. So it looks like a good discussion. As far as I can see, such a policy
can be implemented above any particular file system.

The file extension is not a stable basis: a file could have no extension at all, or the
extension could be wrong or not representative. And checking the extension at the
file system level sounds like breaking the file system philosophy.

Write-life-time hints sound tricky too, from my point of view. Not every application
can properly define the lifetime of its data. Also, the file system's allocation policy/model
heavily defines the distribution of data on the volume, and it is really tough to follow
a policy of distributing logical blocks among streams with different lifetimes.

> This is the first thing I'd like to discuss.
> 
> The second thing I'd like to discuss is testing and benchmarking, which
> is probably even more important and something that should be put into
> place first.
> 
> Testing/benchmarking
> --------------------
> 
> I think any improvements must be measurable, preferably without having to
> run live production application workloads.
> 
> Benchmarking and testing is generally hard to get right, and particularily hard
> when it comes to testing and benchmarking reclaim/garbage collection,
> so it would make sense to share the effort.
> 
> We should be able to use fio to model a bunch of application workloads
> that would benefit from data placement (lsm-tree based key-value database
> stores (e.g rocksdb, terarkdb), stream processing apps like Apache kafka)) .. 
> 
> Once we have a set of benchmarks that we collectively care about, I think we
> can work towards generic data placement methods with some level of
> confidence that it will actually work in practice.
> 
> Creating a repository with a bunch of reclaim/gc stress tests and benchmarks
> would be beneficial not only for kernel file systems but also for user-space
> and distributed file systems such as ceph.
> 

Yeah, simulating an aged volume (one that requires GC activity) is a pretty complicated task.

Thanks,
Slava.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-07 17:46       ` Boris Burkov
@ 2023-02-08  9:16         ` Hans Holmberg
  0 siblings, 0 replies; 12+ messages in thread
From: Hans Holmberg @ 2023-02-08  9:16 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Johannes Thumshirn, Hans Holmberg, linux-fsdevel, jaegeuk, josef,
	Matias Bjørling, Damien Le Moal, Dennis Maisenbacher,
	Naohiro Aota, Aravind Ramesh, Jørgen Hansen, mcgrof, javier,
	hch, a.manzanares, guokuankuan, viacheslav.dubeyko, j.granados

On Tue, Feb 7, 2023 at 6:46 PM Boris Burkov <boris@bur.io> wrote:
>
> On Tue, Feb 07, 2023 at 01:31:44PM +0100, Hans Holmberg wrote:
> > On Mon, Feb 6, 2023 at 3:24 PM Johannes Thumshirn
> > <Johannes.Thumshirn@wdc.com> wrote:
> > >
> > > On 06.02.23 14:41, Hans Holmberg wrote:
> > > > Out of the upstream file systems, btrfs and f2fs supports
> > > > the zoned block device model. F2fs supports active data placement
> > > > by separating cold from hot data which helps in reducing gc,
> > > > but there is room for improvement.
> > >
> > > FYI, there's a patchset [1] from Boris for btrfs which uses different
> > > size classes to further parallelize placement. As of now it leaves out
> > > ZNS drives, as this can clash with the MOZ/MAZ limits but once active
> > > zone tracking is fully bug free^TM we should look into using these
> > > allocator hints for ZNS as well.
> > >
> >
> > That looks like a great start!
> >
> > Via that patch series I also found Josef's fsperf repo [1], which is
> > exactly what I have
> > been looking for: a set of common tests for file system performance. I hope that
> > it can be extended with longer-running tests doing several disk overwrites with
> > application-like workloads.
>
> It should be relatively straightforward to add more tests to fsperf and
> we are happy to take new workloads! Also, feel free to shoot me any
> questions you run into while working on it and I'm happy to help.

Great, thanks!

>
> >
> > > The hot/cold data can be a 2nd placement hint, of cause, not just the
> > > different size classes of an extent.
> >
> > Yes. I'll dig into the patches and see if I can figure out how that
> > could be done.
>
> FWIW, I was working on reducing fragmentation/streamlining reclaim for
> non zoned btrfs. I have another patch set that I am still working on
> which attempts to use a working set concept to make placement
> lifetime/lifecycle a bigger part of the btrfs allocator.
>
> That patch set tries to make btrfs write faster in parallel, which may
> be against what you are going for, that I'm not sure of. Also, I didn't
> take advantage of the lifetime hints because I wanted it to help for the
> general case, but that could be an interesting direction too!

I'll need to dig into your patchset and look deeper into the btrfs allocator
code, to know for sure, but reducing fragmentation is great for zoned storage
in general.

Filling up zones with data from a single file is the easiest way to reduce
write amplification, and the optimal one from a reclaim perspective:
entire zone(s) can be reclaimed as soon as the file is deleted.

This works great for lsm-tree based workloads like RocksDB and should
work well for other applications using copy-on-write data structures with
configurable file sizes (like Apache Kafka [1], which uses 1 gigabyte log
segments by default).

When data from several files needs to be co-located in the same zone,
things get more complicated: we then have to match up data from files
that have roughly the same life span.

If the user can tell us about the expected data lifetime via a hint,
that is great. If the file system does not have that information, some
other heuristic is needed, like assuming that data written by different
processes or users/groups has different life spans. A more advanced
scheme, SepBIT [2], has been proposed for block storage, which may be
applicable to file system data as well.
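
To make that concrete, a purely hypothetical sketch of such a bucket
selection (not taken from any existing allocator; NR_LIFETIME_BUCKETS
and pick_bucket are made-up names) could look something like this,
using the inode write hint when present and falling back to separating
writers by process:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative only: map a write to one of N lifetime buckets, where
 * each bucket would feed its own set of open zones. */
#define NR_LIFETIME_BUCKETS 10

static unsigned int pick_bucket(unsigned int write_hint, pid_t writer)
{
	/* Hinted writes (RWH_WRITE_LIFE_NONE..EXTREME, values 1..5)
	 * each get their own bucket in the lower half. */
	if (write_hint)
		return (write_hint - 1) % (NR_LIFETIME_BUCKETS / 2);

	/* Unhinted writes: assume different processes write data with
	 * different life spans and spread them over the upper half. */
	return NR_LIFETIME_BUCKETS / 2 +
	       (unsigned int)writer % (NR_LIFETIME_BUCKETS / 2);
}

int main(void)
{
	printf("hinted (SHORT=2) write -> bucket %u\n",
	       pick_bucket(2, getpid()));
	printf("unhinted write from pid %d -> bucket %u\n",
	       (int)getpid(), pick_bucket(0, getpid()));
	return 0;
}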

Thanks,
Hans

[1] https://kafka.apache.org/documentation/#topicconfigs_segment.bytes
[2] http://adslab.cse.cuhk.edu.hk/pubs/tech_sepbit.pdf

> If you're curious about that work, the current state of the patches is
> in this branch:
> https://github.com/kdave/btrfs-devel/compare/misc-next...boryas:linux:bg-ws
> (Johannes, those are the patches I worked on after you noticed the
> allocator being slow with many disks.)
>
> Boris
>
> >
> > Cheers,
> > Hans
> >
> > [1] https://github.com/josefbacik/fsperf

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-06 13:41 ` [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices Hans Holmberg
  2023-02-06 14:24   ` Johannes Thumshirn
  2023-02-07 19:53   ` [External] " Viacheslav A.Dubeyko
@ 2023-02-08 17:13   ` Adam Manzanares
  2023-02-09 10:05     ` Hans Holmberg
  2023-02-14 21:08   ` Joel Granados
  3 siblings, 1 reply; 12+ messages in thread
From: Adam Manzanares @ 2023-02-08 17:13 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-fsdevel, jaegeuk, josef, Matias Bjørling,
	Damien Le Moal, Dennis Maisenbacher, Naohiro Aota,
	Johannes Thumshirn, Aravind Ramesh, Jørgen Hansen, mcgrof,
	javier, hch, guokuankuan, viacheslav.dubeyko, j.granados

On Mon, Feb 06, 2023 at 01:41:49PM +0000, Hans Holmberg wrote:
> Write amplification induced by garbage collection negatively impacts
> both the performance and the life time for storage devices.
> 
> With zoned storage now standardized for SMR hard drives
> and flash(both NVME and UFS) we have an interface that allows
> us to reduce this overhead by adapting file systems to do
> better data placement.

I would be interested in this discussion as well. Data placement on storage
media seems like a topic that is not going to go away any time soon. Interfaces
that are not tied to particular HW implementations seem like a longer term
approach to the issue. 

> 
> Background
> ----------
> 
> Zoned block devices enables the host to reduce the cost of
> reclaim/garbage collection/cleaning by exposing the media erase
> units as zones.
> 
> By filling up zones with data from files that will
> have roughly the same life span, garbage collection I/O
> can be minimized, reducing write amplification.
> Less disk I/O per user write.
> 
> Reduced amounts of garbage collection I/O improves
> user max read and write throughput and tail latencies, see [1].
> 
> Migrating out still-valid data to erase and reclaim unused
> capacity in e.g. NAND blocks has a significant performance
> cost. Unnecessarily moving data around also means that there
> will be more erase cycles per user write, reducing the life
> time of the media.
> 
> Current state
> -------------
> 
> To enable the performance benefits of zoned block devices
> a file system needs to:
> 
> 1) Comply with the write restrictions associated to the
> zoned device model. 
> 
> 2) Make active choices when allocating file data into zones
> to minimize GC.
> 
> Out of the upstream file systems, btrfs and f2fs supports
> the zoned block device model. F2fs supports active data placement
> by separating cold from hot data which helps in reducing gc,
> but there is room for improvement.
> 
> 
> There is still work to be done
> ------------------------------
> 
> I've spent a fair amount of time benchmarking btrfs and f2fs
> on top of zoned block devices along with xfs, ext4 and other
> file systems using the conventional block interface
> and at least for modern applicationsm, doing log-structured
> flash-friendly writes, much can be improved. 
> 
> A good example of a flash-friendly workload is RocksDB [6]
> which both does append-only writes and has a good prediction model
> for the life time of its files (due to its lsm-tree based data structures)
> 
> For RocksDB workloads, the cost of garbage collection can be reduced
> by 3x if near-optimal data placement is done (at 80% capacity usage).
> This is based on comparing ZenFS[2], a zoned storage file system plugin
> for RocksDB, with f2fs, xfs, ext4 and btrfs.
> 
> I see no good reason why linux kernel file systems (at least f2fs & btrfs)
> could not play as nice with these workload as ZenFS does, by just allocating
> file data blocks in a better way.
>

For RocksDB, one thing I have struggled with is the fact that RocksDB appears
to me to be a lightweight FS user. We expect much more from a kernel FS than what
RocksDB expects. There are multiple user-space FSes that are compatible with
RocksDB. How far should the kernel go to accommodate this use case?

> In addition to ZenFS we also have flex-alloc [5].
> There are probably more data placement schemes for zoned storage out there.
> 
> I think wee need to implement a scheme that is general-purpose enough
> for in-kernel file systems to cover a wide range of use cases and workloads.

This is the key point of the work IMO. I would hope to hear more use cases and
make sure that the demand comes from potential users of the API.

> 
> I brought this up at LPC last year[4], but we did not have much time
> for discussions.
> 
> What is missing
> ---------------
> 
> Data needs to be allocated to zones in a way that minimizes the need for
> reclaim. Best-effort placement decision making could be implemented to place
> files of similar life times into the same zones.
> 
> To do this, file systems would have to utilize some sort of hint to
> separate data into different life-time-buckets and map those to
> different zones.
> 
> There is a user ABI for hints available - the write-life-time hint interface
> that was introduced for streams [3]. F2FS is the only user of this currently.
> 
> BTRFS and other file systems with zoned support could make use of it too,
> but it is limited to four, relative, life time values which I'm afraid would be too limiting when multiple users share a disk.
> 

I noticed F2FS uses only two of the four values ATM. I would like to hear more
from F2FS users who use these hints as to what the impact of using the hints is.

> Maybe the life time hints could be combined with process id to separate
> different workloads better, maybe we need something else. F2FS supports
> cold/hot data separation based on file extension, which is another solution.
> 
> This is the first thing I'd like to discuss.
> 
> The second thing I'd like to discuss is testing and benchmarking, which
> is probably even more important and something that should be put into
> place first.
> 
> Testing/benchmarking
> --------------------
> 
> I think any improvements must be measurable, preferably without having to
> run live production application workloads.
> 
> Benchmarking and testing is generally hard to get right, and particularily hard
> when it comes to testing and benchmarking reclaim/garbage collection,
> so it would make sense to share the effort.
> 
> We should be able to use fio to model a bunch of application workloads
> that would benefit from data placement (lsm-tree based key-value database
> stores (e.g rocksdb, terarkdb), stream processing apps like Apache kafka)) .. 

Should we just skip fio and run benchmarks on top of RocksDB and Kafka? I was
looking at mmtests recently and noticed that it downloads the mm-relevant
applications and runs the chosen benchmarks.

> 
> Once we have a set of benchmarks that we collectively care about, I think we
> can work towards generic data placement methods with some level of
> confidence that it will actually work in practice.
> 
> Creating a repository with a bunch of reclaim/gc stress tests and benchmarks
> would be beneficial not only for kernel file systems but also for user-space
> and distributed file systems such as ceph.

This would be very valuable, ideally with input from consumers of the data
placement APIs.

> 
> Thanks,
> Hans
> 
> [1] https://www.usenix.org/system/files/atc21-bjorling.pdf
> [2] https://github.com/westerndigitalcorporation/zenfs
> [3] https://lwn.net/Articles/726477/
> [4] https://lpc.events/event/16/contributions/1231/
> [5] https://github.com/OpenMPDK/FlexAlloc
> [6] https://github.com/facebook/rocksdb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-08 17:13   ` Adam Manzanares
@ 2023-02-09 10:05     ` Hans Holmberg
  2023-02-09 10:22       ` Johannes Thumshirn
  2023-02-14 21:05       ` Joel Granados
  0 siblings, 2 replies; 12+ messages in thread
From: Hans Holmberg @ 2023-02-09 10:05 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Hans Holmberg, linux-fsdevel, jaegeuk, josef,
	Matias Bjørling, Damien Le Moal, Dennis Maisenbacher,
	Naohiro Aota, Johannes Thumshirn, Aravind Ramesh,
	Jørgen Hansen, mcgrof, javier, hch, guokuankuan,
	viacheslav.dubeyko, j.granados

On Wed, Feb 8, 2023 at 6:13 PM Adam Manzanares <a.manzanares@samsung.com> wrote:
>
> On Mon, Feb 06, 2023 at 01:41:49PM +0000, Hans Holmberg wrote:
> > Write amplification induced by garbage collection negatively impacts
> > both the performance and the life time for storage devices.
> >
> > With zoned storage now standardized for SMR hard drives
> > and flash(both NVME and UFS) we have an interface that allows
> > us to reduce this overhead by adapting file systems to do
> > better data placement.
>
> I would be interested in this discussion as well. Data placement on storage
> media seems like a topic that is not going to go away any time soon. Interfaces
> that are not tied to particular HW implementations seem like a longer term
> approach to the issue.
>
> >
> > Background
> > ----------
> >
> > Zoned block devices enables the host to reduce the cost of
> > reclaim/garbage collection/cleaning by exposing the media erase
> > units as zones.
> >
> > By filling up zones with data from files that will
> > have roughly the same life span, garbage collection I/O
> > can be minimized, reducing write amplification.
> > Less disk I/O per user write.
> >
> > Reduced amounts of garbage collection I/O improves
> > user max read and write throughput and tail latencies, see [1].
> >
> > Migrating out still-valid data to erase and reclaim unused
> > capacity in e.g. NAND blocks has a significant performance
> > cost. Unnecessarily moving data around also means that there
> > will be more erase cycles per user write, reducing the life
> > time of the media.
> >
> > Current state
> > -------------
> >
> > To enable the performance benefits of zoned block devices
> > a file system needs to:
> >
> > 1) Comply with the write restrictions associated to the
> > zoned device model.
> >
> > 2) Make active choices when allocating file data into zones
> > to minimize GC.
> >
> > Out of the upstream file systems, btrfs and f2fs supports
> > the zoned block device model. F2fs supports active data placement
> > by separating cold from hot data which helps in reducing gc,
> > but there is room for improvement.
> >
> >
> > There is still work to be done
> > ------------------------------
> >
> > I've spent a fair amount of time benchmarking btrfs and f2fs
> > on top of zoned block devices along with xfs, ext4 and other
> > file systems using the conventional block interface
> > and at least for modern applicationsm, doing log-structured
> > flash-friendly writes, much can be improved.
> >
> > A good example of a flash-friendly workload is RocksDB [6]
> > which both does append-only writes and has a good prediction model
> > for the life time of its files (due to its lsm-tree based data structures)
> >
> > For RocksDB workloads, the cost of garbage collection can be reduced
> > by 3x if near-optimal data placement is done (at 80% capacity usage).
> > This is based on comparing ZenFS[2], a zoned storage file system plugin
> > for RocksDB, with f2fs, xfs, ext4 and btrfs.
> >
> > I see no good reason why linux kernel file systems (at least f2fs & btrfs)
> > could not play as nice with these workload as ZenFS does, by just allocating
> > file data blocks in a better way.
> >
>
> For RocksDB one thing I have struggled with is the fact that RocksDB appears
> to me as a lightweight FS user. We expect much more from kernel FS than what
> RocksDB expects. There are multiple user space FS that are compatible with
> RocksDB. How far should the kernel go to accomodate this use case?
>
> > In addition to ZenFS we also have flex-alloc [5].
> > There are probably more data placement schemes for zoned storage out there.
> >
> > I think we need to implement a scheme that is general-purpose enough
> > for in-kernel file systems to cover a wide range of use cases and workloads.
>
> This is the key point of the work IMO. I would hope to hear more use cases and
> make sure that the demand comes from potential users of the API.
>
> >
> > I brought this up at LPC last year[4], but we did not have much time
> > for discussions.
> >
> > What is missing
> > ---------------
> >
> > Data needs to be allocated to zones in a way that minimizes the need for
> > reclaim. Best-effort placement decision making could be implemented to place
> > files of similar life times into the same zones.
> >
> > To do this, file systems would have to utilize some sort of hint to
> > separate data into different life-time-buckets and map those to
> > different zones.
> >
> > There is a user ABI for hints available - the write-life-time hint interface
> > that was introduced for streams [3]. F2FS is the only user of this currently.
> >
> > BTRFS and other file systems with zoned support could make use of it too,
> > but it is limited to four, relative, life time values which I'm afraid would be too limiting when multiple users share a disk.
> >
>
> I noticed F2FS uses only two of the four values ATM. I would like to hear more
> from F2FS users who use these hints as to what the impact of using the hints is.
>
> > Maybe the life time hints could be combined with process id to separate
> > different workloads better, maybe we need something else. F2FS supports
> > cold/hot data separation based on file extension, which is another solution.
> >
> > This is the first thing I'd like to discuss.
> >
> > The second thing I'd like to discuss is testing and benchmarking, which
> > is probably even more important and something that should be put into
> > place first.
> >
> > Testing/benchmarking
> > --------------------
> >
> > I think any improvements must be measurable, preferably without having to
> > run live production application workloads.
> >
> > Benchmarking and testing is generally hard to get right, and particularily hard
> > when it comes to testing and benchmarking reclaim/garbage collection,
> > so it would make sense to share the effort.
> >
> > We should be able to use fio to model a bunch of application workloads
> > that would benefit from data placement (lsm-tree based key-value database
> > stores (e.g rocksdb, terarkdb), stream processing apps like Apache kafka)) ..
>
> Should we just skip fio and run benchmarks on top of rocksDB and kafka? I was
> looking at mmtests recently and noticed that it goes and downloads mm relevant
> applications and runs benchmarks on the chose benchmarks.

It takes a significant amount of time and trouble to build, run and understand
benchmarks for these applications. Modeling the workloads using fio
minimizes the set-up work and would enable more developers to actually
run these things. The workload definitions could also help developers
understand what sort of IO these use cases generate.

There is already one mixed-lifetime benchmark in fsperf [7]; more
could probably be added.
I'm looking into adding an lsm-tree workload.

Full, automated application benchmarks (db_bench, sysbench, ...) would
be great to have as well, of course.

[7] https://github.com/josefbacik/fsperf/blob/master/frag_tests/mixed-lifetimes.fio

Cheers,
Hans

>
> >
> > Once we have a set of benchmarks that we collectively care about, I think we
> > can work towards generic data placement methods with some level of
> > confidence that it will actually work in practice.
> >
> > Creating a repository with a bunch of reclaim/gc stress tests and benchmarks
> > would be beneficial not only for kernel file systems but also for user-space
> > and distributed file systems such as ceph.
>
> This would be very valuable. Ideally with input from consumers of the data
> placement APIS.
>
> >
> > Thanks,
> > Hans
> >
> > [1] https://www.usenix.org/system/files/atc21-bjorling.pdf
> > [2] https://github.com/westerndigitalcorporation/zenfs
> > [3] https://lwn.net/Articles/726477/
> > [4] https://lpc.events/event/16/contributions/1231/
> > [5] https://github.com/OpenMPDK/FlexAlloc
> > [6] https://github.com/facebook/rocksdb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-09 10:05     ` Hans Holmberg
@ 2023-02-09 10:22       ` Johannes Thumshirn
  2023-02-17 14:33         ` Jan Kara
  2023-02-14 21:05       ` Joel Granados
  1 sibling, 1 reply; 12+ messages in thread
From: Johannes Thumshirn @ 2023-02-09 10:22 UTC (permalink / raw)
  To: Hans Holmberg, Adam Manzanares
  Cc: Hans Holmberg, linux-fsdevel, jaegeuk, josef,
	Matias Bjørling, Damien Le Moal, Dennis Maisenbacher,
	Naohiro Aota, Aravind Ramesh, Jørgen Hansen, mcgrof, javier,
	hch, guokuankuan, viacheslav.dubeyko, j.granados

On 09.02.23 11:06, Hans Holmberg wrote:
> It takes a significant amount of time and trouble to build, run and understand
> benchmarks for these applications. Modeling the workloads using fio
> minimizes the set-up work and would enable more developers to actually
> run these things. The workload definitions could also help developers
> understanding what sort of IO that these use cases generate.

True, but I think Adam has a point here. IIRC mmtests comes with some scripts
to download, build and run the desired applications and then do the maths.

In this day and age people would probably want to use a container with the
application inside and some automation around it to run the benchmark and 
present the results.

Don't get me wrong. Simple fio snippets are definitely what I would prefer
as well, but creating these does require a lot of insight into the inner
workings of the desired user-space workload. Of course, once that is achieved we
can easily add such a workload to fsperf and have some sort of CI testing for
it to make sure we don't regress here.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-09 10:05     ` Hans Holmberg
  2023-02-09 10:22       ` Johannes Thumshirn
@ 2023-02-14 21:05       ` Joel Granados
  1 sibling, 0 replies; 12+ messages in thread
From: Joel Granados @ 2023-02-14 21:05 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: Adam Manzanares, Hans Holmberg, linux-fsdevel, jaegeuk, josef,
	Matias Bjørling, Damien Le Moal, Dennis Maisenbacher,
	Naohiro Aota, Johannes Thumshirn, Aravind Ramesh,
	Jørgen Hansen, mcgrof, javier, hch, guokuankuan,
	viacheslav.dubeyko


On Thu, Feb 09, 2023 at 11:05:44AM +0100, Hans Holmberg wrote:
> On Wed, Feb 8, 2023 at 6:13 PM Adam Manzanares <a.manzanares@samsung.com> wrote:
> >
> > On Mon, Feb 06, 2023 at 01:41:49PM +0000, Hans Holmberg wrote:
> > > Write amplification induced by garbage collection negatively impacts
> > > both the performance and the life time for storage devices.
> > >
> > > With zoned storage now standardized for SMR hard drives
> > > and flash(both NVME and UFS) we have an interface that allows
> > > us to reduce this overhead by adapting file systems to do
> > > better data placement.
> >
> > I would be interested in this discussion as well. Data placement on storage
> > media seems like a topic that is not going to go away any time soon. Interfaces
> > that are not tied to particular HW implementations seem like a longer term
> > approach to the issue.
> >
> > >
> > > Background
> > > ----------
> > >
> > > Zoned block devices enables the host to reduce the cost of
> > > reclaim/garbage collection/cleaning by exposing the media erase
> > > units as zones.
> > >
> > > By filling up zones with data from files that will
> > > have roughly the same life span, garbage collection I/O
> > > can be minimized, reducing write amplification.
> > > Less disk I/O per user write.
> > >
> > > Reduced amounts of garbage collection I/O improves
> > > user max read and write throughput and tail latencies, see [1].
> > >
> > > Migrating out still-valid data to erase and reclaim unused
> > > capacity in e.g. NAND blocks has a significant performance
> > > cost. Unnecessarily moving data around also means that there
> > > will be more erase cycles per user write, reducing the life
> > > time of the media.
> > >
> > > Current state
> > > -------------
> > >
> > > To enable the performance benefits of zoned block devices
> > > a file system needs to:
> > >
> > > 1) Comply with the write restrictions associated to the
> > > zoned device model.
> > >
> > > 2) Make active choices when allocating file data into zones
> > > to minimize GC.
> > >
> > > Out of the upstream file systems, btrfs and f2fs supports
> > > the zoned block device model. F2fs supports active data placement
> > > by separating cold from hot data which helps in reducing gc,
> > > but there is room for improvement.
> > >
> > >
> > > There is still work to be done
> > > ------------------------------
> > >
> > > I've spent a fair amount of time benchmarking btrfs and f2fs
> > > on top of zoned block devices along with xfs, ext4 and other
> > > file systems using the conventional block interface
> > > and at least for modern applicationsm, doing log-structured
> > > flash-friendly writes, much can be improved.
> > >
> > > A good example of a flash-friendly workload is RocksDB [6]
> > > which both does append-only writes and has a good prediction model
> > > for the life time of its files (due to its lsm-tree based data structures)
> > >
> > > For RocksDB workloads, the cost of garbage collection can be reduced
> > > by 3x if near-optimal data placement is done (at 80% capacity usage).
> > > This is based on comparing ZenFS[2], a zoned storage file system plugin
> > > for RocksDB, with f2fs, xfs, ext4 and btrfs.
> > >
> > > I see no good reason why linux kernel file systems (at least f2fs & btrfs)
> > > could not play as nice with these workload as ZenFS does, by just allocating
> > > file data blocks in a better way.
> > >
> >
> > For RocksDB one thing I have struggled with is the fact that RocksDB appears
> > to me as a lightweight FS user. We expect much more from kernel FS than what
> > RocksDB expects. There are multiple user space FS that are compatible with
> > RocksDB. How far should the kernel go to accomodate this use case?
> >
> > > In addition to ZenFS we also have flex-alloc [5].
> > > There are probably more data placement schemes for zoned storage out there.
> > >
> > > I think we need to implement a scheme that is general-purpose enough
> > > for in-kernel file systems to cover a wide range of use cases and workloads.
> >
> > This is the key point of the work IMO. I would hope to hear more use cases and
> > make sure that the demand comes from potential users of the API.
> >
> > >
> > > I brought this up at LPC last year[4], but we did not have much time
> > > for discussions.
> > >
> > > What is missing
> > > ---------------
> > >
> > > Data needs to be allocated to zones in a way that minimizes the need for
> > > reclaim. Best-effort placement decision making could be implemented to place
> > > files of similar life times into the same zones.
> > >
> > > To do this, file systems would have to utilize some sort of hint to
> > > separate data into different life-time-buckets and map those to
> > > different zones.
> > >
> > > There is a user ABI for hints available - the write-life-time hint interface
> > > that was introduced for streams [3]. F2FS is the only user of this currently.
> > >
> > > BTRFS and other file systems with zoned support could make use of it too,
> > > but it is limited to four, relative, life time values which I'm afraid would be too limiting when multiple users share a disk.
> > >
> >
> > I noticed F2FS uses only two of the four values ATM. I would like to hear more
> > from F2FS users who use these hints as to what the impact of using the hints is.
> >
> > > Maybe the life time hints could be combined with process id to separate
> > > different workloads better, maybe we need something else. F2FS supports
> > > cold/hot data separation based on file extension, which is another solution.
> > >
> > > This is the first thing I'd like to discuss.
> > >
> > > The second thing I'd like to discuss is testing and benchmarking, which
> > > is probably even more important and something that should be put into
> > > place first.
> > >
> > > Testing/benchmarking
> > > --------------------
> > >
> > > I think any improvements must be measurable, preferably without having to
> > > run live production application workloads.
> > >
> > > Benchmarking and testing is generally hard to get right, and particularily hard
> > > when it comes to testing and benchmarking reclaim/garbage collection,
> > > so it would make sense to share the effort.
> > >
> > > We should be able to use fio to model a bunch of application workloads
> > > that would benefit from data placement (lsm-tree based key-value database
> > > stores (e.g rocksdb, terarkdb), stream processing apps like Apache kafka)) ..
> >
> > Should we just skip fio and run benchmarks on top of rocksDB and kafka? I was
> > looking at mmtests recently and noticed that it goes and downloads mm relevant
> > applications and runs benchmarks on the chose benchmarks.
> 
> It takes a significant amount of time and trouble to build, run and understand
> benchmarks for these applications. Modeling the workloads using fio
> minimizes the set-up work and would enable more developers to actually
> run these things. The workload definitions could also help developers
> understanding what sort of IO that these use cases generate.
> 
> There is already one mixed-lifetime benchmark in fsperf [7], more
> could probably be added.
> I'm looking into adding a lsm-tree workload.
> 
> Full, automated, application benchmarks(db_bech, sysbench, ..) would
> be great to have as well of course.

I think the two complement each other. Fio has a very nice property,
which is that you control practically every IO parameter imaginable, so
it can be used to test specific things. However, there is always the
question of how it will behave out in the "real world", and things like
sysbench, db_bench/RocksDB, and Kafka give insight into that behavior.


> 
> [7] https://github.com/josefbacik/fsperf/blob/master/frag_tests/mixed-lifetimes.fio
> 
> Cheers,
> Hans
> 
> >
> > >
> > > Once we have a set of benchmarks that we collectively care about, I think we
> > > can work towards generic data placement methods with some level of
> > > confidence that it will actually work in practice.
> > >
> > > Creating a repository with a bunch of reclaim/gc stress tests and benchmarks
> > > would be beneficial not only for kernel file systems but also for user-space
> > > and distributed file systems such as ceph.
> >
> > This would be very valuable. Ideally with input from consumers of the data
> > placement APIS.
> >
> > >
> > > Thanks,
> > > Hans
> > >
> > > [1] https://www.usenix.org/system/files/atc21-bjorling.pdf
> > > [2] https://github.com/westerndigitalcorporation/zenfs
> > > [3] https://lwn.net/Articles/726477/
> > > [4] https://lpc.events/event/16/contributions/1231/
> > > [5] https://github.com/OpenMPDK/FlexAlloc
> > > [6] https://github.com/facebook/rocksdb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-06 13:41 ` [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices Hans Holmberg
                     ` (2 preceding siblings ...)
  2023-02-08 17:13   ` Adam Manzanares
@ 2023-02-14 21:08   ` Joel Granados
  3 siblings, 0 replies; 12+ messages in thread
From: Joel Granados @ 2023-02-14 21:08 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-fsdevel, jaegeuk, josef, Matias Bjørling,
	Damien Le Moal, Dennis Maisenbacher, Naohiro Aota,
	Johannes Thumshirn, Aravind Ramesh, Jørgen Hansen, mcgrof,
	javier, hch, a.manzanares, guokuankuan, viacheslav.dubeyko

On Mon, Feb 06, 2023 at 01:41:49PM +0000, Hans Holmberg wrote:
> Write amplification induced by garbage collection negatively impacts
> both the performance and the life time for storage devices.
> 
> With zoned storage now standardized for SMR hard drives
> and flash(both NVME and UFS) we have an interface that allows
> us to reduce this overhead by adapting file systems to do
> better data placement.
I'm also very interested in discussions related to data placement and
would like to take part in this one.

> 
> Background
> ----------
> 
> Zoned block devices enables the host to reduce the cost of
> reclaim/garbage collection/cleaning by exposing the media erase
> units as zones.
> 
> By filling up zones with data from files that will
> have roughly the same life span, garbage collection I/O
> can be minimized, reducing write amplification.
> Less disk I/O per user write.
> 
> Reduced amounts of garbage collection I/O improves
> user max read and write throughput and tail latencies, see [1].
> 
> Migrating out still-valid data to erase and reclaim unused
> capacity in e.g. NAND blocks has a significant performance
> cost. Unnecessarily moving data around also means that there
> will be more erase cycles per user write, reducing the life
> time of the media.
> 
> Current state
> -------------
> 
> To enable the performance benefits of zoned block devices
> a file system needs to:
> 
> 1) Comply with the write restrictions associated to the
> zoned device model. 
> 
> 2) Make active choices when allocating file data into zones
> to minimize GC.
> 
> Out of the upstream file systems, btrfs and f2fs supports
> the zoned block device model. F2fs supports active data placement
> by separating cold from hot data which helps in reducing gc,
> but there is room for improvement.
> 
> 
> There is still work to be done
> ------------------------------
> 
> I've spent a fair amount of time benchmarking btrfs and f2fs
> on top of zoned block devices along with xfs, ext4 and other
> file systems using the conventional block interface
> and at least for modern applicationsm, doing log-structured
> flash-friendly writes, much can be improved. 
> 
> A good example of a flash-friendly workload is RocksDB [6]
> which both does append-only writes and has a good prediction model
> for the life time of its files (due to its lsm-tree based data structures)
> 
> For RocksDB workloads, the cost of garbage collection can be reduced
> by 3x if near-optimal data placement is done (at 80% capacity usage).
> This is based on comparing ZenFS[2], a zoned storage file system plugin
> for RocksDB, with f2fs, xfs, ext4 and btrfs.
> 
> I see no good reason why linux kernel file systems (at least f2fs & btrfs)
> could not play as nice with these workload as ZenFS does, by just allocating
> file data blocks in a better way.
> 
> In addition to ZenFS we also have flex-alloc [5].
> There are probably more data placement schemes for zoned storage out there.
> 
> I think wee need to implement a scheme that is general-purpose enough
> for in-kernel file systems to cover a wide range of use cases and workloads.
> 
> I brought this up at LPC last year[4], but we did not have much time
> for discussions.
> 
> What is missing
> ---------------
> 
> Data needs to be allocated to zones in a way that minimizes the need for
> reclaim. Best-effort placement decision making could be implemented to place
> files of similar life times into the same zones.
> 
> To do this, file systems would have to utilize some sort of hint to
> separate data into different life-time-buckets and map those to
> different zones.
> 
> There is a user ABI for hints available - the write-life-time hint interface
> that was introduced for streams [3]. F2FS is the only user of this currently.
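
For anyone who hasn't used it: the ABI is tiny. A minimal userspace
sketch (assuming Linux >= 4.13 and a glibc recent enough to expose
F_SET_RW_HINT and the RWH_WRITE_LIFE_* constants; the file name is made
up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Made-up file name; think of an lsm-tree WAL that is soon deleted. */
	int fd = open("wal.log", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Per-inode lifetime hint; subsequent writes to this file inherit
	 * it. f2fs can map it to its hot/warm/cold logs, other file
	 * systems currently ignore it.
	 */
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("fcntl(F_SET_RW_HINT)");

	if (write(fd, "payload\n", 8) < 0)
		perror("write");

	close(fd);
	return 0;
}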
> 
> BTRFS and other file systems with zoned support could make use of it too,
> but it is limited to four relative life time values, which I'm afraid
> would be too limiting when multiple users share a disk.
> 
> Maybe the life time hints could be combined with process id to separate
> different workloads better, or maybe we need something else. F2FS supports
> cold/hot data separation based on file extension, which is another solution.
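
One strawman for the hint-plus-pid idea, purely as an illustration
(nothing below is an existing kernel interface; the bucket count and the
hash are arbitrary): keep the four lifetime classes strictly separated
and let the submitting process only pick a bucket within its class.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical numbers: 4 lifetime classes spread over 16 placement
 * buckets (think concurrently open zones). */
#define NR_BUCKETS		16
#define NR_LIFETIME_CLASSES	4

static unsigned int pick_bucket(unsigned int lifetime_class, uint32_t tgid)
{
	unsigned int per_class = NR_BUCKETS / NR_LIFETIME_CLASSES;
	/* Knuth multiplicative hash of the writer's pid, to spread
	 * unrelated writers of the same class across that class' buckets. */
	unsigned int spread = (tgid * 2654435761u >> 16) % per_class;

	/* Lifetime classes never share a bucket; the pid only selects a
	 * bucket within the class. */
	return lifetime_class * per_class + spread;
}

int main(void)
{
	printf("pid 1234, short -> bucket %u\n", pick_bucket(0, 1234));
	printf("pid 4321, short -> bucket %u\n", pick_bucket(0, 4321));
	printf("pid 1234, long  -> bucket %u\n", pick_bucket(2, 1234));
	return 0;
}

Whether the process id is even the right key here (as opposed to, say,
an explicit stream id as in [3]) is of course part of the question.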
> 
> This is the first thing I'd like to discuss.
> 
> The second thing I'd like to discuss is testing and benchmarking, which
> is probably even more important and something that should be put into
> place first.
> 
> Testing/benchmarking
> --------------------
> 
> I think any improvements must be measurable, preferably without having to
> run live production application workloads.
> 
> Benchmarking and testing is generally hard to get right, and particularly hard
> when it comes to testing and benchmarking reclaim/garbage collection,
> so it would make sense to share the effort.
> 
> We should be able to use fio to model a bunch of application workloads
> that would benefit from data placement (lsm-tree based key-value database
> stores (e.g. rocksdb, terarkdb), stream processing apps like Apache Kafka).
> 
> Once we have a set of benchmarks that we collectively care about, I think we
> can work towards generic data placement methods with some level of
> confidence that it will actually work in practice.
> 
> Creating a repository with a bunch of reclaim/gc stress tests and benchmarks
> would be beneficial not only for kernel file systems but also for user-space
> and distributed file systems such as ceph.
> 
> Thanks,
> Hans
> 
> [1] https://www.usenix.org/system/files/atc21-bjorling.pdf
> [2] https://github.com/westerndigitalcorporation/zenfs
> [3] https://lwn.net/Articles/726477/
> [4] https://lpc.events/event/16/contributions/1231/
> [5] https://github.com/OpenMPDK/FlexAlloc
> [6] https://github.com/facebook/rocksdb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
  2023-02-09 10:22       ` Johannes Thumshirn
@ 2023-02-17 14:33         ` Jan Kara
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2023-02-17 14:33 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Hans Holmberg, Adam Manzanares, Hans Holmberg, linux-fsdevel,
	jaegeuk, josef, Matias Bjørling, Damien Le Moal,
	Dennis Maisenbacher, Naohiro Aota, Aravind Ramesh,
	Jørgen Hansen, mcgrof, javier, hch, guokuankuan,
	viacheslav.dubeyko, j.granados

On Thu 09-02-23 10:22:31, Johannes Thumshirn wrote:
> On 09.02.23 11:06, Hans Holmberg wrote:
> > It takes a significant amount of time and trouble to build, run and understand
> > benchmarks for these applications. Modeling the workloads using fio
> > minimizes the set-up work and would enable more developers to actually
> > run these things. The workload definitions could also help developers
> > understand what sort of IO these use cases generate.
> 
> True, but I think Adam has a point here. IIRC mmtests comes with some scripts
> to download, build and run the desired applications and then do the maths.
> 
> In this day and age people would probably want to use a container with the
> application inside and some automation around it to run the benchmark and 
> present the results.

Yeah, although containers also do have some impact on the benchmark
behavior (not that much for IO, but it is more visible for scheduling and
memory management), so bare-metal testing is still worthwhile. Mmtests
actually already have some support for VM testing and we have implemented
basic container testing just recently. There are still rough edges and more
work is needed, but mmtests are also moving into the next decade ;).

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-02-17 14:33 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20230206134200uscas1p20382220d7fc10c899b4c79e01d94cf0b@uscas1p2.samsung.com>
2023-02-06 13:41 ` [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices Hans Holmberg
2023-02-06 14:24   ` Johannes Thumshirn
2023-02-07 12:31     ` Hans Holmberg
2023-02-07 17:46       ` Boris Burkov
2023-02-08  9:16         ` Hans Holmberg
2023-02-07 19:53   ` [External] " Viacheslav A.Dubeyko
2023-02-08 17:13   ` Adam Manzanares
2023-02-09 10:05     ` Hans Holmberg
2023-02-09 10:22       ` Johannes Thumshirn
2023-02-17 14:33         ` Jan Kara
2023-02-14 21:05       ` Joel Granados
2023-02-14 21:08   ` Joel Granados
