From: "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com>
To: Hans Holmberg <hans.holmberg@wdc.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"jaegeuk@kernel.org" <jaegeuk@kernel.org>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"Matias Bjørling" <Matias.Bjorling@wdc.com>,
	"Damien Le Moal" <Damien.LeMoal@wdc.com>,
	"Dennis Maisenbacher" <dennis.maisenbacher@wdc.com>,
	"Naohiro Aota" <Naohiro.Aota@wdc.com>,
	"Johannes Thumshirn" <Johannes.Thumshirn@wdc.com>,
	"Aravind Ramesh" <Aravind.Ramesh@wdc.com>,
	"Jørgen Hansen" <Jorgen.Hansen@wdc.com>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"javier@javigon.com" <javier@javigon.com>,
	"hch@lst.de" <hch@lst.de>,
	"a.manzanares@samsung.com" <a.manzanares@samsung.com>,
	"guokuankuan@bytedance.com" <guokuankuan@bytedance.com>,
	"j.granados@samsung.com" <j.granados@samsung.com>
Subject: Re: [External] [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
Date: Tue, 7 Feb 2023 11:53:41 -0800	[thread overview]
Message-ID: <4130BE54-D7E4-4E8F-B6A3-844B815841AC@bytedance.com> (raw)
In-Reply-To: <20230206134148.GD6704@gsv>



> On Feb 6, 2023, at 5:41 AM, Hans Holmberg <hans.holmberg@wdc.com> wrote:
> 
> Write amplification induced by garbage collection negatively impacts
> both the performance and the lifetime of storage devices.
> 
> With zoned storage now standardized for SMR hard drives
> and flash (both NVMe and UFS), we have an interface that allows
> us to reduce this overhead by adapting file systems to do
> better data placement.
> 

I would love to join this discussion. I agree it's a very important topic, and there is
room for significant improvement here.

> Background
> ----------
> 
> Zoned block devices enable the host to reduce the cost of
> reclaim/garbage collection/cleaning by exposing the media erase
> units as zones.
> 
> By filling up zones with data from files that will
> have roughly the same life span, garbage collection I/O
> can be minimized, reducing write amplification.
> Less disk I/O per user write.
> 
> Reduced amounts of garbage collection I/O improves
> user max read and write throughput and tail latencies, see [1].
> 
> Migrating out still-valid data to erase and reclaim unused
> capacity in e.g. NAND blocks has a significant performance
> cost. Unnecessarily moving data around also means that there
> will be more erase cycles per user write, reducing the life
> time of the media.
> 

Yes, that's true. This is why I am trying to eliminate GC activity in the SSDFS file system. :)
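
Just as a rough back-of-the-envelope illustration (the numbers are purely hypothetical):
if data with mixed lifetimes is packed into zones and, on average, half of every zone is
still valid when it gets reclaimed, then each user write eventually costs about one extra
migration write, i.e. write amplification of roughly 1 / (1 - 0.5) = 2. If files with
similar lifetimes are grouped together, whole zones tend to become invalid at about the
same time and can be reset without migrating anything.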

> Current state
> -------------
> 
> To enable the performance benefits of zoned block devices
> a file system needs to:
> 
> 1) Comply with the write restrictions associated with the
> zoned device model. 
> 
> 2) Make active choices when allocating file data into zones
> to minimize GC.
> 
> Out of the upstream file systems, btrfs and f2fs support
> the zoned block device model. f2fs supports active data placement
> by separating cold from hot data, which helps in reducing GC,
> but there is room for improvement.
> 

Yeah, but F2FS requires a conventional zone anyway because of its in-place update area.
I am not sure that F2FS can switch to a pure append-only mode.

> 
> There is still work to be done
> ------------------------------
> 

It's definitely a true statement. :)

> I've spent a fair amount of time benchmarking btrfs and f2fs
> on top of zoned block devices, along with xfs, ext4 and other
> file systems using the conventional block interface, and at
> least for modern applications doing log-structured,
> flash-friendly writes, much can be improved.
> 
> A good example of a flash-friendly workload is RocksDB [6],
> which both does append-only writes and has a good prediction model
> for the life time of its files (due to its lsm-tree based data structures).
> 
> For RocksDB workloads, the cost of garbage collection can be reduced
> by 3x if near-optimal data placement is done (at 80% capacity usage).
> This is based on comparing ZenFS[2], a zoned storage file system plugin
> for RocksDB, with f2fs, xfs, ext4 and btrfs.
> 
> I see no good reason why linux kernel file systems (at least f2fs & btrfs)
> could not play as nicely with these workloads as ZenFS does, by just allocating
> file data blocks in a better way.
> 

I don't think this is an easy point. It could require painful on-disk layout modifications.

> In addition to ZenFS we also have flex-alloc [5].
> There are probably more data placement schemes for zoned storage out there.
> 
> I think we need to implement a scheme that is general-purpose enough
> for in-kernel file systems to cover a wide range of use cases and workloads.
> 

Yeah, it's a great idea, but it could be really tough to implement, especially because
every file system has its own very specific on-disk layout and architectural philosophy. So, having
a general-purpose scheme sounds very exciting, but it could be really hard to find a "global"
optimum that serves all file systems perfectly. Still, it could be worth a try. :)

> I brought this up at LPC last year[4], but we did not have much time
> for discussions.
> 
> What is missing
> ---------------
> 
> Data needs to be allocated to zones in a way that minimizes the need for
> reclaim. Best-effort placement decision making could be implemented to place
> files of similar life times into the same zones.
> 
> To do this, file systems would have to utilize some sort of hint to
> separate data into different life-time-buckets and map those to
> different zones.
> 
> There is a user ABI for hints available - the write-life-time hint interface
> that was introduced for streams [3]. F2FS is the only user of this currently.
> 
> BTRFS and other file systems with zoned support could make use of it too,
> but it is limited to four relative life-time values, which I'm afraid would be
> too limiting when multiple users share a disk.
> 
> Maybe the life time hints could be combined with process id to separate
> different workloads better, or maybe we need something else. F2FS supports
> cold/hot data separation based on file extension, which is another solution.
> 

It's tricky, I assume. So, it looks like a good topic for discussion. As far as I can see, such a policy
can be implemented above any particular file system.

A file extension is not a stable basis, because a file could have no extension at all,
or the extension can be wrong or simply not representative. And checking the extension at the
file system level sounds like breaking the file system philosophy.

Write-life-time hints sound tricky too, from my point of view. Not every application
can properly define the lifetime of its data. Also, the file system's allocation policy/model
heavily defines the distribution of data on the volume. And it is really tough to follow
a policy of distributing logical blocks among streams with different lifetimes.
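
That said, the ABI itself is simple to consume. Just to make the discussion concrete,
here is a minimal sketch (a hypothetical example, not taken from any real application;
the file name and the fallback #defines are only illustrative) of how an application
could tag a long-lived file through the write-life-time hint interface [3]:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
/* Fallback if the libc headers do not expose the hint ABI;
 * values as defined in include/uapi/linux/fcntl.h. */
#define F_SET_RW_HINT		(1024 + 12)
#define RWH_WRITE_LIFE_LONG	4
#endif

int main(void)
{
	uint64_t hint = RWH_WRITE_LIFE_LONG;	/* "this data will live long" */
	int fd = open("testfile.dat", O_CREAT | O_WRONLY, 0644);

	if (fd < 0)
		return 1;

	/* The hint is attached to the open file/inode; per the discussion
	 * above, f2fs is currently the only in-kernel consumer of it. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");

	/* ... append-only writes of long-lived data would follow here ... */
	close(fd);
	return 0;
}

But, as I wrote above, the hard part is not setting the hint. It is choosing the right
lifetime in the application and then honoring it in the file system's allocator.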

> This is the first thing I'd like to discuss.
> 
> The second thing I'd like to discuss is testing and benchmarking, which
> is probably even more important and something that should be put into
> place first.
> 
> Testing/benchmarking
> --------------------
> 
> I think any improvements must be measurable, preferably without having to
> run live production application workloads.
> 
> Benchmarking and testing are generally hard to get right, and particularly hard
> when it comes to testing and benchmarking reclaim/garbage collection,
> so it would make sense to share the effort.
> 
> We should be able to use fio to model a bunch of application workloads
> that would benefit from data placement: lsm-tree based key-value database
> stores (e.g. RocksDB, TerarkDB), stream processing apps like Apache Kafka, etc.
> 
> Once we have a set of benchmarks that we collectively care about, I think we
> can work towards generic data placement methods with some level of
> confidence that it will actually work in practice.
> 
> Creating a repository with a bunch of reclaim/gc stress tests and benchmarks
> would be beneficial not only for kernel file systems but also for user-space
> and distributed file systems such as ceph.
> 

Yeah, simulation of an aged volume (one that requires GC activity) is a pretty complicated task.
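
As a very first rough cut, something like the fio job below could model an lsm-tree-like
mix of short-lived, frequently synced WAL-style files and large, long-lived SST-style
files. This is purely a sketch: the mount point, job names, sizes and runtime are made
up, the write_hint option needs a fio build that supports it, and real compaction-driven
file deletion would still have to be scripted around fio.

[global]
directory=/mnt/testfs
ioengine=psync
rw=write
time_based=1
runtime=600

[wal-like]
; small, frequently synced, short-lived data
bs=4k
size=256m
nrfiles=4
fsync=32
write_hint=short

[sst-like]
; large, sequential, long-lived data
bs=1m
size=16g
nrfiles=16
write_hint=long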

Thanks,
Slava.

