Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Jan Ziak <0xe2.0x9a.0x9b@gmail.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
Date: Fri, 11 Mar 2022 22:16:45 -0500
Message-ID: <YiwQnf933PMnhGKI@hungrycats.org>
In-Reply-To: <CAODFU0oWBvRkpM3oirpfitGiTex8=EST021egQzUiBCMYrhVVg@mail.gmail.com>

On Sat, Mar 12, 2022 at 01:01:36AM +0100, Jan Ziak wrote:
> On Sat, Mar 12, 2022 at 12:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > On 2022/3/12 07:28, Jan Ziak wrote:
> > > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > >> As stated before, autodefrag is not really that useful for database.
> > >
> > > Do you realize that you are claiming that btrfs autodefrag should not
> > > - by design - be effective in the case of high-fragmentation files?
> >
> > Unfortunately, that's exactly what I mean.
> >
> > We all know random writes would cause fragments, but autodefrag is not
> > like regular defrag ioctl, as it only scan newer extents.
> >
> > For example:
> >
> > Our autodefrag is required to defrag writes newer than gen 100, and our
> > inode has the following layout:
> >
> > |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---|
> >      Gen 50       Gen 101     Gen 49      Gen 30      Gen 30
> >
> > Then autodefrag will only try to defrag extent B and extent C.
> >
> > Extent B meets the generation requirement, and is mergable with the next
> > extent C.
> >
> > But all the remaining extents A, D, E will not be defragged as their
> > generations don't meet the requirement.
> >
> > While for regular defrag ioctl, we don't have such generation
> > requirement, and is able to defrag all extents from A to E.
> > (But cause way more IO).
> >
> > Furthermore, autodefrag works by marking the target range dirty, and
> > wait for writeback (and hopefully get more writes near it, so it can get
> > even larger)
> >
> > But if the application, like the database, is calling fsync()
> > frequently, such re-dirtied range is going to writeback almost
> > immediately, without any further chance to get merged larger.
> 
> So, basically, what you are saying is that you are refusing to work
> together towards fixing/improving the auto-defragmentation algorithm.
> 
> Based on your decision in this matter, I am now forced either to find
> a replacement filesystem with features similar to btrfs or to
> implement a filesystem (where auto-defragmentation works correctly)
> myself.

The second of those options is the TL;DR of my previous email, and
you don't need to rewrite any part of btrfs except the autodefrag feature.

I can answer questions to get you started.

You will need to read up on:

	TREE_SEARCH_V2, the search ioctl.  This gives you fast access to
	new extent refs.  You'll need to decode them.  The code in
	btrfs-progs for printing tree items is very useful to see how
	this is done.

	INO_PATHS, the resolve-inode-to-path-name ioctl.  TREE_SEARCH_V2
	will give you inode numbers, but DEFRAG_RANGE needs an open fd.
	This ioctl is the bridge between them.

	DEFRAG_RANGE, the defrag ioctl.  This defrags a range of a file.
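
To give an idea of what calling DEFRAG_RANGE from userspace involves, here
is a minimal Python sketch.  The struct layout and the ioctl number are my
reading of linux/btrfs.h combined with the generic _IOW() encoding used on
x86 and ARM, so check both against your kernel headers.  The helper names
(pack_defrag_args, defrag_range) are made up for illustration.

```python
import fcntl
import os
import struct

# Assumed layout of struct btrfs_ioctl_defrag_range_args (linux/btrfs.h):
# u64 start, u64 len, u64 flags, u32 extent_thresh, u32 compress_type,
# u32 unused[4].  "=" means native byte order with no padding.
DEFRAG_RANGE_ARGS = struct.Struct("=QQQII4I")

BTRFS_IOCTL_MAGIC = 0x94
_IOC_WRITE = 1


def _iow(magic, nr, size):
    # Generic Linux _IOW() encoding; some architectures use a different one.
    return (_IOC_WRITE << 30) | (size << 16) | (magic << 8) | nr


BTRFS_IOC_DEFRAG_RANGE = _iow(BTRFS_IOCTL_MAGIC, 16, DEFRAG_RANGE_ARGS.size)


def pack_defrag_args(start, length, extent_thresh, flags=0):
    # compress_type and the four unused words are left as zero.
    return DEFRAG_RANGE_ARGS.pack(start, length, flags, extent_thresh,
                                  0, 0, 0, 0, 0)


def defrag_range(path, start, length, extent_thresh):
    # Hypothetical wrapper: open the file and defrag one byte range of it.
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE,
                    pack_defrag_args(start, length, extent_thresh))
    finally:
        os.close(fd)
```

extent_thresh is the size above which the kernel treats an extent as
already defragmented, so it is where the per-file size limit would go.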

The simple daemon model is:

	- track the filesystem transid every 30 seconds, sleep until it changes

	- use the TREE_SEARCH_V2 ioctl to find new extent references since
	the previous transid.  See the 'btrfs sub find-new' implementation
	for details on extracting extent references and filtering by age.
	This has to be run on every subvol individually, but you can
	have a daemon for every subvol, or one process that runs this
	loop over all subvols.

	- examine extent references to see if they are good candidates
	for defrag:  not too large or too small, no holes between, etc.
	This is a replica of the existing kernel algorithm.  You can
	improve on this immediately by running new searches for
	neighboring extents within optimal defrag range without the
	transid filter.

	- ignore bad extent candidates

	- use INO_PATHS to retrieve the filenames that refer to the inode
	containing the extent.  You can improve on this by skipping
	files that are known to have extremely high update rates, or by
	any other criteria that seem useful.

	- open the file using one of the names, and issue DEFRAG_RANGE
	to defragment the extents.
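
As a rough illustration of the candidate-selection step, here is a Python
sketch that works on already-decoded extent references.  ExtentRef and
defrag_ranges are hypothetical names, and the rule it applies (contiguous
runs of small extents, cut off at about target_size, kept only if the run
contains something newer than the last transid) is one possible
interpretation of the model above, not the kernel's algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtentRef:
    offset: int      # file offset in bytes
    length: int      # extent length in bytes
    generation: int  # transid that wrote the extent


def defrag_ranges(extents, last_transid, target_size=256 * 1024):
    """Return (start, length) file ranges worth passing to DEFRAG_RANGE."""
    ranges = []
    run = []  # current run of contiguous small extents

    def flush():
        # Keep a run only if merging would help (two or more extents)
        # and at least one extent is newer than the last scan.
        if len(run) >= 2 and any(e.generation > last_transid for e in run):
            start = run[0].offset
            end = run[-1].offset + run[-1].length
            ranges.append((start, end - start))
        run.clear()

    for e in sorted(extents, key=lambda e: e.offset):
        if e.length >= target_size:
            flush()  # already large enough; it also ends the run
            continue
        if run and run[-1].offset + run[-1].length != e.offset:
            flush()  # hole or skipped extent: do not merge across it
        run.append(e)
        if sum(x.length for x in run) >= target_size:
            flush()  # run has reached the target size, start a new one
    flush()
    return ranges
```

With five contiguous 16K extents at generations 50, 101, 49, 30, 30 and
last_transid=100, this returns a single range covering all five, because
old neighbors are pulled in without the transid filter.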

If you store the last transid persistently (say in a /var file), you
can run one iteration of the loop periodically during periods of low
sensitivity to IO latency.  It doesn't need to run continuously; you
can start and stop it at any time depending on need.
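
A minimal sketch of the persistent transid, assuming a plain text file
holding one integer.  The function names and the choice to treat a missing
or unreadable file as transid 0 (i.e. scan everything) are assumptions for
illustration.

```python
import os
import tempfile


def load_last_transid(path):
    # A missing or corrupt state file means "start from the beginning".
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return 0


def save_last_transid(path, transid):
    # Write to a temporary file in the same directory, then rename over
    # the old state so a crash never leaves a half-written file behind.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(f"{transid}\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

Saving the transid only after an iteration has finished means an
interrupted run is simply repeated from the same point next time.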

There are a few gotchas.  The main one is that there's an upper bound on
optimal extent size in btrfs, as well as a lower bound.  Extents that
are too large waste space because no part of an extent can be deallocated
until the last reference to its last block is overwritten or deleted.  So you probably
want to stop defragmenting once the extents are 256K or so on a database
file, or it will waste a lot of space.  Use lower values for heavily
active files with random writes, higher values for infrequently
modified files.  Maximum extent size is 128K for a compressed extent,
128M for uncompressed.
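
To put numbers on the wasted space, here is a small Python calculation.
It assumes 4K blocks and a single reference to the extent; pinned_bytes
is a made-up helper, not anything btrfs provides.

```python
BLOCK = 4096  # assumed block size in bytes


def pinned_bytes(extent_size, live_blocks):
    """Bytes still allocated on disk but no longer used by the file."""
    if live_blocks * BLOCK > extent_size:
        raise ValueError("more live blocks than the extent can hold")
    if live_blocks == 0:
        return 0  # nothing references the extent, so it can be freed
    return extent_size - live_blocks * BLOCK


# One surviving 4K block in a 128M extent versus in a 256K extent.
worst_large = pinned_bytes(128 * 1024 * 1024, 1)
worst_small = pinned_bytes(256 * 1024, 1)
```

In the worst case a single surviving block keeps almost the whole extent
allocated, so the smaller the extent, the smaller the loss per random
overwrite.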

> Since I failed to persuade you that there are serious errors/mistakes
> in the current btrfs-autodefrag implementation, this is my last email
> in this whole forum thread.

> Sincerely
> Jan