From: Filipe Manana <fdmanana@kernel.org>
To: Boris Burkov <boris@bur.io>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 1/5] btrfs: 1G falloc extents
Date: Thu, 6 Oct 2022 20:56:03 +0100	[thread overview]
Message-ID: <CAL3q7H6Bn47CFL0tOsjCZ3iLgEPRm9_ZXV7duUSMZ2H-g0JhgQ@mail.gmail.com> (raw)
In-Reply-To: <Yz8gq6ErCqeMGUO1@zen>

On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov <boris@bur.io> wrote:
>
> On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> > On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > >
> > >
> > >
> > > On 2022/10/6 03:49, Boris Burkov wrote:
> > > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > > extents. Our data block groups are 1GiB, so 1GiB is a more natural
> > > > maximum extent size: that way we tend to allocate and fully use
> > > > block groups rather than fragmenting the file across many of them.
> > > >
> > > > This is especially useful if large fallocates tend to be for "round"
> > > > amounts, which strikes me as a reasonable assumption.
> > > >
> > > > While moving to size classes reduces the value of this change, it is
> > > > also good to compare potential allocator algorithms against just 1G
> > > > extents.
> > >
> > > Btrfs extent booking already causes a lot of wasted space, so is this
> > > larger extent size really a good idea?
> > >
> > > E.g. after a lot of random writes, we may have only a very small part
> > > of the original 1G still referenced.
> > > (The first write into the pre-allocated range will not be COWed, but
> > > the next one over the same range will be COWed.)
> > >
> > > But the full 1G can only be freed once none of its sectors is
> > > referenced. Thus this would make preallocated space much harder to
> > > free, and snapshots/reflinks can make it even worse.
> > >
> > > So wouldn't such an enlarged preallocated extent size cause more
> > > pressure?
> >
> > I agree, increasing the max extent size here does not seem like a good
> > thing to do.
> >
> > If an application fallocates space, then it generally expects to write to
> > all that space. However, future random partial writes may not rewrite the
> > entire extent for a very long time, forcing us to keep a 1G extent around
> > for a very long time (or forever in the worst case).
> >
> > Even for NOCOW files, it's still an issue if snapshots are used.
> >
>
> I see your point, and agree 1GiB is worse with respect to bookend
> extents. Since the patchset doesn't really rely on this, I don't mind
> dropping the change. I was mostly trying to rule this out as a trivial
> fix that would obviate the need for other changes.
>
> However, I'm not completely convinced by the argument, for two reasons.
>
> The first is essentially Qu's last comment. If you guys are right, then
> 256MiB is probably a really bad value for this as well, and we should be
> reaping the wins of making it smaller.

Well, that's not a reason to increase the size; quite the opposite.

>
> The second is that I'm not convinced the regression would actually
> happen here in practice.

Over the years, every now and then there are users reporting that their
free space is mysteriously going away and they have no clue why, even
when not using snapshots. More experienced users provide help and
eventually notice it's caused by many bookend extents, and the given
workaround is to defrag the files.

You can frequently see such reports in threads on this mailing list or
in the IRC channel.
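
For anyone who wants to reproduce the effect, below is a minimal
userspace sketch (illustrative only, not from this thread; it assumes a
COW file on a btrfs mount at /mnt/btrfs). After it runs, the first 1M of
the file points at a fresh extent, yet the original 256M extent stays
fully allocated on disk because the remaining 255M still references it;
keep repeating the overwrite pattern and more of that 256M becomes dead
but unfreeable space, which is what defragging the file cleans up.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/btrfs/file", O_CREAT | O_RDWR, 0644);
	char *buf;
	off_t off;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Preallocate 256M; btrfs books this as a single extent. */
	if (fallocate(fd, 0, 0, 256 << 20) < 0) {
		perror("fallocate");
		return 1;
	}

	buf = malloc(1 << 20);
	if (!buf)
		return 1;
	memset(buf, 0xaa, 1 << 20);

	/* The first write into the preallocated range is not COWed. */
	for (off = 0; off < (256 << 20); off += 1 << 20)
		pwrite(fd, buf, 1 << 20, off);
	fsync(fd);

	/*
	 * Overwriting part of the same range IS COWed: the new 1M lands
	 * in a fresh extent, but the original 256M extent stays fully
	 * allocated while any of its other sectors is still referenced
	 * (a "bookend" extent).
	 */
	pwrite(fd, buf, 1 << 20, 0);
	fsync(fd);

	close(fd);
	return 0;
}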

> Let's say someone does a 2GiB falloc and writes the file out once. In the
> old code that will be eight 256MiB extents; in the new code, two 1GiB
> extents. Then, for this to be a regression, the user would have to fully
> overwrite one of the 256MiB extents, but not a whole 1GiB one. Are there a
> lot of workloads that don't use nocow, and which randomly overwrite all of
> a 256MiB extent of a larger file? Maybe...

Assuming that all or most workloads that use fallocate also set nocow on
the files is quite a big stretch IMO. It's true that for performance
critical applications like traditional relational databases, people
usually set nocow, or the application or some other software does it
automatically for them.

But there are also many applications that use fallocate which people are
not aware of, where no one sets nocow and nothing else sets it
automatically on their files. Especially if they are not IO intensive,
in which case they may go unnoticed, and space wasted due to bookend
extents is therefore more likely to happen.

Having a bunch of very small extents, say 4K to 512K, is clearly bad
compared to having just a few 256M extents. But having a few 1G extents
instead of a group of 256M extents probably doesn't make as huge a
difference as in the former case.

Does the 1G extent size benefit some Facebook-specific workload, or some
well known workload?
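
To put the eight-versus-two extent arithmetic from the example above
into a testable form, here is a small sketch (again illustrative, not
from this thread) that counts the extents backing a file with the FIEMAP
ioctl; leaving fm_extent_count at zero makes the kernel report only the
number of mapped extents:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap fm = { 0 };
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * With fm_extent_count == 0 the kernel fills in only
	 * fm_mapped_extents: the number of extents backing the file.
	 */
	fm.fm_length = FIEMAP_MAX_OFFSET;
	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("fiemap");
		return 1;
	}
	printf("%u extents\n", fm.fm_mapped_extents);

	close(fd);
	return 0;
}

Running it against a file created with fallocate -l 2G on btrfs should
report 8 extents under the current 256M cap and 2 under the patched 1G
cap.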


>
> Since I'm making the change, it's incumbent on me to prove it's safe, so
> with that in mind, I would reiterate that I'm fine with dropping it.
>
> > >
> > > In fact, the original 256M is already too large in my opinion.
> > >
> > > Thanks,
> > > Qu
> > > >
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > >   fs/btrfs/inode.c | 2 +-
> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > index 45ebef8d3ea8..fd66586ae2fc 100644
> > > > --- a/fs/btrfs/inode.c
> > > > +++ b/fs/btrfs/inode.c
> > > > @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> > > >       if (trans)
> > > >               own_trans = false;
> > > >       while (num_bytes > 0) {
> > > > -             cur_bytes = min_t(u64, num_bytes, SZ_256M);
> > > > +             cur_bytes = min_t(u64, num_bytes, SZ_1G);
> > > >               cur_bytes = max(cur_bytes, min_size);
> > > >               /*
> > > >                * If we are severely fragmented we could end up with really
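
A side note for readers skimming the quoted diff: since the loop books
one preallocated extent per iteration, the cap inside that min_t() is
exactly the largest extent fallocate will ever produce. A standalone
rendering of just the chunking arithmetic (illustrative, not the kernel
code) makes the old-versus-new split visible:

#include <stdint.h>
#include <stdio.h>

#define SZ_256M	(256ULL << 20)
#define SZ_1G	(1ULL << 30)

/*
 * Mirror of the chunking in __btrfs_prealloc_file_range(): each loop
 * iteration books one preallocated extent of cur_bytes, so the cap is
 * the largest extent fallocate will create.
 */
static void show_chunks(uint64_t num_bytes, uint64_t cap)
{
	int extents = 0;

	while (num_bytes > 0) {
		uint64_t cur_bytes = num_bytes < cap ? num_bytes : cap;

		extents++;
		num_bytes -= cur_bytes;
	}
	printf("%d extents\n", extents);
}

int main(void)
{
	show_chunks(2 * SZ_1G, SZ_256M);	/* old cap: 8 extents */
	show_chunks(2 * SZ_1G, SZ_1G);		/* new cap: 2 extents */
	return 0;
}

Compiled and run, it prints 8 extents for the 256M cap and 2 for the 1G
cap, matching the 2GiB example discussed above.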


Thread overview: 22+ messages
2022-10-05 19:49 [PATCH 0/5] btrfs: data block group size classes Boris Burkov
2022-10-05 19:49 ` [PATCH 1/5] btrfs: 1G falloc extents Boris Burkov
2022-10-06  7:37   ` Qu Wenruo
2022-10-06  9:48     ` Filipe Manana
2022-10-06 18:38       ` Boris Burkov
2022-10-06 19:56         ` Filipe Manana [this message]
2022-10-06 20:41           ` Boris Burkov
2022-10-06 23:03             ` Qu Wenruo
2022-10-06  8:48   ` Johannes Thumshirn
2022-10-07  3:23   ` Wang Yugui
2022-10-07  3:29     ` Qu Wenruo
2022-10-07  3:40       ` Qu Wenruo
2022-10-05 19:49 ` [PATCH 2/5] btrfs: use ffe_ctl in btrfs allocator tracepoints Boris Burkov
2022-10-11 13:03   ` David Sterba
2022-10-14 12:22     ` David Sterba
2022-10-05 19:49 ` [PATCH 3/5] btrfs: add more ffe tracepoints Boris Burkov
2022-10-05 19:49 ` [PATCH 4/5] btrfs: introduce size class to block group allocator Boris Burkov
2022-10-05 19:49 ` [PATCH 5/5] btrfs: load block group size class when caching Boris Burkov
2022-10-06  6:02   ` kernel test robot
2022-10-06  7:31   ` kernel test robot
2022-10-06  8:13   ` kernel test robot
2022-10-11 13:06 ` [PATCH 0/5] btrfs: data block group size classes David Sterba
