From: Filipe Manana <fdmanana@gmail.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Dave Chinner <david@fromorbit.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Unexpected reflink/subvol snapshot behaviour
Date: Sun, 24 Jan 2021 13:08:55 +0000
Message-ID: <CAL3q7H4zpBuLqqR2_JYO2ru40_0x58ouJBYRisdBc-vC4tP34A@mail.gmail.com>
In-Reply-To: <58c9d792-5af4-7b54-2072-77230658e677@gmx.com>

On Sat, Jan 23, 2021 at 8:46 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2021/1/22 6:20 AM, Dave Chinner wrote:
> > Hi btrfs-gurus,
> >
> > I'm running a simple reflink/snapshot/COW scalability test at the
> > moment. It is just a loop that does "fio overwrite of 10,000 4kB
> > random direct IOs in a 4GB file; snapshot" and I want to check a
> > couple of things I'm seeing with btrfs. fio config file is appended
> > to the email.
> >
> > Firstly, what is the expected "space amplification" of such a
> > workload over 1000 iterations on btrfs? This will write 40GB of user
> > data, and I'm seeing btrfs consume ~220GB of space for the workload
> > regardless of whether I use subvol snapshot or file clones
> > (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> > wondering if this is expected or whether there's something else
> > going on. XFS amplification for 1000 iterations using reflink is
> > only 1.4x, so 5.5x seems somewhat excessive to me.
>
> This is mostly due to the way btrfs handles COW and the lazy extent
> freeing behavior.
>
> For btrfs, an extent only gets freed when there is no reference to any
> part of it.
>
> This means that if we have a file with one 128K file extent written to
> disk, and we then write 4K, which gets COWed to another 4K extent, the
> whole 128K extent is still kept as is; even the no-longer-referenced 4K
> range stays allocated, costing an extra 4K of space.
>
> This not only increases space usage but also increases metadata usage.
> On the other hand, it reduces the complexity of the extent tree and of
> snapshot creation.
>
>
> In the worst case, btrfs can allocate a 128MiB file extent and then be
> unlucky enough to have 127MiB of it overwritten. That takes 127MiB +
> 128MiB of space; only when the last 1MiB of the original extent is no
> longer referenced can the full 128MiB be freed.
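
To make that bookend behaviour concrete, here is a minimal sketch on a
scratch btrfs mount (the mount point and file name are hypothetical, and
buffered writes are used for simplicity):

xfs_io -f -c "pwrite 0 128k" -c "fsync" /mnt/btrfs/foo   # one 128K data extent on disk
xfs_io -c "pwrite 0 4k" -c "fsync" /mnt/btrfs/foo        # the overwrite is COWed into a new 4K extent
filefrag -v /mnt/btrfs/foo   # the file now references the new 4K extent plus 124K of the
                             # old one; the old 128K extent nevertheless stays fully allocated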

That is all true, but it does not apply to Dave's test.
If you look at the fio job, it does direct IO writes, all with a fixed
size of 4K, plus the file they write into was not preallocated
(fallocate=none).
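
Dave's actual job file is appended to his original mail and not reproduced
here, but a job matching what he describes (4GiB file, 4kB random direct-IO
overwrites, ~10,000 IOs per iteration, no preallocation) would look roughly
like the sketch below; the ioengine and iodepth are guesses, not his settings:

# Illustrative reconstruction only; not Dave's real config.
cat > overwrite.fio <<'EOF'
[overwrite]
filename=testfile
size=4g
rw=randwrite
bs=4k
direct=1
fallocate=none
ioengine=libaio
iodepth=16
number_ios=10000
EOF
fio overwrite.fio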

>
>
> Thus the above reflink/snapshot + DIO write workload is going to be very
> unfriendly for a filesystem with lazy extent freeing and default data COW
> behavior.
>
> That's also why btrfs has a worse fragmentation problem.
>
> >
> > On a similar note, the IO bandwidth consumed by btrfs is way out of
> > proportion with the amount of user data being written. I'm seeing
> > multiple GBs being written by btrfs on every iteration - easily
> > exceeding 5GB of writes per cycle in the later iterations of the
> > test. Given that only 40MB of user data is being written per cycle,
> > there's a write amplification factor of well over 100x occurring
> > here. In comparison, XFS is writing roughly consistently at 80MB/s
> > to disk over the course of the entire workload, largely because of
> > journal traffic for the transactions run during COW and clone
> > operations.  Is such a huge amount of IO expected for btrfs in
> > this situation?
>
> That's interesting. Any analysis of the types of bios submitted to the
> device?
>
> My educated guess is that metadata takes most of the space, and due to
> the default DUP metadata profile, it gets doubled to 5G?
>
> >
> > As a side effect of that IO load, btrfs is driving the machine hard
> > into memory reclaim because of the page cache footprint of each
> > writeback cycle. btrfs is dirtying a large number of metadata pages
> > in the page cache (at least 50% of the ram in the machine is dirtied
> > on every snapshot/reflink cycle). Hence when the system needs memory
> > reclaim, it hits large amounts of memory it can't reclaim
> > immediately and things go bad very quickly.  This is causing
> > everything on the machine to stall while btrfs dumps the dirty
> > metadata pages to disk at over 1GB/s and 10,000 IOPS for several
> > seconds. Is this expected behaviour?
>
> This may be caused by the above-mentioned lazy extent freeing (bookend
> extent) behavior.
>
> Especially when 4K dio is submitted, each 4K write will create a new
> extent, greatly increasing metadata usage.
>
> For the 10,000 4KiB DIO writes inside a 4GiB file, it would easily lead
> to 10,000 new extents in just one iteration.
> And after several iterations, the 4GiB file will be so heavily fragmented
> that all extents are just 4K in size (2^20 extents, which will take about
> 100MiB of metadata just for one subvolume).
>
> And since you're also taking snapshots, each new extent will always have
> a reference held in each subvolume, so it can never be freed, causing a
> lot of slowdown purely because of the amount of metadata.
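
A rough back-of-the-envelope check of that figure (the ~100 bytes of fs-tree
metadata per file extent item is an assumption for illustration, not an exact
on-disk size):

echo $(( (4 * 1024 * 1024 * 1024) / 4096 ))   # 1048576 (2^20) 4K extents in a fully fragmented 4GiB file
echo $(( 1048576 * 100 / 1024 / 1024 ))       # ~100 (MiB of extent metadata per subvolume)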
>
> >
> > Next, subvol snapshot and clone time appears to scale with the
> > number of snapshots/clones already present. The initial clone/subvol
> > snapshot command takes a few milliseconds. At 50 snapshots it takes
> > 1.5s. At 200 snapshots it takes 7.5s. At 500 it takes 15s, and above
> > 850 it seems to level off at about 30s a snapshot. There are
> > outliers that take double this time (63s was the longest) and the
> > variation between iterations can be quite substantial. Is this
> > expected scalability?
>
> Snapshot creation forces the current subvolume to be fully committed
> before the snapshot is really taken.
>
> Considering the above metadata overhead, I believe most of the
> performance penalty comes from the metadata writeback, not from the
> snapshot creation itself.
>
> If you just create a big subvolume, sync the fs, and then take as many
> snapshots as you wish, the overhead should be pretty much the same as
> snapshotting an empty subvolume.
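
Something like the following would test that claim (mount point and snapshot
count are hypothetical):

mnt=/mnt/btrfs
btrfs subvolume create "$mnt/src"
# ... populate $mnt/src with a large amount of data ...
sync                                            # commit everything up front
for i in $(seq 1 100); do
    # with no dirty metadata left to flush, each snapshot should be quick
    time btrfs subvolume snapshot "$mnt/src" "$mnt/snap.$i"
done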
>
> >
> > On subvol snapshot execution, there appears to be a bug manifesting
> > occasionally, and it may be one of the reasons for things being so
> > variable. The visible manifestation is that every so often a subvol
> > snapshot takes 0.02s instead of the multiple seconds all the
> > snapshots around it are taking:
>
> That 0.02s is the real overhead of snapshot creation.
>
> The short snapshot creation time means those snapshot creations just
> waited on the same transaction being committed; they didn't need a full
> transaction commit of their own, only the snapshot work itself.
>
>
> [...]
>
> > In these instances, fio takes about as long as I would expect the
> > snapshot to have taken to run. Regardless of the cause, something
> > looks to be broken here...
> >
> > An astute reader might also notice that fio performance really drops
> > away quickly as the number of snapshots goes up. Loop 0 is the "no
> > snapshots" performance. By 10 snapshots, performance is half the
> > no-snapshot rate. By 50 snapshots, performance is a quarter of the
> > no-snapshot performance. It levels out at around 6-7000 IOPS, which is
> > about 15% of the non-snapshot performance. Is this expected
> > performance degradation as snapshot count increases?
>
> No, this is mostly due to the exploding amount of metadata caused by the
> near-worst case workload.
>
> Yeah, btrfs is pretty bad at handling small dio writes, which can easily
> explode the metadata usage.
>
> Thus for such dio cases, we recommend using a preallocated file +
> nodatacow, so that no new extents are created (unless a snapshot is
> involved).
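
A sketch of that setup (paths are hypothetical; note that nodatacow also
disables data checksums and compression, and a snapshot still forces one COW
of each block written afterwards):

mkdir /mnt/btrfs/dio
chattr +C /mnt/btrfs/dio                  # new files in this directory inherit NOCOW
fallocate -l 4G /mnt/btrfs/dio/testfile   # preallocate so DIO overwrites stay in place
# then point the fio job's filename= at /mnt/btrfs/dio/testfile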
>
> >
> > And before you ask, reflink copies of the fio file rather than
> > subvol snapshots have largely the same performance, IO and
> > behavioural characteristics. The only difference is that clone
> > copying also has a cyclic FIO performance dip (every 3-4 cycles)
> > that corresponds with the system driving hard into memory reclaim
> > during periodic writeback from btrfs.
> >
> > FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
> > performance stays largely consistent across all 1000 iterations at
> > around 13-14k +/-2k IOPS. The reflink time also scales linearly with
> > the number of extents in the source file and levels off at about
> > 10-11s per cycle as the extent count in the source file levels off
> > at ~850,000 extents. XFS completes the 1000 iterations of
> > write/clone in about 4 hours, btrfs completes the same part of the
> > workload in about 9 hours.
> >
> > Oh, I almost forget - FIEMAP performance. After the reflink test, I
> > map all the extents in all the cloned files to a) count the extents
> > and b) confirm that the difference between clones is correct (~10000
> > extents not shared with the previous iteration). Pulling the extent
> > maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> > minutes for the whole set when run serialised. btrfs takes 90-100s
> > per clone - after 8 hours it had only managed to map 380 files and
> > was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> > _half a million_ read IOs to map the extents of a single clone that
> > only had a million extents in it. Is it expected that FIEMAP is so
> > slow and IO intensive on cloned files?
>
> With that exploding number of fragments, it definitely needs a lot of
> metadata reads, right?
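
For what it's worth, the per-clone extent counts and mapping times can be
pulled with filefrag, which uses FIEMAP under the hood (file names here are
hypothetical):

for f in /mnt/btrfs/clones/testfile.*; do
    # filefrag walks the file via FIEMAP and prints "<file>: N extents found"
    /usr/bin/time -f "%e s" filefrag "$f"
done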
>
> >
> > As there are no performance anomalies or memory reclaim issues with
> > XFS running this workload, I suspect the issues I note above are
> > btrfs issues, not expected behaviour.  I'm not sure what the
> > expected scalability of btrfs file clones and snapshots are though,
> > so I'm interested to hear if these results are expected or not.
>
> I hate to say it, but yes, you have found the worst-case workload for btrfs.
>
> 4K dio + snapshot is the best way to explode the already high btrfs
> metadata usage, and exploit the lazy extent reclaim behavior.
>
> But if no snapshot is involved, at least the damage is limited: a 4GiB
> file can have at most 1M 4K file extents.
> But with snapshots, there is no upper limit anymore.
>
> Thanks,
> Qu
>
> >
> > Cheers,
> >
> > Dave.
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

