Btrfs: Issues with remove-intensive workload

* Btrfs: Issues with remove-intensive workload
@ 2021-06-07 20:23 Karaliou, Aliaksei
  2021-06-16  2:01 ` Qu Wenruo
  0 siblings, 1 reply; 3+ messages in thread
From: Karaliou, Aliaksei @ 2021-06-07 20:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sinnamohideen, Shafeeq

Hi,

We have an issue with certain workloads and are looking for a hint regarding what
to do in our case. Maybe there are ways of improving this in BTRFS module,
or there is a way for us to avoid certain behaviors.

Issue:
Occasionally, we have a pretty intensive delete-only workload with
tens to hundreds of gigabytes of data being removed during a rather short period
of time as the filesystem allows. It's not 'rm -rf' of a tree but removes
in different locations of directory hierarchy tree by individual requests.
Also, in order to maintain consistency after possible power off/crash events
we fsync these directories with a few background threads relatively frequently
(period of sync is about 5 seconds).

After a few minutes under such load, those background fsyncs are getting stuck
for varying amounts of time. It might be 30 seconds, usually 200-600, but
we've once seen over 13K seconds. Remove operations continue, but since
fsyncs are required operations and we don't allow directories to be
unsynced 'forever', we eventually block foreground operations in application.
After that we don't produce more removes and the transaction has a better
chance to finish 'quickly'.

Hardware: 6 HDDs merged into MDRAID 0 with 8M stripe.
Mount options: relatime,space_cache=v2.

For testing we used files around 100-250M. They had different number of
extents. Half of them had usually less than 4, but on average around 200-250.
Files might be located in snapshots as well (so they have usually unmodified
copies in other BTRFS subvolumes). There are several foreground threads
which may perform some usual activity (search/modify database, etc.) then
remove a file from a tree. This seems similar to having an number of
scripts each picks up a random file from random directory, remove it,
then sleep fraction of a second, and then repeat.

Before getting stuck or transaction commit begin, we deleted at a rate of
300-400 files/s. After an hour, this degraded to an average of just
18 file/s.

It seems that creating sparse files with similar average number of extents
shows same behavior.

Tests were conducted on OpenSUSE Leap:
 * 15.0 (kernel-default-4.12.14-lp150.11.4 but btrfs is modified -
   several patches from 15.1 applied)
 * 15.2 (kernel-default-5.3.18-lp152.19.2)
 * 15.0 with kernel updated to kernel-default-5.12.6-1.1.gfe25271

Analysis with trace points added using 'perf' showed a few scenarios:
A) Several fsync() calls are blocked on 'btrfs_run_delayed_refs(trans, 0)' in
   btrfs_commit_transaction.
   This issue seems to be fixed by
     '[PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit'
   which was already applied to kernel-default-5.12.6-1.1.gfe25271 where most of tests
   were conducted.

B) When our background thread calls fsync() on a directory, its inode already has
   BTRFS_INODE_NEEDS_FULL_SYNC flag set, unfortunately it was rare case and we
   didn't trace where it came from.
   In this case we perform a real-world transaction commit and are doing full
   amount of delayed-refs job instead of btrfs-transaction kthread.

C) btrfs-transaction kthread wakes up and works on committing current transaction.
   While it is on 'btrfs_run_delayed_refs(trans, 0)' stage everything doesn't
   seem to be so bad, but we are adding more delayed refs to be processed
   and this count is usually much more than initial triggering count that this call
   to btrfs_run_delayed_refs is going to take care of.
   Then when we change transaction state to TRANS_STATE_COMMIT_START
   fsync() threads are just stuck at wait_for_commit for a whole time.

I think there might be other scenarios when we move away from pure-delete
workload, but I think these are enough.

More or less happy scenario here is when btrfs-transaction kthread is in charge
of commit (it's sleep interval doesn't really makes difference - 30 or 5 seconds).
But when one of our background fsync threads or foreground threads are in charge
that's a potential disaster because all delayed-refs job is going to block
this thread, especially foreground one.

As I understand, this type of workload is rather extreme and
current BTRFS design doesn't allow to for a way to split delayed-refs handling
across several transactions to allow more granular commits.

Also, an open question for us is how snapshot delete is different in this case?
Will we and when would we get stuck on fsync for snapshot deletion? Would it be
as long?

I assume that snapshot delete is a bit more efficient operation since
directories don't need to be updated and fsync'ed. Is this correct?
I once conducted a test of removing several huge files manually vs removing
them as a subvolume. It seem faster to delete as a subvolume.

I expected some throttling in BTRFS when deleting files with many extents,
but I don't see such behavior (and of course there might be delete of some 50G file
eventually, which is VM image and super-sparse - it will have zillion extents
and throttling applied only to this delete will not help).
Maybe there is something that might be improved within BTRFS internals?

We must delete files, but we potentially may delay this operation by just
moving them to a 'trashbin' directory, then deleting slowly.
But even in this case we would need to have extra information so we can throttle this
well enough so that the next fsync will not take minutes (less than 30 seconds would be ideal).
To my understanding, there is no such feedback mechanism that we can use.
The only indicator for us is file size and number of extents that we may obtain via FIEMAP.

An additional complication are snapshots that might be taken when we still have a bunch
of files sitting in this 'trashbin' directory waiting for our background deletion - at least
from metadata side that means increased amount of work with each snapshot created.

As an option we considered usage of reflink ioctl to first move portions
of file (depending extents count) to some huge file located in different
volume (on the same device of course), then delete it if that makes
internal operations faster, but that definitely adds cost of reflink.

Any advice or comments on this issue and my thoughts on mitigations
would be much appreciated.

Best regards,
  Aliaksei

^ permalink raw reply	[flat|nested] 3+ messages in thread