* Btrfs: Issues with remove-intensive workload
@ 2021-06-07 20:23 Karaliou, Aliaksei
  2021-06-16  2:01 ` Qu Wenruo
  0 siblings, 1 reply; 3+ messages in thread
From: Karaliou, Aliaksei @ 2021-06-07 20:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sinnamohideen, Shafeeq

Hi,

We have an issue with certain workloads and are looking for a hint on what
to do in our case. Maybe there are ways to improve this in the BTRFS module,
or a way for us to avoid certain behaviors.

Issue:
Occasionally, we have a fairly intensive delete-only workload, with
tens to hundreds of gigabytes of data being removed in as short a period
of time as the filesystem allows. It's not an 'rm -rf' of a tree, but removals
at different locations of the directory hierarchy via individual requests.
Also, in order to maintain consistency after possible power-off/crash events,
we fsync these directories from a few background threads relatively frequently
(the sync period is about 5 seconds).
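
Each sync pass boils down to roughly the following (a minimal sketch; the path
is hypothetical and error handling is trimmed):

/* Minimal sketch: sync one directory's metadata after removals in it.
 * The path is hypothetical; the real threads iterate over many dirs. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int sync_dir(const char *path)
{
        int fd = open(path, O_RDONLY | O_DIRECTORY);

        if (fd < 0) {
                perror("open");
                return -1;
        }
        if (fsync(fd) < 0)        /* this is the call that gets stuck */
                perror("fsync");
        close(fd);
        return 0;
}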

After a few minutes under such load, those background fsyncs get stuck
for varying amounts of time. It might be 30 seconds, usually 200-600, but
we have once seen over 13,000 seconds. Remove operations continue, but since
fsyncs are required operations and we don't allow directories to stay
unsynced 'forever', we eventually block foreground operations in the application.
After that we don't produce more removes and the transaction has a better
chance to finish 'quickly'.

Hardware: 6 HDDs merged into MDRAID 0 with 8M stripe.
Mount options: relatime,space_cache=v2.

For testing we used files of around 100-250M. They had varying numbers of
extents: half of them usually had fewer than 4, but the average was around 200-250.
Files may also be present in snapshots (so they usually have unmodified
copies in other BTRFS subvolumes). There are several foreground threads
which may perform some usual activity (search/modify a database, etc.) and then
remove a file from the tree. This is similar to having a number of
scripts, each of which picks a random file from a random directory, removes it,
sleeps for a fraction of a second, and then repeats.
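
A stripped-down sketch of one such 'deleter' loop (assuming the list of
candidate paths has been built elsewhere):

/* Sketch of one "deleter" thread from the test: pick a random path from a
 * pre-built list, unlink it, sleep for a fraction of a second, repeat. */
#include <stdlib.h>
#include <unistd.h>

void deleter_loop(char **paths, size_t count)
{
        while (count > 0) {
                size_t i = (size_t)rand() % count;

                unlink(paths[i]);
                paths[i] = paths[--count];      /* drop it from the list */
                usleep(rand() % 500000);        /* up to 0.5 s */
        }
}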

Before getting stuck, or before a transaction commit began, we deleted at a
rate of 300-400 files/s. After an hour, this degraded to an average of just
18 files/s.

It seems that creating sparse files with a similar average number of extents
shows the same behavior.

Tests were conducted on OpenSUSE Leap:
 * 15.0 (kernel-default-4.12.14-lp150.11.4 but btrfs is modified -
   several patches from 15.1 applied)
 * 15.2 (kernel-default-5.3.18-lp152.19.2)
 * 15.0 with kernel updated to kernel-default-5.12.6-1.1.gfe25271

Analysis with trace points added using 'perf' showed a few scenarios:
A) Several fsync() calls are blocked on 'btrfs_run_delayed_refs(trans, 0)' in
   btrfs_commit_transaction.
   This issue seems to be fixed by
     '[PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit'
   which was already applied to kernel-default-5.12.6-1.1.gfe25271, where most of
   the tests were conducted.

B) When our background thread calls fsync() on a directory, its inode already has
   the BTRFS_INODE_NEEDS_FULL_SYNC flag set; unfortunately this was a rare case and
   we didn't trace where it came from.
   In this case we perform a real transaction commit and do the full amount of
   delayed-refs work instead of the btrfs-transaction kthread.

C) The btrfs-transaction kthread wakes up and works on committing the current
   transaction. While it is in the 'btrfs_run_delayed_refs(trans, 0)' stage things
   don't look so bad, but we keep adding more delayed refs to be processed, and
   this count is usually much higher than the initial count that this call to
   btrfs_run_delayed_refs set out to take care of.
   Then, when the transaction state changes to TRANS_STATE_COMMIT_START,
   fsync() threads are just stuck at wait_for_commit the whole time.

I think there might be other scenarios once we move away from a pure-delete
workload, but I think these are enough.

The more or less happy scenario here is when the btrfs-transaction kthread is in
charge of the commit (its sleep interval doesn't really make a difference - 30 or
5 seconds). But when one of our background fsync threads or foreground threads is
in charge, that's a potential disaster, because all the delayed-refs work is going
to block that thread, especially a foreground one.

As I understand it, this type of workload is rather extreme, and
the current BTRFS design doesn't provide a way to split delayed-refs handling
across several transactions to allow more granular commits.

Also, an open question for us is how snapshot deletion differs in this case.
Would we get stuck on fsync during snapshot deletion, and if so, when,
and would it be for as long?

I assume that snapshot deletion is a somewhat more efficient operation, since
directories don't need to be updated and fsync'ed. Is this correct?
I once conducted a test of removing several huge files manually vs. removing
them as a subvolume. It seemed faster to delete them as a subvolume.

I expected some throttling in BTRFS when deleting files with many extents,
but I don't see such behavior (and of course there might eventually be a delete
of some 50G file that is a super-sparse VM image - it will have a zillion extents,
and throttling applied only to that delete will not help).
Maybe there is something that could be improved within BTRFS internals?

We must delete files, but we could potentially delay the operation by
moving them to a 'trashbin' directory first and then deleting them slowly.
But even in this case we would need extra information so that we can throttle
well enough that the next fsync will not take minutes (less than 30 seconds would be ideal).
To my understanding, there is no such feedback mechanism that we can use.
The only indicators available to us are file size and the number of extents, which we can obtain via FIEMAP.
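
For reference, the extent count can be obtained without fetching the extents
themselves by issuing FIEMAP with fm_extent_count set to 0, roughly like this
sketch:

/* Sketch: return the number of extents of an open file via FIEMAP.
 * With fm_extent_count == 0 the kernel only reports the count. */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

long count_extents(int fd)
{
        struct fiemap fm;

        memset(&fm, 0, sizeof(fm));
        fm.fm_length = FIEMAP_MAX_OFFSET;   /* whole file */
        fm.fm_extent_count = 0;             /* just count, no extent array */

        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
                return -1;
        return fm.fm_mapped_extents;
}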

An additional complication is snapshots that might be taken while we still have a bunch
of files sitting in this 'trashbin' directory waiting for our background deletion - at least
on the metadata side that means an increased amount of work with each snapshot created.

As an option, we considered using the reflink ioctl to first move portions
of a file (depending on its extent count) to some huge file located in a different
subvolume (on the same device, of course), and then delete that, if it makes
internal operations faster; but it definitely adds the cost of the reflink.
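
The idea would be roughly the following sketch using the FICLONERANGE ioctl
(the 'graveyard' file, offsets, and the helper are made up for illustration):

/* Sketch of the idea: reflink part of a file into a big 'graveyard' file
 * in another subvolume before unlinking the original. Names, offsets and
 * the helper itself are made up for illustration. */
#include <sys/ioctl.h>
#include <linux/fs.h>

int park_range(int src_fd, int graveyard_fd,
               unsigned long long len, unsigned long long dest_off)
{
        struct file_clone_range fcr = {
                .src_fd      = src_fd,
                .src_offset  = 0,
                .src_length  = len,     /* 0 would mean "up to EOF" */
                .dest_offset = dest_off,
        };

        return ioctl(graveyard_fd, FICLONERANGE, &fcr);
}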

Any advice or comments on this issue and my thoughts on mitigations
would be much appreciated.

Best regards,
  Aliaksei


* Re: Btrfs: Issues with remove-intensive workload
  2021-06-07 20:23 Btrfs: Issues with remove-intensive workload Karaliou, Aliaksei
@ 2021-06-16  2:01 ` Qu Wenruo
  2021-06-16 23:01   ` Karaliou, Aliaksei
  0 siblings, 1 reply; 3+ messages in thread
From: Qu Wenruo @ 2021-06-16  2:01 UTC (permalink / raw)
  To: Karaliou, Aliaksei, linux-btrfs; +Cc: Sinnamohideen, Shafeeq, Josef Bacik



On 2021/6/8 4:23 AM, Karaliou, Aliaksei wrote:
> Hi,
>
> We have an issue with certain workloads and are looking for a hint on what
> to do in our case. Maybe there are ways to improve this in the BTRFS module,
> or a way for us to avoid certain behaviors.
>
> Issue:
> Occasionally, we have a fairly intensive delete-only workload, with
> tens to hundreds of gigabytes of data being removed in as short a period
> of time as the filesystem allows.

First things first: even though I hate to admit it, btrfs has a much larger
metadata overhead compared to other filesystems.

We have csums, larger file extent structures, and mandatory metadata CoW,
so metadata performance is indeed worse than in other filesystems.

> It's not an 'rm -rf' of a tree, but removals
> at different locations of the directory hierarchy via individual requests.
> Also, in order to maintain consistency after possible power-off/crash events,
> we fsync these directories from a few background threads relatively frequently
> (the sync period is about 5 seconds).

Fsync is another pain point for btrfs.

But in your case, I don't think fsync is really needed.

Btrfs maintains its metadata correctness through mandatory CoW,
meaning you either see the old metadata (the file not yet deleted) or the
new metadata (the file at least unlinked).

Thus I don't think the extra fsync is needed, and fsync itself is also
pretty heavy in btrfs, even with the log tree optimization.

Or do you mean that even seeing the older files is not acceptable in your
use case?

But anyway, what you describe does indeed seem to be a problem; we
should not block for so long, no matter what the operations are.

>
> After a few minutes under such load, those background fsyncs get stuck
> for varying amounts of time. It might be 30 seconds, usually 200-600, but
> we have once seen over 13,000 seconds. Remove operations continue, but since
> fsyncs are required operations and we don't allow directories to stay
> unsynced 'forever', we eventually block foreground operations in the application.
> After that we don't produce more removes and the transaction has a better
> chance to finish 'quickly'.
>
> Hardware: 6 HDDs merged into MDRAID 0 with 8M stripe.
> Mount options: relatime,space_cache=v2.
>
> For testing we used files of around 100-250M. They had varying numbers of
> extents: half of them usually had fewer than 4, but the average was around 200-250.
> Files may also be present in snapshots (so they usually have unmodified
> copies in other BTRFS subvolumes). There are several foreground threads
> which may perform some usual activity (search/modify a database, etc.) and then
> remove a file from the tree. This is similar to having a number of
> scripts, each of which picks a random file from a random directory, removes it,
> sleeps for a fraction of a second, and then repeats.
>
> Before getting stuck, or before a transaction commit began, we deleted at a
> rate of 300-400 files/s. After an hour, this degraded to an average of just
> 18 files/s.
>
> It seems that creating sparse files with a similar average number of extents
> shows the same behavior.
>
> Tests were conducted on OpenSUSE Leap:
>   * 15.0 (kernel-default-4.12.14-lp150.11.4 but btrfs is modified -
>     several patches from 15.1 applied)
>   * 15.2 (kernel-default-5.3.18-lp152.19.2)
>   * 15.0 with kernel updated to kernel-default-5.12.6-1.1.gfe25271
>
> Analysis with trace points added using 'perf' showed a few scenarios:
> A) Several fsync() calls are blocked on 'btrfs_run_delayed_refs(trans, 0)' in
>    btrfs_commit_transaction.
>    This issue seems to be fixed by
>      '[PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit'
>    which was already applied to kernel-default-5.12.6-1.1.gfe25271, where most of
>    the tests were conducted.
>
> B) When our background thread calls fsync() on a directory, its inode already has
>    the BTRFS_INODE_NEEDS_FULL_SYNC flag set; unfortunately this was a rare case and
>    we didn't trace where it came from.
>    In this case we perform a real transaction commit and do the full amount of
>    delayed-refs work instead of the btrfs-transaction kthread.
>
> C) The btrfs-transaction kthread wakes up and works on committing the current
>    transaction. While it is in the 'btrfs_run_delayed_refs(trans, 0)' stage things
>    don't look so bad, but we keep adding more delayed refs to be processed, and
>    this count is usually much higher than the initial count that this call to
>    btrfs_run_delayed_refs set out to take care of.
>    Then, when the transaction state changes to TRANS_STATE_COMMIT_START,
>    fsync() threads are just stuck at wait_for_commit the whole time.
>
> I think there might be other scenarios once we move away from a pure-delete
> workload, but I think these are enough.
>
> The more or less happy scenario here is when the btrfs-transaction kthread is in
> charge of the commit (its sleep interval doesn't really make a difference - 30 or
> 5 seconds). But when one of our background fsync threads or foreground threads is
> in charge, that's a potential disaster, because all the delayed-refs work is going
> to block that thread, especially a foreground one.

I believe Josef has more experience on this delayed refs problem.
Adding him to CC.

>
> As I understand it, this type of workload is rather extreme, and
> the current BTRFS design doesn't provide a way to split delayed-refs handling
> across several transactions to allow more granular commits.
>
> Also, an open question for us is how snapshot deletion differs in this case.

At least I can answer this.

For subvolume/snapshot deletion, we skip the whole tree balancing during
the operation, and delete trees without rebalancing the tree itself (thus no
CoW).

This reduces the amount of delayed refs, but it can only work when no one but
btrfs itself can modify the tree; thus it only works when the whole tree
is unlinked, which can only happen in subvolume/snapshot deletion.

> Would we get stuck on fsync during snapshot deletion, and if so, when,
> and would it be for as long?

Snapshot deletion can span several transactions, thus I don't think
it should be a problem.

>
> I assume that snapshot deletion is a somewhat more efficient operation, since
> directories don't need to be updated and fsync'ed. Is this correct?

It's more or less correct.

For snapshot deletion, we no longer care about directories at all; all
we care about are:
- Tree nodes
- Tree leaves
- Data extents

We just delete them without CoWing the tree blocks, so only half of the
delayed refs are generated; furthermore, subvolume deletion can span
several transactions, unlike unlink, which must happen in one go.

> I once conducted a test of removing several huge files manually vs. removing
> them as a subvolume. It seemed faster to delete them as a subvolume.
>
> I expected some throttling in BTRFS when deleting files with many extents,
> but I don't see such behavior (and of course there might eventually be a delete
> of some 50G file that is a super-sparse VM image - it will have a zillion extents,
> and throttling applied only to that delete will not help).
> Maybe there is something that could be improved within BTRFS internals?
>
> We must delete files, but we could potentially delay the operation by
> moving them to a 'trashbin' directory first and then deleting them slowly.

In fact, btrfs is already delaying the deletion.

When we delete a file, we just unlink it and mark it as an orphan.
That part happens in the current transaction; the deletion of the
file extents and their csums happens in later transactions.

But even unlinking just one inode makes us CoW the tree and
generate some delayed refs.
When this accumulates, it's quite a lot.

Moving files to a trash bin is even worse IMHO.
Moving the inodes means we need to unlink and link, double the workload
of just unlinking.
And you still need to unlink them anyway, so I don't think it's really
worth it.

Thanks,
Qu

> But even in this case we would need extra information so that we can throttle
> well enough that the next fsync will not take minutes (less than 30 seconds would be ideal).
> To my understanding, there is no such feedback mechanism that we can use.
> The only indicators available to us are file size and the number of extents, which we can obtain via FIEMAP.
>
> An additional complication is snapshots that might be taken while we still have a bunch
> of files sitting in this 'trashbin' directory waiting for our background deletion - at least
> on the metadata side that means an increased amount of work with each snapshot created.
>
> As an option, we considered using the reflink ioctl to first move portions
> of a file (depending on its extent count) to some huge file located in a different
> subvolume (on the same device, of course), and then delete that, if it makes
> internal operations faster; but it definitely adds the cost of the reflink.
>
> Any advice or comments on this issue and my thoughts on mitigations
> would be much appreciated.
>
> Best regards,
>    Aliaksei
>


* Re: Btrfs: Issues with remove-intensive workload
  2021-06-16  2:01 ` Qu Wenruo
@ 2021-06-16 23:01   ` Karaliou, Aliaksei
  0 siblings, 0 replies; 3+ messages in thread
From: Karaliou, Aliaksei @ 2021-06-16 23:01 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Sinnamohideen, Shafeeq, Josef Bacik

> On 2021/6/8 4:23 AM, Karaliou, Aliaksei wrote:
> > Hi,
> >
> > We have an issue with certain workloads and are looking for a hint on what
> > to do in our case. Maybe there are ways to improve this in the BTRFS module,
> > or a way for us to avoid certain behaviors.
> >
> > Issue:
> > Occasionally, we have a fairly intensive delete-only workload, with
> > tens to hundreds of gigabytes of data being removed in as short a period
> > of time as the filesystem allows.
>
> First things first: even though I hate to admit it, btrfs has a much larger
> metadata overhead compared to other filesystems.
>
> We have csums, larger file extent structures, and mandatory metadata CoW,
> so metadata performance is indeed worse than in other filesystems.

Even if that's a bit sad, it's great to know. No matter how general-purpose a
filesystem is supposed to be, it's better to understand it from different angles.

> > It's not an 'rm -rf' of a tree, but removals
> > at different locations of the directory hierarchy via individual requests.
> > Also, in order to maintain consistency after possible power-off/crash events,
> > we fsync these directories from a few background threads relatively frequently
> > (the sync period is about 5 seconds).
>
> Fsync is another pain point for btrfs.
>
> But in your case, I don't think fsync is really needed.
>
> Btrfs maintains its metadata correctness through mandatory CoW,
> meaning you either see the old metadata (the file not yet deleted) or the
> new metadata (the file at least unlinked).
>
> Thus I don't think the extra fsync is needed, and fsync itself is also
> pretty heavy in btrfs, even with the log tree optimization.
>
> Or do you mean that even seeing the older files is not acceptable in your
> use case?

Fsyncs really are a necessity for us. We can deal with inconsistencies after crashes
(read: we hope), but we can hold only a small piece of additional information to handle them.
Think of it like btrfs transactions - once in a while they have to be committed.
Same with fsyncs, for data and metadata. And as soon as an fsync completes, we rely
on BTRFS internal housekeeping to keep everything in a consistent and proper state.

[...]

> > We must delete files, but we could potentially delay the operation by
> > moving them to a 'trashbin' directory first and then deleting them slowly.
>
> In fact, btrfs is already delaying the deletion.
>
> When we delete a file, we just unlink it and mark it as an orphan.
> That part happens in the current transaction; the deletion of the
> file extents and their csums happens in later transactions.
>
> But even unlinking just one inode makes us CoW the tree and
> generate some delayed refs.
> When this accumulates, it's quite a lot.
>
> Moving files to a trash bin is even worse IMHO.
> Moving the inodes means we need to unlink and link, double the workload
> of just unlinking.
> And you still need to unlink them anyway, so I don't think it's really
> worth it.
>
> Thanks, Qu

I understand that a trashbin definitely means extra work, but it allows us
to perform our lovely fsync and be sure that the file is in a 'safe' location from
then on and that the software is not going to bump into it, with no extra measures
needed to keep track of such files. And then we can try to apply 'delayed remove' strategies.
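
Roughly, the move itself would be something like this sketch (hypothetical
paths; both parent directories get fsync'ed so the move is durable):

/* Sketch: move a file into the trashbin and make the move durable by
 * fsyncing both parent directories. All paths are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int fsync_dir(const char *dir)
{
        int fd = open(dir, O_RDONLY | O_DIRECTORY);
        int ret;

        if (fd < 0)
                return -1;
        ret = fsync(fd);
        close(fd);
        return ret;
}

int move_to_trash(const char *src, const char *src_dir,
                  const char *dst, const char *trash_dir)
{
        if (rename(src, dst) < 0)
                return -1;
        /* persist the new link in the trashbin and the unlink in the source dir */
        if (fsync_dir(trash_dir) < 0 || fsync_dir(src_dir) < 0)
                return -1;
        return 0;
}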

Thanks for the info, Qu. I appreciate any input.

Best regards,
  Aliaksei.
