linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] Btrfs: send, skip backreference walking for extents with many references
Date: Sat, 23 Nov 2019 13:33:33 +0000	[thread overview]
Message-ID: <CAL3q7H52LYCnDdFoObCCWjrCQqLOpcPUYeD28pXtET25-ycM9w@mail.gmail.com> (raw)
In-Reply-To: <20191123052741.GJ22121@hungrycats.org>

On Sat, Nov 23, 2019 at 5:29 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Backreference walking, which is used by send to figure if it can issue
> > clone operations instead of write operations, can be very slow and use too
> > much memory when extents have many references. This change simply skips
> > backreference walking when an extent has more than 64 references, in which
> > case we fallback to a write operation instead of a clone operation. This
> > limit is conservative and in practice I observed no signicant slowdown
> > with up to 100 references and still low memory usage up to that limit.
> >
> > This is a temporary workaround until there are speedups in the backref
> > walking code, and as such it does not attempt to add extra interfaces or
> > knobs to tweak the threshold.
> >
> > Reported-by: Atemu <atemu.main@gmail.com>
> > Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> > index 123ac54af071..518ec1265a0c 100644
> > --- a/fs/btrfs/send.c
> > +++ b/fs/btrfs/send.c
> > @@ -25,6 +25,14 @@
> >  #include "compression.h"
> >
> >  /*
> > + * Maximum number of references an extent can have in order for us to attempt to
> > + * issue clone operations instead of write operations. This currently exists to
> > + * avoid hitting limitations of the backreference walking code (taking a lot of
> > + * time and using too much memory for extents with large number of references).
> > + */
> > +#define SEND_MAX_EXTENT_REFS 64
> > +
> > +/*
> >   * A fs_path is a helper to dynamically build path names with unknown size.
> >   * It reallocates the internal buffer on demand.
> >   * It allows fast adding of path elements on the right side (normal path) and
> > @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
> >       struct clone_root *cur_clone_root;
> >       struct btrfs_key found_key;
> >       struct btrfs_path *tmp_path;
> > +     struct btrfs_extent_item *ei;
> >       int compressed;
> >       u32 i;
> >
> > @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
> >       ret = extent_from_logical(fs_info, disk_byte, tmp_path,
> >                                 &found_key, &flags);
> >       up_read(&fs_info->commit_root_sem);
> > -     btrfs_release_path(tmp_path);
> >
> >       if (ret < 0)
> >               goto out;
> > @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
> >               goto out;
> >       }
> >
> > +     ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
> > +                         struct btrfs_extent_item);
> > +     /*
> > +      * Backreference walking (iterate_extent_inodes() below) is currently
> > +      * too expensive when an extent has a large number of references, both
> > +      * in time spent and used memory. So for now just fallback to write
> > +      * operations instead of clone operations when an extent has more than
> > +      * a certain amount of references.
> > +      */
> > +     if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
>
> So...does this...work?

I wouldn't send it if it didn't work, at least for the case reported.

There's even a test case for this, btrfs/130 from fstests:

https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/btrfs/130

It used to take several hours to complete on a test vm I use
frequently, now it completes in 1 or 2 seconds.

While that is about an extent shared only within the same file, the
same happens if the extent is reflinked to many other files (either
all part of the same subvolume or belonging to different subvolumes).

>
> I ask because I looked at this a few years ago as a way to spend less
> time doing LOGICAL_INO calls during dedupe (and especially avoid the

To give some context to those not familiar with you, you are
mentioning a user space dedupe tool that doesn't call the btrfs dedup
ioctl, and calls the clone ioctl directly, while also using the
LOGICAL_INO ioctl.

> 8-orders-of-magnitude performance degradation in the bad cases), but
> I found that it was useless: it wasn't proportional to the number of
> extents, nor was it an upper or lower bound, nor could extents be sorted
> by number of references using extent.refs as a key, nor could it predict
> the amount of time it would take for iterate_extent_inodes() to run.
> I guess I'm surprised you got a different result.
>
> If the 'refs' field is 1, the extent might have somewhere between 1
> and 3500 root/inode/offset references.

Yes, that's due to snapshotting and the way it works.
Creating a snapshot consists of COWing the root node of a tree only,
which will not increment extent references (for data extents,
referenced in leafs) if the tree has an height of 2 or more.

Example:

1) You have a subvolume tree with 2 levels (root at level 1, leafs at level 0);

2) You have a file there that refers to extent A, not shared with
anyone else, and has a reference count of 1 in the extent item at the
extent tree;

3) After you snapshot the subvolume, extent A will still have 1
reference count in its item at the extent tree.

If you do anything that changes the leaf that refers to extent A
(resulting in it being COWed), in either tree, btrfs will take care of
adding a new backreference to the extent and bump the reference count
to 2 for extent A.

If on the other hand, if the source subvolume was really small,
consisting of a single node/leaf (root at level 0), then you would get
2 references for extent A immediately after making the snapshot.

> But it's not a lower bound,
> e.g. here is an extent on a random filesystem where extent_refs is a
> big number:
>
>         item 87 key (1344172032 EXTENT_ITEM 4096) itemoff 11671 itemsize 24
>                 refs 897 gen 2668358 flags DATA
>
> LOGICAL_INO_V2 only finds 170 extent references:
>
>         # btrfs ins log 1344172032 -P . | wc -l
>         170

Pass "-s 65536", the default buffer is likely not large enough to all
inode/root numbers.

>
> Is there a good primer somewhere on how the reference counting works in
> btrfs?

Dunno. Source code, and an old paper about btrfs:

https://domino.research.ibm.com/library/cyberdig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf

Maybe the wiki has more references.

>
> > +             ret = -ENOENT;
> > +             goto out;
> > +     }
> > +     btrfs_release_path(tmp_path);
> > +
> >       /*
> >        * Setup the clone roots.
> >        */
> > --
> > 2.11.0
> >

  reply	other threads:[~2019-11-23 13:33 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-30 12:23 [PATCH] Btrfs: send, skip backreference walking for extents with many references fdmanana
2019-10-30 12:47 ` Qu Wenruo
2019-11-01 14:50 ` David Sterba
2019-11-23  5:27 ` Zygo Blaxell
2019-11-23 13:33   ` Filipe Manana [this message]
2019-11-24  1:33     ` Zygo Blaxell
2019-11-25 15:31       ` Filipe Manana
2019-11-26 23:02         ` Zygo Blaxell
2019-11-27 11:42           ` Filipe Manana
2019-11-27 18:09             ` Zygo Blaxell
2019-11-27 18:20               ` Zygo Blaxell
2019-11-29 15:25                 ` Filipe Manana
2019-11-26 17:52 ` Martin Raiber
2020-01-18 18:41   ` Martin Raiber
2020-01-20 10:57     ` Filipe Manana
2020-01-20 11:30       ` Martin Raiber
     [not found]       ` <269ad5af-c73e-ac38-5e60-39c3caa11202@urbackup.org>
2020-01-22  1:22         ` Martin Raiber

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAL3q7H52LYCnDdFoObCCWjrCQqLOpcPUYeD28pXtET25-ycM9w@mail.gmail.com \
    --to=fdmanana@kernel.org \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).