From: Axel Burri <axel@tty0.ch>
To: Graham Cobb <g.btrfs@cobb.uk.net>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Used disk size of a received subvolume?
Date: Thu, 23 May 2019 18:06:31 +0200
Message-ID: <f5df69f6-300e-25c7-3d0d-0798f140dc8d@tty0.ch>
In-Reply-To: <811bcd96-5a8e-cb10-7efb-22c1046e0f42@cobb.uk.net>



On 17/05/2019 17.28, Graham Cobb wrote:
> On 17/05/2019 14:57, Axel Burri wrote:
>> btrfs fi du shows me the information wanted, but only for the last
>> received subvolume (as you said it changes over time, and any later
>> child will share data with it). For all others, it merely shows "this
>> is what gets freed if you delete this subvolume".
> 
> It doesn't even show you that: it is possible to have shared (not
> exclusive) data which is only shared between files within the subvolume,
> and which will be freed if the subvolume is deleted. And, of course, the
> obvious problem that if you only count exclusive then no one is being
> charged for all the shared segments ("Oh, my backup is getting a bit
> expensive. Hmm. I know! I will back up all my files to two different
> destinations, and make sure btrfs is sharing the data between both
> locations! Then no one pays for it! Whoopee!")
> 
> In my opinion, the shared/exclusive information in btrfs fi du is worse
> than useless: it confuses people who think it means something different
> from what it does. And, in btrfs, it isn't really useful to know whether
> something is "exclusive" or not -- what people care about is always
> something else (which is dependent on **where** it is shared, and by whom).

Agreed. Sadly btrfs-filesystem(8) does not give much information on how
"exclusive" should be interpreted.
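
To make the distinction concrete, here is a toy model in Python. It is only
a sketch: the extent ids, sizes and subvolume names are made up, and it
assumes "exclusive" means "this extent has exactly one reference anywhere",
which is my reading of the discussion above, not a documented definition.

```python
from collections import Counter

# Hypothetical extent sizes in bytes (made up for illustration).
EXTENT_SIZE = {"e1": 4096, "e2": 8192, "e3": 16384}

# Hypothetical filesystem: subvolume -> file -> list of extent ids.
# "b-copy.txt" is a reflink copy of "b.txt" inside the same subvolume.
FILESYSTEM = {
    "snap.1": {"a.txt": ["e1"], "b.txt": ["e2"], "b-copy.txt": ["e2"]},
    "snap.2": {"a.txt": ["e1"], "c.txt": ["e3"]},
}


def extents_of(fs, subvol):
    """Set of extent ids referenced by any file in the subvolume."""
    return {e for extents in fs[subvol].values() for e in extents}


def exclusive_bytes(fs, subvol):
    """Bytes in extents that have exactly one reference in the whole filesystem."""
    refs = Counter(
        e for files in fs.values() for extents in files.values() for e in extents
    )
    return sum(EXTENT_SIZE[e] for e in extents_of(fs, subvol) if refs[e] == 1)


def freed_on_delete_bytes(fs, subvol):
    """Bytes in extents that no other subvolume references."""
    elsewhere = set()
    for name in fs:
        if name != subvol:
            elsewhere |= extents_of(fs, name)
    return sum(EXTENT_SIZE[e] for e in extents_of(fs, subvol) - elsewhere)


print(exclusive_bytes(FILESYSTEM, "snap.1"))        # 0
print(freed_on_delete_bytes(FILESYSTEM, "snap.1"))  # 8192
```

In this model "e2" is shared, but only between two files within snap.1, so
it counts as zero exclusive bytes while deleting snap.1 would still free it.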

> The biggest problem is that you haven't defined what **you** (in your
> particular use case) mean by the "size" of a subvolume. For btrfs that
> doesn't have any single obvious definition.
> 
> Most commonly, I think, people mean "how much space on disk would be
> freed up if I deleted this subvolume and all subvolumes contained within
> it", although quite often they mean the similar (but not identical) "how
> much space on disk would be freed up if I deleted just this subvolume".
> And sometimes they actually mean "how much space on disk would be freed
> up if I deleted this subvolume, the subvolumes contained within it, and
> all the snapshots I have taken but are lying around forgotten about in
> some other directory tree somewhere".
> 
> But often they mean something else completely, such as "how much space
> is taken up by the data which was originally created in this subvolume
> but which has been cloned into all sorts of places now and may not even
> be referred to from this subvolume any more" (typically this is the case
> if you want to charge the subvolume owner for the data usage).
> 
> And, of course, another reading of your question would be "how much data
> was transferred during this send/receive operation" (relevant if you are
> running a backup service and want to charge people by how much they are
> sending to the service rather than the amount of data stored).

I actually meant "how much space is taken up by the data compared to the
previous received subvolume", or any similar question which gives
insight on how much disk space is being used over time by send/receive
backups of snapshots of a source subvolume.

After a couple of years of running btrbk I have many backup subvolumes,
and I want to be able to get some statistics on which ones eat up how
much space on disk.

> That is why I created my "extents-list" stuff. This is a horrible hack
> (one day I will rewrite it using the python library) which lets me
> answer questions like: "how much space am I wasting by keeping
> historical snapshots", "how much data is being shared between two
> subvolumes", "how much of the data in my latest snapshot is unique to
> that snapshot" and "how much space would I actually free up if I removed
> (just) these particular directories". None of which can be answered from
> the existing btrfs command line tools (unless I have missed something).
> 
>> And it is pretty slow: on my backup disk (spinning rust, ~2000
>> subvolumes, ~100 sharing data), btrfs fi du takes around 5min for a
>> subvolume of 20GB, while btrfs find-new takes only seconds.
> 
> Yes. Answering the real questions involves taking the FIEMAP data for
> every file involved (which, for some questions, is actually every file
> on the disk!) so it takes a very long time. Days for my multi-terabyte
> backup disk.
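
For reference, fetching the FIEMAP data of a single file can be sketched in
Python as below. This is not how extents-lists does it; it is a minimal
illustration assuming Linux and the struct layout from <linux/fiemap.h>.
The function names and the batch size are my own choices.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B        # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1            # sync the file before mapping
FIEMAP_EXTENT_LAST = 0x1          # last extent in the file
FIEMAP_EXTENT_SHARED = 0x2000     # extent is referenced more than once

HEADER = struct.Struct("=QQIIII")     # struct fiemap, 32 bytes
EXTENT = struct.Struct("=QQQ2QI3I")   # struct fiemap_extent, 56 bytes


def parse_fiemap(buf):
    """Decode a filled fiemap buffer into (logical, physical, length, flags)."""
    mapped = HEADER.unpack_from(buf, 0)[3]   # fm_mapped_extents
    extents = []
    for i in range(mapped):
        fields = EXTENT.unpack_from(buf, HEADER.size + i * EXTENT.size)
        logical, physical, length = fields[0:3]
        extents.append((logical, physical, length, fields[5]))
    return extents


def file_extents(path, batch=128):
    """Return all extents of a file, asking the kernel for `batch` at a time."""
    extents = []
    start = 0
    with open(path, "rb") as f:
        while True:
            buf = bytearray(HEADER.size + batch * EXTENT.size)
            # fm_start, fm_length, fm_flags, fm_mapped_extents,
            # fm_extent_count, fm_reserved
            HEADER.pack_into(buf, 0, start, 0xFFFFFFFFFFFFFFFF - start,
                             FIEMAP_FLAG_SYNC, 0, batch, 0)
            fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
            chunk = parse_fiemap(buf)
            if not chunk:
                break
            extents.extend(chunk)
            last = chunk[-1]
            if last[3] & FIEMAP_EXTENT_LAST:
                break
            start = last[0] + last[2]   # continue after the last logical byte
    return extents
```

Calling file_extents() on every file of every subvolume involved is what
makes this slow: it is one or more ioctl calls per file.
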
> 
>> Summing up, what I'm looking for would be something like:
>>
>>   btrfs fi du -s --exclusive-relative-to=<other-subvol> <subvol>
> 
> You can do that with FIEMAP data. Feel free to look at extents-lists. Also
> feel free to shout "this is a gross hack" and scream at me!
> 
> If you really just need it for two subvols like that
> 
> extents-expr -s <subvol> - <other-subvol>
> 
> will tell you how much space is in extents used in <subvol> but not used
> in <other-subvol>.

Thanks a lot, your scripts are very useful and answer my question.
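
As I understand it, the "<subvol> - <other-subvol>" expression boils down to
a set difference on physical byte ranges. A minimal Python sketch of that
idea follows; it is not the actual extents-expr code, and the function names
and the (start, length) representation are assumptions of mine.

```python
def merge(ranges):
    """Coalesce (start, length) ranges into sorted, disjoint [start, end] pairs."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def bytes_only_in(a, b):
    """Number of bytes covered by ranges in `a` but by no range in `b`."""
    other = merge(b)
    total = 0
    j = 0
    for start, end in merge(a):
        pos = start
        # Permanently skip ranges of `b` that end before this range begins.
        while j < len(other) and other[j][1] <= pos:
            j += 1
        k = j
        while k < len(other) and other[k][0] < end:
            if other[k][0] > pos:
                total += other[k][0] - pos   # gap before the overlap
            pos = max(pos, other[k][1])
            k += 1
        if pos < end:
            total += end - pos               # tail not covered by `b`
    return total


# Made-up physical ranges as (start, length) in bytes.
subvol = [(0, 100), (200, 50)]
other_subvol = [(50, 100), (240, 100)]
print(bytes_only_in(subvol, other_subvol))   # 90
```

Merging first matters because several files in one subvolume can reference
the same or overlapping ranges, and those bytes should be counted once.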

While I love their bashyness, I re-hacked parts of them in perl last
night, so that I can use them within btrbk (not sure though if I want to
unleash this on the masses, as many people will misinterpret the data
and shout at me about how slow this is).

Here's what I have so far:

  # git clone -b extents-diff https://github.com/digint/btrbk.git
  # ./btrbk extents-diff /home --dry-run
  # ./btrbk extents-diff /home
  # ./btrbk extents-diff <subvol>...

If called with a single argument, btrbk looks for all related subvolumes
and prints each one's difference to the previous one, sorted by gen
(transid). While this is usually fine for snapshots, parent-uuid chains
get broken for received subvolumes as soon as an intermediate subvolume
is deleted, so those subvolumes need to be passed as additional arguments.
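
The "difference to the previous one, sorted by gen" logic can be sketched in
Python like this. It is an illustration only, not the perl code in btrbk:
the subvolume names, gen values and extents are made up, and it assumes
extents are shared as whole (start, length) units so that a plain set
difference is enough.

```python
# Hypothetical related subvolumes with their generation and physical extents.
SNAPSHOTS = [
    {"path": "home.20190503", "gen": 130,
     "extents": {(0, 4096), (8192, 4096), (16384, 8192)}},
    {"path": "home.20190501", "gen": 100,
     "extents": {(0, 4096), (4096, 4096)}},
    {"path": "home.20190502", "gen": 120,
     "extents": {(0, 4096), (8192, 4096)}},
]


def diff_to_previous(snapshots):
    """For each subvolume in gen order, count bytes not present in its predecessor."""
    rows = []
    previous = set()   # the oldest subvolume is compared against nothing
    for snap in sorted(snapshots, key=lambda s: s["gen"]):
        added = snap["extents"] - previous
        rows.append((snap["path"], snap["gen"], sum(length for _, length in added)))
        previous = snap["extents"]
    return rows


for path, gen, size in diff_to_previous(SNAPSHOTS):
    print(f"{path}  gen={gen}  diff={size}")
```

Passing the subvolumes explicitly, as in the last command above, simply
replaces the automatic lookup of related subvolumes with a given list.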

The hacky perl module is here:
https://github.com/digint/btrbk/blob/extents-diff/lib/Linux/ExtentsMap.pm

