linux-btrfs.vger.kernel.org archive mirror
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Odin Hultgren van der Horst <odin@digitalgarden.no>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Extent to files
Date: Fri, 8 Nov 2019 16:23:44 -0500
Message-ID: <20191108212343.GQ22121@hungrycats.org>
In-Reply-To: <20191104113519.htdigcg6lzbes6v7@T580.localdomain>

On Mon, Nov 04, 2019 at 12:35:19PM +0100, Odin Hultgren van der Horst wrote:
> I did an ioctl(FICLONE) (IOCTL-FICLONERANGE(2)). At some point later I want
> to be able to check whether the new file still shares all its physical
> storage, knowing only the name of the new file.

"Shares all its physical storage" is not very specific.  You could run
'filefrag -v' and count the extents with and without the "shared" flag.
If the shared count is 0, the file is all-unique; if the unshared count
is 0, the file is all-shared.

If the extents are marked shared, filefrag doesn't tell you what is doing
the sharing.  A file can, and often does, share extents with itself.
e.g. you write a 1MB extent, then write 4K in the middle, now you have
two smaller references to the 1MB extent separated by the 4K in the
middle.  Having two references will set the "shared" bit in FIEMAP
even though all references are in the same file.
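
As a rough sketch of the counting step, the following calls the FIEMAP
ioctl directly (the same interface filefrag uses) and tallies extents by
the "shared" bit.  The function names are made up for illustration, and
the constants and struct layouts are my reading of linux/fs.h and
linux/fiemap.h on a common Linux ABI, so check them against your headers.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B        # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1            # flush dirty data before mapping
FIEMAP_EXTENT_LAST = 0x1          # last extent in the file
FIEMAP_EXTENT_SHARED = 0x2000     # extent has more than one reference
HEADER = struct.Struct("=QQIIII")     # struct fiemap without fm_extents[]
EXTENT = struct.Struct("=QQQ2QI3I")   # struct fiemap_extent


def parse_fiemap(buf):
    """Return (logical, physical, length, flags) tuples from a filled buffer."""
    mapped = HEADER.unpack_from(buf, 0)[3]          # fm_mapped_extents
    out = []
    for i in range(mapped):
        f = EXTENT.unpack_from(buf, HEADER.size + i * EXTENT.size)
        out.append((f[0], f[1], f[2], f[5]))
    return out


def count_shared(extents):
    """Return (shared, unshared) extent counts."""
    shared = sum(1 for e in extents if e[3] & FIEMAP_EXTENT_SHARED)
    return shared, len(extents) - shared


def fiemap(path, batch=256):
    """Collect all extents of a file, asking for `batch` extents per call."""
    extents, start = [], 0
    with open(path, "rb") as f:
        while True:
            buf = bytearray(HEADER.size + batch * EXTENT.size)
            HEADER.pack_into(buf, 0, start, 2**64 - 1 - start,
                             FIEMAP_FLAG_SYNC, 0, batch, 0)
            fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
            got = parse_fiemap(buf)
            extents += got
            if not got or got[-1][3] & FIEMAP_EXTENT_LAST:
                return extents
            start = got[-1][0] + got[-1][2]     # resume after the last extent
```

count_shared(fiemap(path)) then gives the two numbers to compare against 0.
Keep in mind the caveat above: a nonzero shared count does not say who is
doing the sharing.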

> I found some people suggesting to compare the files extents.
> 
> But the implementation I looked at knew both files used in the comparison,
> so I was wondering if there is a way to get all files that reference an
> extent in user space?

Use TREE_SEARCH_V2 with the subvol and inode numbers of the target file
to read the file's EXTENT_DATA items and get all the extent bytenrs in
the file.  You need the raw extent bytenr ("physical", the disk_bytenr
field) from each extent metadata item in the file.

You can use 'btrfs ins dump-tree' to see the metadata in the filesystem,
and as an example of how to decode the various metadata objects.  I used
'btrfs sub find-new' as an example for walking the metadata trees with
TREE_SEARCH_V2.
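
A minimal sketch of that search, assuming the struct layouts and ioctl
number from linux/btrfs.h and linux/btrfs_tree.h as I understand them.
The function names are hypothetical.  It passes tree_id 0, which as far
as I know makes the kernel search the subvol that the open fd belongs
to, and it issues a single call, so a file with many extents would need
the call repeated with min_offset advanced past the last item returned.

```python
import fcntl
import os
import struct

BTRFS_IOC_TREE_SEARCH_V2 = 0xC0709411   # _IOWR(0x94, 17, search_args_v2)
BTRFS_EXTENT_DATA_KEY = 108
U64_MAX = 2**64 - 1
# struct btrfs_ioctl_search_key followed by the buf_size field
SEARCH_KEY = struct.Struct("=7Q4I4QQ")
# struct btrfs_ioctl_search_header: transid, objectid, offset, type, len
SEARCH_HDR = struct.Struct("=QQQII")
# struct btrfs_file_extent_item (packed, little-endian on disk)
FILE_EXTENT = struct.Struct("<QQBBHBQQQQ")


def parse_extent_data(buf, nr_items):
    """Yield (file_offset, disk_bytenr, disk_num_bytes) per regular extent."""
    pos = 0
    for _ in range(nr_items):
        _tid, _oid, offset, ktype, length = SEARCH_HDR.unpack_from(buf, pos)
        pos += SEARCH_HDR.size
        if ktype == BTRFS_EXTENT_DATA_KEY and length >= FILE_EXTENT.size:
            item = FILE_EXTENT.unpack_from(buf, pos)
            # item[5] is the extent type (0 = inline); bytenr 0 is a hole
            if item[5] != 0 and item[6] != 0:
                yield offset, item[6], item[7]
        pos += length


def file_extents(path, buf_size=65536):
    """One TREE_SEARCH_V2 call for the EXTENT_DATA items of `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        ino = os.fstat(fd).st_ino
        args = bytearray(SEARCH_KEY.size + buf_size)
        SEARCH_KEY.pack_into(
            args, 0,
            0, ino, ino, 0, U64_MAX, 0, U64_MAX,   # tree, objectid, offset, transid
            BTRFS_EXTENT_DATA_KEY, BTRFS_EXTENT_DATA_KEY, 4096, 0,
            0, 0, 0, 0, buf_size)
        fcntl.ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args)
        nr_items = struct.unpack_from("=I", args, 64)[0]   # updated by kernel
        return list(parse_extent_data(bytes(args[SEARCH_KEY.size:]), nr_items))
    finally:
        os.close(fd)
```

The disk_bytenr values (second element of each tuple) are what the next
step needs.  Several items in one file can carry the same disk_bytenr,
so deduplicate before looking up references.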

Use the LOGICAL_INO_V2 ioctl with the extent bytenrs obtained from
TREE_SEARCH_V2 to discover the (subvol, ino, offset) tuples referencing
each extent.  'btrfs ins logical' does this with LOGICAL_INO v1.  You want
to use the V2 BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET flag so you get all
the referencing extent items (V1 requires repeating the call for every
block in the extent, V2 gets all references at once).
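
Something like the following could issue the V2 call.  It is a sketch
under the assumption that the argument struct and the result container
are laid out as in linux/btrfs.h (each reference returned as three u64
values: inode, offset, root).  The function names and the 1 MiB result
buffer size are arbitrary choices of mine.

```python
import ctypes
import fcntl
import os
import struct

BTRFS_IOC_LOGICAL_INO_V2 = 0xC038943B   # _IOWR(0x94, 59, logical_ino_args)
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET = 1
# struct btrfs_ioctl_logical_ino_args: logical, size, reserved[3], flags, inodes
LOGICAL_INO_ARGS = struct.Struct("=QQ3QQQ")
# struct btrfs_data_container header:
# bytes_left, bytes_missing, elem_cnt, elem_missed
CONTAINER_HDR = struct.Struct("=IIII")


def parse_logical_ino(container):
    """Return (subvol, inode, offset) tuples from a filled data container."""
    _left, _missing, elem_cnt, _missed = CONTAINER_HDR.unpack_from(container, 0)
    vals = struct.unpack_from("=%dQ" % elem_cnt, container, CONTAINER_HDR.size)
    # elem_cnt counts u64 values, three per reference: inode, offset, root
    return [(vals[i + 2], vals[i], vals[i + 1]) for i in range(0, elem_cnt, 3)]


def extent_references(fs_path, bytenr, size=1 << 20):
    """Look up every file extent item that references the extent at `bytenr`.

    `fs_path` is any path on the filesystem.  A nonzero elem_missed in the
    result header would mean `size` was too small; this sketch ignores it.
    """
    out = ctypes.create_string_buffer(size)     # kernel writes results here
    args = bytearray(LOGICAL_INO_ARGS.pack(
        bytenr, size, 0, 0, 0,
        BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET, ctypes.addressof(out)))
    fd = os.open(fs_path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, args)
    finally:
        os.close(fd)
    return parse_logical_ino(out.raw)
```

Calling extent_references() once per distinct extent bytenr yields the
(subvol, ino, offset) tuples described above.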

Use 'btrfs ins subvolid-resolve' to map subvol IDs to paths in the
filesystem.  You will need to open these paths to use INO_PATHS.

Use the INO_PATHS ioctl to convert (subvol_fd, inode) pairs into
filenames.  For the FD argument, open the paths obtained from
'subvolid-resolve'.  This tells you the filename relative to the subvol.
Combine it with the subvol's path for the full path of the file.
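
The inode-to-filename lookup might look like this.  Again a sketch, not
a reference implementation: the ioctl is BTRFS_IOC_INO_PATHS in
linux/btrfs.h, the struct layout is my reading of that header, and the
function names and buffer size are invented for the example.  The
result container holds byte offsets, relative to the start of its value
array, that point at NUL-terminated path strings.

```python
import ctypes
import fcntl
import os
import struct

BTRFS_IOC_INO_PATHS = 0xC0389423   # _IOWR(0x94, 35, ino_path_args)
# struct btrfs_ioctl_ino_path_args: inum, size, reserved[4], fspath
INO_PATH_ARGS = struct.Struct("=QQ4QQ")
# struct btrfs_data_container header:
# bytes_left, bytes_missing, elem_cnt, elem_missed
CONTAINER_HDR = struct.Struct("=IIII")


def parse_paths(container):
    """Return the path strings stored in a filled data container."""
    _left, _missing, elem_cnt, _missed = CONTAINER_HDR.unpack_from(container, 0)
    base = CONTAINER_HDR.size                   # start of the val[] array
    offsets = struct.unpack_from("=%dQ" % elem_cnt, container, base)
    paths = []
    for off in offsets:
        start = base + off                      # offsets are relative to val[]
        end = container.index(b"\0", start)
        paths.append(os.fsdecode(container[start:end]))
    return paths


def inode_paths(subvol_path, inum, size=4096):
    """Return full paths of every name of inode `inum` in the subvol.

    `subvol_path` is the path of the subvol, e.g. as printed by
    'btrfs ins subvolid-resolve'.  A file with many hard links may need
    a larger `size`.
    """
    out = ctypes.create_string_buffer(size)     # kernel writes results here
    args = bytearray(INO_PATH_ARGS.pack(
        inum, size, 0, 0, 0, 0, ctypes.addressof(out)))
    fd = os.open(subvol_path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_INO_PATHS, args)
    finally:
        os.close(fd)
    return [os.path.join(subvol_path, p) for p in parse_paths(out.raw)]
```

An inode with several hard links returns several paths, so one
(subvol, inode) pair can map to more than one filename.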

All of the above require root or CAP_SYS_ADMIN privileges to work.

> In reality I want a count of clones/(identical files) of a given file
> in user space.

Repeat the above for each extent in the target file to build a list of
all extents and what files reference them.

Partial matches between files and extents are possible, so you will
need to decide what to do about them (include in result set, exclude
from result set, diffstat-style output, percentage overlap, make a Venn
diagram, etc).

It's also possible to have two files referring to the same extents in
different orders or at different offsets within the extents, so two
files could share 100% of their space but not be identical.

If you only care about the number of files that have one or more blocks
shared, you can skip some of the steps, i.e. you only need the total
number of unique (subvol, inode) pairs and you can skip the path lookups.
If you do this, you can't tell whether the files are identical, only
that they are at least partly shared.
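
For the counting shortcut, one possible way to reduce the per-extent
reference lists to a set of files is sketched here.  The function name
and the input shape (a dict from extent bytenr to the (subvol, inode,
offset) tuples found for it) are assumptions made for the example.

```python
from collections import defaultdict


def sharing_summary(target, refs_by_extent):
    """Map each other file to the fraction of the target's extents it shares.

    target:         (subvol, inode) of the file being examined
    refs_by_extent: {bytenr: [(subvol, inode, offset), ...]} with one entry
                    per distinct extent of the target file
    """
    extents_per_file = defaultdict(set)
    for bytenr, refs in refs_by_extent.items():
        for subvol, inode, _offset in refs:
            if (subvol, inode) != target:       # ignore self-references
                extents_per_file[(subvol, inode)].add(bytenr)
    total = len(refs_by_extent)
    return {f: len(e) / total for f, e in extents_per_file.items()}
```

len(sharing_summary(...)) is the number of other files sharing at least
one extent.  A fraction of 1.0 only says that every extent of the target
is also referenced by that file; as noted above, the references may be
in a different order or at different offsets, so it does not prove the
files are identical.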
