linux-btrfs.vger.kernel.org archive mirror
* Extent to files
@ 2019-11-04 11:35 Odin Hultgren van der Horst
  2019-11-08 21:23 ` Zygo Blaxell
  0 siblings, 1 reply; 2+ messages in thread
From: Odin Hultgren van der Horst @ 2019-11-04 11:35 UTC (permalink / raw)
  To: linux-btrfs

I did an ioctl(FICLONE) (see ioctl_ficlonerange(2)).  At some point later
I want to be able to check whether the new file still shares all of its
physical storage, knowing only the name of the new file.

I found some people suggesting comparing the files' extents.

But the implementation I looked at knew both files used in the comparison,
so I was wondering whether there is a way to get all files that reference
an extent from user space?

In reality I want a count of the clones (identical files) of a given file,
from user space.

* Re: Extent to files
  2019-11-04 11:35 Extent to files Odin Hultgren van der Horst
@ 2019-11-08 21:23 ` Zygo Blaxell
  0 siblings, 0 replies; 2+ messages in thread
From: Zygo Blaxell @ 2019-11-08 21:23 UTC (permalink / raw)
  To: Odin Hultgren van der Horst; +Cc: linux-btrfs

On Mon, Nov 04, 2019 at 12:35:19PM +0100, Odin Hultgren van der Horst wrote:
> I did an ioctl(FICLONE) (see ioctl_ficlonerange(2)).  At some point later
> I want to be able to check whether the new file still shares all of its
> physical storage, knowing only the name of the new file.

"Shares all its physical storage" is not very specific.  You could run
'filefrag -v' and count extents with and without the "shared" flag.
If the count without the flag is 0, the file is all-shared; if the count
with the flag is 0, the file is all-unique.
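
The counting step could be scripted roughly as below.  This is only a
sketch: it assumes the usual column layout of 'filefrag -v' (flags in the
last colon-separated field of each extent row), and count_shared() and
SAMPLE are made-up names and data for illustration, not real output.

```python
import re

# Extent rows of `filefrag -v` start with "<index>: <first>..<last>:".
EXTENT_ROW = re.compile(r"^\s*\d+:\s+\d+\.\.\s*\d+:")

def count_shared(filefrag_output):
    """Return (shared, unshared) extent counts from `filefrag -v` text."""
    shared = unshared = 0
    for line in filefrag_output.splitlines():
        if not EXTENT_ROW.match(line):
            continue  # header, summary, or blank line
        # Assumption: the flags are the last colon-separated field.
        flags = line.rsplit(":", 1)[1].strip().split(",")
        if "shared" in flags:
            shared += 1
        else:
            unshared += 1
    return shared, unshared

# Hand-written sample in the assumed format (not captured from a real run).
SAMPLE = """\
Filesystem type is: 9123683e
File size of clone.bin is 2097152 (512 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     255:      34816..     35071:    256:             shared
   1:      256..     511:      40960..     41215:    256:      35072: last,eof
clone.bin: 2 extents found
"""

shared, unshared = count_shared(SAMPLE)
```

With this sample both counts are nonzero, so the file would be only
partly shared.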

If the extents are marked shared, filefrag doesn't tell you what is doing
the sharing.  A file can, and often does, share extents with itself.
e.g. you write a 1MB extent, then write 4K in the middle, now you have
two smaller references to the 1MB extent separated by the 4K in the
middle.  Having two references will set the "shared" bit in FIEMAP
even though all references are in the same file.
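
A toy model of that example, to make the reference counting concrete.
This is not btrfs code: overwrite() is a hypothetical helper, and the
model tracks only (file_offset, length, extent_id) per reference, which
is much simpler than the real metadata.

```python
from collections import Counter

def overwrite(refs, new_off, new_len, new_extent):
    """Apply a copy-on-write overwrite to a list of extent references.

    Each reference is (file_offset, length, extent_id).  The old extent is
    never modified; references to it are trimmed or split around the new
    data, and one reference to the new extent is added.
    """
    new_end = new_off + new_len
    out = []
    for off, length, ext in refs:
        end = off + length
        if end <= new_off or off >= new_end:
            out.append((off, length, ext))            # not overlapped
            continue
        if off < new_off:
            out.append((off, new_off - off, ext))     # piece before the write
        if end > new_end:
            out.append((new_end, end - new_end, ext)) # piece after the write
    out.append((new_off, new_len, new_extent))
    return sorted(out)

MiB, KiB = 1 << 20, 1 << 10
refs = [(0, MiB, "A")]                           # one 1MB extent
refs = overwrite(refs, 512 * KiB, 4 * KiB, "B")  # 4K written in the middle
ref_counts = Counter(ext for _, _, ext in refs)
```

After the overwrite, extent "A" has two references, both from the same
file, which is the situation that sets the "shared" bit.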

> I found some people suggesting comparing the files' extents.
> 
> But the implementation I looked at knew both files used in the comparison,
> so I was wondering whether there is a way to get all files that reference
> an extent from user space?

Use the TREE_SEARCH_V2 ioctl with the subvol and inode numbers of the
target file to read the file's EXTENT_DATA items, which give all the
extent bytenrs in the file.  You need the raw extent bytenr (the
"physical" disk_bytenr field) from each extent metadata item in the file.
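
A minimal sketch of this step in Python.  File extents live in the
subvolume tree under EXTENT_DATA keys (type 108).  The struct layouts and
the ioctl number are transcribed from linux/btrfs.h and the on-disk
format as I understand them, so check them against your headers.
parse_extents() and search_file_extents() are made-up names, and only a
single ioctl call is made; a file with more items than fit in the buffer
needs repeated calls with an advanced min_offset.

```python
import fcntl
import struct

BTRFS_IOC_TREE_SEARCH_V2 = 0xC0709411   # _IOWR(0x94, 17, 112-byte args)
BTRFS_EXTENT_DATA_KEY = 108
U64_MAX = (1 << 64) - 1
INLINE = 0                              # BTRFS_FILE_EXTENT_INLINE

SEARCH_KEY = struct.Struct("=7Q4I4Q")   # struct btrfs_ioctl_search_key
SEARCH_HDR = struct.Struct("=3Q2I")     # transid, objectid, offset, type, len
EXTENT_HEAD = struct.Struct("<QQBBHB")  # generation ... type
EXTENT_DISK = struct.Struct("<4Q")      # disk_bytenr, disk_num_bytes, offset, num_bytes

def parse_extents(buf, nr_items):
    """Yield (file_offset, disk_bytenr, disk_num_bytes) per regular extent."""
    pos = 0
    for _ in range(nr_items):
        _tid, _oid, offset, key_type, length = SEARCH_HDR.unpack_from(buf, pos)
        pos += SEARCH_HDR.size
        if key_type == BTRFS_EXTENT_DATA_KEY:
            extent_type = EXTENT_HEAD.unpack_from(buf, pos)[-1]
            if extent_type != INLINE:   # inline extents have no bytenr
                bytenr, disk_len, _, _ = EXTENT_DISK.unpack_from(
                    buf, pos + EXTENT_HEAD.size)
                if bytenr != 0:         # bytenr 0 is a hole
                    yield offset, bytenr, disk_len
        pos += length

def search_file_extents(fd, ino, buf_size=65536):
    """One TREE_SEARCH_V2 call for inode `ino` in the subvolume of `fd`."""
    key = SEARCH_KEY.pack(0, ino, ino, 0, U64_MAX, 0, U64_MAX,
                          BTRFS_EXTENT_DATA_KEY, BTRFS_EXTENT_DATA_KEY,
                          4096, 0, 0, 0, 0, 0)   # tree_id 0: tree of fd
    args = bytearray(key + struct.pack("=Q", buf_size) + bytes(buf_size))
    fcntl.ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args)
    nr_items = struct.unpack_from("=I", args, 64)[0]  # updated by the kernel
    return list(parse_extents(bytes(args[SEARCH_KEY.size + 8:]), nr_items))
```

For a file opened as fd, os.fstat(fd).st_ino should give the inode
number to pass in.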

You can use 'btrfs ins dump-tree' to see the metadata in the filesystem,
and as an example of how to decode the various metadata objects.  I used
'btrfs sub find-new' as an example for walking the metadata trees with
TREE_SEARCH_V2.

Use the LOGICAL_INO_V2 ioctl with the extent bytenrs obtained from
TREE_SEARCH_V2 to discover the (subvol, ino, offset) tuples referencing
each extent.  'btrfs ins logical' does this with LOGICAL_INO v1.  You want
to use the V2 BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET flag so you get all
the referencing extent items (V1 requires repeating the call for every
block in the extent, V2 gets all references at once).
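
A sketch of the LOGICAL_INO_V2 call, under the same caveats: the ioctl
number and struct layouts are my reading of linux/btrfs.h, and
parse_references() and logical_ino_v2() are hypothetical helper names.
The kernel fills a btrfs_data_container whose val[] array holds three
u64 values per reference, in the order inode, offset, root.

```python
import ctypes
import fcntl
import struct

BTRFS_IOC_LOGICAL_INO_V2 = 0xC038943B        # _IOWR(0x94, 59, 56-byte args)
IGNORE_OFFSET = 1 << 0                       # BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET
LOGICAL_INO_ARGS = struct.Struct("=QQ3QQQ")  # logical, size, reserved[3], flags, inodes
CONTAINER_HDR = struct.Struct("=4I")         # bytes_left, bytes_missing, elem_cnt, elem_missed

def parse_references(container):
    """Decode a btrfs_data_container into (subvol, inode, offset) tuples."""
    _left, _missing, elem_cnt, _missed = CONTAINER_HDR.unpack_from(container, 0)
    vals = struct.unpack_from("=%dQ" % elem_cnt, container, CONTAINER_HDR.size)
    # Stored order is inode, offset, root; reorder to (subvol, inode, offset).
    return [(vals[i + 2], vals[i], vals[i + 1])
            for i in range(0, elem_cnt - 2, 3)]

def logical_ino_v2(fd, bytenr, buf_size=65536):
    """Return every (subvol, inode, offset) referencing the extent at bytenr.

    `fd` is any open file descriptor on the filesystem.  A nonzero
    elem_missed in the result means buf_size was too small; this sketch
    does not retry with a larger buffer.
    """
    out = ctypes.create_string_buffer(buf_size)
    args = bytearray(LOGICAL_INO_ARGS.pack(
        bytenr, buf_size, 0, 0, 0, IGNORE_OFFSET, ctypes.addressof(out)))
    fcntl.ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, args)
    return parse_references(out.raw)
```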

Use 'btrfs ins subvolid-resolve' to map subvol IDs to paths in the
filesystem.  You will need to open these paths to use INO_PATHS.

Use the INO_PATHS ioctl (BTRFS_IOC_INO_PATHS) to convert (subvol_fd, inode)
numbers into filenames.  For the FD argument, use the paths obtained from
'subvolid-resolve'.  This tells you the filename relative to the subvol.
Combine with the subvol's name for the full path of the file.
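
A sketch of the path lookup.  The ioctl is named BTRFS_IOC_INO_PATHS in
linux/btrfs.h; the number and layouts below are my transcription of that
header, and parse_paths() and ino_paths() are made-up names.  The kernel
returns a btrfs_data_container in which each val[] entry is a byte
offset, relative to the start of val[], of a NUL-terminated path.

```python
import ctypes
import fcntl
import os
import struct

BTRFS_IOC_INO_PATHS = 0xC0389423          # _IOWR(0x94, 35, 56-byte args)
INO_PATH_ARGS = struct.Struct("=QQ4QQ")   # inum, size, reserved[4], fspath
CONTAINER_HDR = struct.Struct("=4I")      # bytes_left, bytes_missing, elem_cnt, elem_missed

def parse_paths(container):
    """Decode the paths stored in a btrfs_data_container (bytes)."""
    _left, _missing, elem_cnt, _missed = CONTAINER_HDR.unpack_from(container, 0)
    base = CONTAINER_HDR.size             # val[] starts right after the header
    offsets = struct.unpack_from("=%dQ" % elem_cnt, container, base)
    paths = []
    for off in offsets:
        start = base + off
        end = container.index(b"\0", start)
        paths.append(os.fsdecode(container[start:end]))
    return paths

def ino_paths(subvol_fd, inum, buf_size=65536):
    """Return the paths of inode `inum`, relative to the subvolume root.

    `subvol_fd` must be opened inside the subvolume that owns the inode.
    """
    out = ctypes.create_string_buffer(buf_size)
    args = bytearray(INO_PATH_ARGS.pack(
        inum, buf_size, 0, 0, 0, 0, ctypes.addressof(out)))
    fcntl.ioctl(subvol_fd, BTRFS_IOC_INO_PATHS, args)
    return parse_paths(out.raw)
```

An inode with several hard links yields several paths; os.path.join()
with the subvolume's own path gives the full name of each.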

All of the above require root or CAP_SYS_ADMIN privileges to work.

> In reality I want a count of the clones (identical files) of a given file,
> from user space.

Repeat the above for each extent in the target file to build a list of
all extents and what files reference them.

Partial matches between files and extents are possible, so you will
need to decide what to do about them (include in result set, exclude
from result set, diffstat-style output, percentage overlap, make a Venn
diagram, etc).
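
As one example of such a policy, percentage overlap could be computed
along these lines.  The input shapes are assumptions for illustration
(extent sizes keyed by bytenr, and the set of referencing files per
bytenr), and overlap_report() is a made-up name.  It credits a whole
extent to every file that references any part of it, so it overstates
the overlap when references are partial.

```python
from collections import defaultdict

def overlap_report(target_extents, references):
    """Percentage of the target's extent bytes also referenced by each file.

    target_extents: {bytenr: length} for the target file
    references:     {bytenr: set of (subvol, inode)} from the lookups
    """
    total = sum(target_extents.values())
    if total == 0:
        return {}
    shared_bytes = defaultdict(int)
    for bytenr, length in target_extents.items():
        for owner in references.get(bytenr, ()):
            shared_bytes[owner] += length
    return {owner: 100.0 * n / total for owner, n in shared_bytes.items()}

# Made-up numbers: two 4K extents, one referenced by a second file.
report = overlap_report(
    {1000: 4096, 2000: 4096},
    {1000: {(5, 257), (5, 300)}, 2000: {(5, 257)}},
)
```

Here (5, 257) would be the target itself at 100%, and (5, 300) a
partial match at 50%.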

It's also possible to have two files referring to the same extents in
different orders or at different offsets within the extents, so two
files could share 100% of their space but not be identical.
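
A small illustration of that case, using an assumed representation of a
file as a list of (file_offset, bytenr, extent_offset, length) tuples.
extents_used() and same_layout() are hypothetical helpers, and the
numbers are invented.

```python
def extents_used(layout):
    """Set of extent bytenrs referenced anywhere in the file."""
    return {bytenr for _off, bytenr, _ext_off, _len in layout}

def same_layout(a, b):
    """True only if both files map the same bytes at the same offsets."""
    return sorted(a) == sorted(b)

# Both files reference extents 1000 and 2000, but in opposite order.
file_a = [(0, 1000, 0, 4096), (4096, 2000, 0, 4096)]
file_b = [(0, 2000, 0, 4096), (4096, 1000, 0, 4096)]

fully_shared = extents_used(file_a) == extents_used(file_b)
identical = same_layout(file_a, file_b)
```

The two files share all of their space (fully_shared is True) but their
contents differ (identical is False).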

If you only care about the number of files that share one or more blocks
with the target, you can skip some of the steps: you only need the total
number of unique (subvol, inode) pairs, and you can skip the path lookups.
If you do this, you can't tell whether the files are identical, only that
they are at least partly shared.
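
The shortcut amounts to a set union over the reference lists.  A sketch,
assuming the references have already been collected as (subvol, inode,
offset) tuples per extent bytenr; count_sharing_files() is a made-up
name and the data is invented.

```python
def count_sharing_files(references, target):
    """Count distinct files, other than `target`, referencing any extent.

    references: {bytenr: list of (subvol, inode, offset)}
    target:     (subvol, inode) of the file being examined
    """
    owners = set()
    for refs in references.values():
        owners.update((subvol, ino) for subvol, ino, _offset in refs)
    owners.discard(target)   # the file always references its own extents
    return len(owners)

references = {
    1000: [(5, 257, 0), (5, 300, 0), (256, 257, 0)],
    2000: [(5, 257, 4096), (5, 300, 8192)],
}
sharing = count_sharing_files(references, target=(5, 257))
```

Inode numbers are only unique within a subvolume, which is why
(5, 257) and (256, 257) count as different files.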
