On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > > >   - is documenting rejection on request alignment grounds
> > > > > >     (i.e. EINVAL) in the man page sufficient for app
> > > > > >     developers to understand what is going on here?
> > > > >
> > > > > I think so.  The manpage says: "The filesystem does not support
> > > > > reflinking the ranges of the given files", which (to my mind) covers
> > > > > this case of not supporting dedupe of EOF blocks.

[snip]

> tl;dr we don't need a new clone or dedupe API

Well, I wouldn't say _that_ yet.  It seems less likely that the API we
need will be a minor revision of the existing clone or dedupe and more
likely that it will be a complementary API for OOB dedupe engines, or a
dedupe API that takes (device, physical, length) tuples instead of
(fd, offset, length) tuples.

> > For future development I've abandoned the entire dedupe_file_range
> > approach.  I need to be able to read and dedupe the data blocks of
> > the filesystem directly without having to deal with details like which
> > files those blocks belong to, especially on filesystems with lots of
> > existing deduped blocks and snapshots.
>
> IOWs, your desired OOB dedupe algorithm is:
>
> 	a) ask the filesystem where all it's file data is

Actually, it's "ask the filesystem where all the *new* file data is"
since we don't want to read any unique data twice on subsequent runs.

> 	b) read that used space to build a data hash index
> 	c) on all the data hash collisions find the owners of the
> 	   colliding blocks
> 	d) if the block data is the same dedupe it
>
> I agree - that's a simple and effective algorithm.  It's also the
> obvious solution to an expert in the field.

It's obvious, but if you literally implement that, you end up with a
filesystem shattered into billions of fragments.

At step c) I need to figure out _which_ colliding block I want to keep.
That's where the algorithm needs to inspect the higher-level file
structure to determine the physical and logical situation for each
block.  That leads to one of:

	a) store the hashes and block locations in the hash table
	b) relocate the blocks (defrag)
	c) replace some blocks with a reference to other blocks (dedupe)

This sounds a bit like what xfs_fsr does (or will do?).

Bees also operates under a constant-RAM constraint, so it doesn't
operate in two distinct "collect data" and "act on data collected"
passes, and cannot spend memory to store data about more than a few
extents at any time.

bees dedupes contiguous blocks pairwise as the second duplicate block
is discovered--sometimes keeping the second block, if the selection
algorithm prefers new blocks over blocks previously encountered (e.g.
if a defragmented version of an extent comes along, and we don't want
to undo the defrag operation by deduping non-defragged extents over it).
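Each of those pairwise dedupes is ultimately issued through the existing
ioctl.  A minimal sketch of that call (untested, error handling
trimmed), just to show the (fd, offset, length) shape that everything
below has to be translated into:

#include <linux/fs.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Dedupe one pair of ranges: keep the blocks behind keep_fd/keep_off,
 * and make drop_fd/drop_off reference them, but only if the kernel
 * finds the bytes identical. */
static int dedupe_pair(int keep_fd, __u64 keep_off,
                       int drop_fd, __u64 drop_off, __u64 len)
{
        struct file_dedupe_range *r =
                calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
        int rc = -1;

        if (!r)
                return -1;
        r->src_offset = keep_off;
        r->src_length = len;
        r->dest_count = 1;
        r->info[0].dest_fd = drop_fd;
        r->info[0].dest_offset = drop_off;

        if (ioctl(keep_fd, FIDEDUPERANGE, r) == 0 &&
            r->info[0].status == FILE_DEDUPE_RANGE_SAME)
                rc = 0;         /* blocks are now shared */

        free(r);
        return rc;
}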
This does not produce the fastest possible OOB dedupe result--that
comes from an algorithm that does operate in two passes and sorts
duplicate extents by length between the passes.

The fast algorithm produces a free space vs time curve that starts with
a long flat interval followed by a steep slope that gradually flattens
over time, and is suitable for filesystems that are small enough that
the entire filesystem can be processed in a batch run.  The bees
algorithm produces a continuous straight-ish line that is more suitable
for continuous incremental operation using O(1) memory.

> > The file structure is frankly
> > just noise for dedupe on large filesystems.
>
> We learnt that lesson back in the late 1990s.  xfsdump, xfs_fsr, all
> the SGI^WHPE HSM scanning tools, etc all avoid the directory
> structure because it's so slow.  XFS's bulkstat interface, OTOH, can
> scan for target inodes at a over a million inodes/sec if you've got
> the IO and CPU to throw at it....

A million inodes/sec?  btrfs TREE_SEARCH_V2 barely manages 2000
inodes/sec on a 2.2GHz AMD A6--and that's just in the kernel, not
counting delivery of the data to userspace.  I'm clearly using the
wrong filesystem here.

> > I'm building a translation
> > layer for bees that does this--i.e. the main dedupe loop works only with
> > raw data blocks, and the translation layer maps read(blocknr, length)
> > and dedupe(block1, block2, length) requests onto the existing kernel
> > read(fd, offset, length) and dedupe(fd1, off1, fd2, off2, length)
>
> That's FS_IOC_GETFSMAP. :P

No, that's the layer that _uses_ FS_IOC_GETFSMAP.  ;)

> FYI, in 2012 I came up with a plan for dedupe in XFS:
>
> 	a) GETFSMAP to query reverse map tree to find file
> 	   data blocks (don't try to dedupe unused space)
> 	b) read the underlying block device in ascending block
> 	   order and hash each block to build a collision tree.
> 	   Trivially parallelised.
> 	c) GETFSMAP to query rmap for the owners of colliding
> 	   blocks
> 	d) For all file data owner maps of a colliding block
> 		- bulkstat the inode numbers returned by GETFSMAP
> 		  and open_by_handle to get fd's
> 		- run dedupe syscall

I think steps c and d show room for improvement.  I'd like the
filesystem to receive a pair of contiguous colliding block ranges, and
have the filesystem locate and update all the owners at once after
performing the comparison (skip updating any owners that have changed
after the comparison).  That effectively merges c and d into a single
syscall.

There are a few advantages to doing it this way, especially when it
comes to requirements for locking (e.g. no need to lock files against
modification during the block compare).

With the existing dedupe syscall, the app has to dig up from the
(device, physical) tuple through the filesystem to get (fd, offset)
tuples to pass to dedupe, which will dig down through the filesystem to
turn them back into (device, physical) tuples again for comparison and
then dig back up again to modify the (inode, offset) tuples for any
other references.  It seems like the filesystem could skip one leg of
this journey if it starts with the block address.  If the data changed
underneath so the comparison fails, none of the other work is
necessary, and most of the other work requires seeks.
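To make that concrete, the request I'd like to hand to the filesystem
looks something like this--purely illustrative, not an existing or
proposed kernel ABI, just the shape of the arguments:

#include <linux/types.h>

/* Purely illustrative, NOT an existing or proposed kernel ABI: the
 * filesystem looks up the owners of both ranges itself, compares the
 * data under its own locks, and rewrites every reference to the
 * "drop" range only if the bytes still match. */
struct dedupe_physical_range {
        __u32   src_device;     /* device IDs, as in GETFSMAP fmr_device */
        __u32   dst_device;
        __u64   src_physical;   /* byte address of the copy to keep */
        __u64   dst_physical;   /* byte address of the copy to release */
        __u64   length;         /* bytes to compare and share */
        __u64   bytes_deduped;  /* out: 0 if the data no longer matched */
};

The kernel would do the comparison and all the owner updates in one
place, under whatever locks it needs, and simply report zero bytes
deduped if the data no longer matches.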
> You've taken a dozen or so failures to realise what was obvious to
> me at the outset

It was obvious to me too; however, I have to deliver a dedupe engine
that doesn't blow up the system it's running on, and calling
LOGICAL_INO on btrfs more than absolutely necessary has historically
been a bad idea.

I tried many alternatives to avoid this (basically ordering the dedupe
syscalls to try to converge on the correct result eventually), but some
use cases (especially snapshots) cannot be done sanely any other way,
so I'm pressing ahead with pure physical block addressing and hoping
that the filesystems catch up.

> - your translation layer will essentially hold the
> same information that XFS already maintains on disk in it's rmap
> btrees.  IIRC btrfs has semi-equivalent reverse mapping functionality
> on disk (back pointers?), so maybe your best bet would be to
> implement FSMAP for btrfs and then you have a dedupe app that works
> efficiently on both XFS and btrfs...

btrfs has LOGICAL_INO_V2 and TREE_SEARCH_V2 ioctls which provide the
same sort of data as GETFSMAP in a differently structured address
space.  INO_PATHS lets me implement an equivalent of open_by_handle.

The existing translation layer in bees uses the information in the
existing filesystem structures to map logical to physical and back as
required.  The new translation layer operates _only_ in the physical
space at the top, so as far as the core dedupe/defrag engine is
concerned, the filesystem underneath is just an iterable collection of
contiguous data extents that can be read and deduped directly.

The only data bees stores outside of the filesystem is the dedupe block
hash table, the current value of the extent iterator (aka the low key
for the next GETFSMAP call), and the transid of the filesystem the last
time the extent iterator was reset.  btrfs embeds the transid (an
indicator of age) into every data extent record and into metadata page
headers, so bees can rapidly skip old extents in SEARCH_V2 for fast
incremental scans.

The translation layer can work as long as every physical block has a
unique address in a data type that has addition, subtraction, less-than
and bisection operators, and the filesystem can map a position in
physical address space to a position in logical address space and back.
I only need bisection for cases that require backward iteration from a
known key (btrfs csums require this, and the new translation layer
enables the dedupe code to use the existing csums instead of reading
and hashing the data blocks).

On btrfs I open the files to do reading instead of scraping the block
device directly--IMHO a reasonable decision, given that btrfs
implements RAID and compression, which make raw block reads from a
device very complicated.  Profiling so far has never shown open() to
have significant overhead compared to the other operations; however, if
btrfs had an "I am root, read this arbitrary data block for me" ioctl,
I'd eliminate even that little overhead.  It looks like "open device,
seek to physical, read block" is sufficient for XFS (no compression or
other encoding to worry about).

btrfs uses a 64-bit virtual address space (compatible with FIEMAP),
while GETFSMAP uses a 32-bit device ID and a 64-bit offset.  This means
all bees DDT entries are immediately 25% bigger on XFS, unless I
enumerate the device IDs and recycle some zero bits in .physical as a
device ID index.  I wonder if a future btrfs implementation of GETFSMAP
will use the device ID and physical offset fields, or just return a
constant device ID with .physical carrying the 64-bit virtual address
that the rest of the filesystem uses.
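For what it's worth, the GETFSMAP side of the extent iterator would
look roughly like this on XFS (a sketch, untested; scan_physical_extents
and resume_key are made-up names for illustration, the rest is the
stock linux/fsmap.h interface):

#include <linux/fsmap.h>
#include <sys/ioctl.h>
#include <limits.h>
#include <stdlib.h>

#define NR_RECS 1024

/* One pass of the extent iterator: resume_key is the persisted low
 * key described above; each returned record is one contiguous
 * (fmr_device, fmr_physical, fmr_length) extent to read and hash. */
static void scan_physical_extents(int fs_fd, struct fsmap *resume_key)
{
        struct fsmap_head *head = calloc(1, fsmap_sizeof(NR_RECS));
        struct fsmap *hi;
        unsigned int i;

        if (!head)
                return;
        hi = &head->fmh_keys[1];
        head->fmh_count = NR_RECS;
        head->fmh_keys[0] = *resume_key;        /* pick up where we left off */
        hi->fmr_device = UINT_MAX;              /* high key: end of filesystem */
        hi->fmr_flags = UINT_MAX;
        hi->fmr_physical = ULLONG_MAX;
        hi->fmr_owner = ULLONG_MAX;
        hi->fmr_offset = ULLONG_MAX;

        while (ioctl(fs_fd, FS_IOC_GETFSMAP, head) == 0 &&
               head->fmh_entries > 0) {
                for (i = 0; i < head->fmh_entries; i++) {
                        struct fsmap *rec = &head->fmh_recs[i];

                        if (rec->fmr_flags & FMR_OF_SPECIAL_OWNER)
                                continue;       /* free space, metadata, ... */
                        /* read and hash rec->fmr_length bytes at
                         * (rec->fmr_device, rec->fmr_physical) here */
                }
                *resume_key = head->fmh_recs[head->fmh_entries - 1];
                if (resume_key->fmr_flags & FMR_OF_LAST)
                        break;
                fsmap_advance(head);    /* low key = last record returned */
        }
        free(head);
}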
I wonder how compressed data would work with GETFSMAP?  In btrfs the
LOGICAL_INO and FIEMAP ioctls treat the entire compressed extent as a
single entity with no distinct blocks inside.  Bees needs a unique
address for every block, even the compressed ones, so some of that is
handled by emulation using TREE_SEARCH_V2 in the translation layer (the
physical data type is 64-bit extent base plus an uncompressed block
offset).

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com