From: Dave Chinner <david@fromorbit.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	fdmanana@kernel.org, fstests@vger.kernel.org,
	linux-btrfs@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
Date: Thu, 6 Sep 2018 18:38:09 +1000
Message-ID: <20180906083809.GF27618@dastard>
In-Reply-To: <20180831051045.GI26989@hungrycats.org>

On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > > 	- is documenting rejection on request alignment grounds
> > > > > 	  (i.e. EINVAL) in the man page sufficient for app
> > > > > 	  developers to understand what is going on here?
> > > > 
> > > > I think so.  The manpage says: "The filesystem does not support
> > > > reflinking the ranges of the given files", which (to my mind) covers
> > > > this case of not supporting dedupe of EOF blocks.
> > > 
> > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > > files (one extent) cannot be deduped at all, because the EOF block holds
> > > a reference to the entire dst extent.  If a dedupe app doesn't go all the
> > > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > > last extent of the file as the benefit would be zero or slightly negative.
> > 
> > That's a filesystem implementation issue, not an API or application
> > issue.
> 
> The API and application issue remains even if btrfs is not considered.
> btrfs is just the worst case outcome.  Other filesystems still have
> fragmentation issues, and applications have efficiency-vs-capability
> tradeoffs to make if they can't rely on dedupe-to-EOF being available.
> 
> Tools like 'cp --reflink=auto' work by trying the best case, then falling
> back to a second choice if the first choice returns an error.

Well, yes. That's necessary for the "cp" tool to behave according to
user expectations.  That's not a kernel API issue - that's just an
implementation of an *application requirement*.  Indeed, this is
identical to the behaviour of rename() in "mv" - if rename fails
with -EXDEV, mv needs to fall back to a manual copy because the user
expects the file to be moved.

IOWs, these application-level requirements you talk about are just
not relevant to the kernel API for dedupe/clone operations.
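
That pattern is trivial for an application to implement, too - try the
cheap operation first, fall back to the generic one when the kernel
says no. Something like this untested sketch (FICLONE is the ioctl
behind reflink cloning; error handling mostly elided):

#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */
#include <unistd.h>

static int clone_or_copy(int src_fd, int dst_fd)
{
	/* best case: ask the filesystem to share the extents */
	if (ioctl(dst_fd, FICLONE, src_fd) == 0)
		return 0;

	/* EOPNOTSUPP, EXDEV, EINVAL, ...: the user asked for a copy,
	 * not a reflink, so fall back to a plain data copy */
	char buf[1 << 16];
	ssize_t n;

	while ((n = read(src_fd, buf, sizeof(buf))) > 0) {
		char *p = buf;
		while (n > 0) {
			ssize_t w = write(dst_fd, p, n);
			if (w < 0)
				return -1;
			p += w;
			n -= w;
		}
	}
	return n < 0 ? -1 : 0;
}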

[snip]

> It is something that naive dedupe apps will do.  "naive" here meaning
> "does not dive deeply into the filesystem's physical structure (or survey
> the entire filesystem with FIEMAP) to determine that the thousand-refs
> situation does not exist at dst prior to invoking the dedupe() call."

/me sighs and points at FS_IOC_GETFSMAP

$ man ioctl_getfsmap
....
DESCRIPTION
       This ioctl(2) operation retrieves physical extent mappings
       for a filesystem.  This information can be used to discover
       which files are mapped to a physical block, examine free
       space, or find known bad blocks, among other things.
.....

I don't really care about "enabling" naive, inefficient
applications. I care about applications that scale to huge
filesystems and can get the job done quickly and efficiently.
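
For reference, walking every physical mapping on a mounted filesystem
is just a couple of ioctls in a loop. Untested sketch, structure and
flag names as per the linux/fsmap.h uapi header, error handling
elided:

#include <sys/ioctl.h>
#include <linux/fsmap.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_RECS	128

/* fd is any open file or directory on the filesystem to be mapped */
static void walk_fsmap(int fd)
{
	struct fsmap_head *head = calloc(1, fsmap_sizeof(NR_RECS));

	head->fmh_count = NR_RECS;
	/* low key all zeroes, high key all ones == whole filesystem */
	head->fmh_keys[1].fmr_device = ~0U;
	head->fmh_keys[1].fmr_flags = ~0U;
	head->fmh_keys[1].fmr_physical = ~0ULL;
	head->fmh_keys[1].fmr_owner = ~0ULL;
	head->fmh_keys[1].fmr_offset = ~0ULL;

	while (ioctl(fd, FS_IOC_GETFSMAP, head) == 0 &&
	       head->fmh_entries > 0) {
		for (unsigned int i = 0; i < head->fmh_entries; i++) {
			struct fsmap *r = &head->fmh_recs[i];

			/* skip free space, metadata and other special owners */
			if (r->fmr_flags & FMR_OF_SPECIAL_OWNER)
				continue;
			printf("dev 0x%x phys %llu len %llu owner %llu offset %llu\n",
			       r->fmr_device,
			       (unsigned long long)r->fmr_physical,
			       (unsigned long long)r->fmr_length,
			       (unsigned long long)r->fmr_owner,
			       (unsigned long long)r->fmr_offset);
		}
		if (head->fmh_recs[head->fmh_entries - 1].fmr_flags &
		    FMR_OF_LAST)
			break;
		/* resume the query after the last record returned */
		fsmap_advance(head);
	}
	free(head);
}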

> > XFS doesn't have partial overlaps, we don't have nodatacow hacks,
> > and the subvol snapshot stuff I'm working on just uses shared data
> > extents so it's 100% compatible with dedupe.
> 
> If you allow this sequence of operations, you get partial overlaps:
> 
> 	dedupe(fd1, 0, fd2, 0, 1048576);
> 
> 	dedupe(fd2, 16384, fd3, 0, 65536);

Oh, I misunderstood - I thought you were referring to sharing partial
filesystem blocks (like at EOF) because that's what this discussion
was originally about. XFS supports the above just fine.

[snip]

tl;dr we don't need a new clone or dedupe API

> For future development I've abandoned the entire dedupe_file_range
> approach.  I need to be able to read and dedupe the data blocks of
> the filesystem directly without having to deal with details like which
> files those blocks belong to, especially on filesystems with lots of
> existing deduped blocks and snapshots. 

IOWs, your desired OOB dedupe algorithm is:

	a) ask the filesystem where all its file data is
	b) read that used space to build a data hash index
	c) on all the data hash collisions find the owners of the
	   colliding blocks
	d) if the block data is the same dedupe it

I agree - that's a simple and effective algorithm. It's also the
obvious solution to an expert in the field.
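
The scan in step b) is nothing more than a sequential pass over the
used space. A toy sketch of that loop follows - the FNV hash and the
block_is_used()/record() callbacks are placeholders, and a real
scanner would keep its index on disk rather than in memory:

#include <stdint.h>
#include <unistd.h>

#define BLKSZ	4096

/* toy 64-bit FNV-1a; it doesn't need to be collision-resistant because
 * the dedupe ioctl verifies the data before sharing anything */
static uint64_t hash_block(const unsigned char *p, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* scan the block device in ascending physical block order */
static void scan_device(int dev_fd, uint64_t nr_blocks,
			int (*block_is_used)(uint64_t blocknr),
			void (*record)(uint64_t hash, uint64_t blocknr))
{
	unsigned char buf[BLKSZ];

	for (uint64_t b = 0; b < nr_blocks; b++) {
		if (!block_is_used(b))	/* used-space map from step a) */
			continue;
		if (pread(dev_fd, buf, BLKSZ, (off_t)(b * BLKSZ)) != BLKSZ)
			break;
		record(hash_block(buf, BLKSZ), b);
	}
}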

> The file structure is frankly
> just noise for dedupe on large filesystems.

We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all
the SGI^WHPE HSM scanning tools, etc all avoid the directory
structure because it's so slow. XFS's bulkstat interface, OTOH, can
scan for target inodes at over a million inodes/sec if you've got
the IO and CPU to throw at it....
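
For the curious, the guts of a bulkstat scan is just this - an
untested sketch against the old V1 bulkstat ioctl as shipped in
xfsprogs' xfs/xfs_fs.h (newer kernels also have XFS_IOC_BULKSTAT with
a different structure layout):

#include <sys/ioctl.h>
#include <xfs/xfs.h>		/* pulls in xfs_fs.h: XFS_IOC_FSBULKSTAT */
#include <stdio.h>

#define NR_INODES	128

/* fs_fd is an open fd on the filesystem's root directory */
static void bulkstat_all(int fs_fd)
{
	struct xfs_bstat buf[NR_INODES];
	__u64 lastino = 0;
	__s32 count = 0;
	struct xfs_fsop_bulkreq req = {
		.lastip  = &lastino,
		.icount  = NR_INODES,
		.ubuffer = buf,
		.ocount  = &count,
	};

	/* the kernel advances lastino for us on each call */
	while (ioctl(fs_fd, XFS_IOC_FSBULKSTAT, &req) == 0 && count > 0)
		for (int i = 0; i < count; i++)
			printf("ino %llu size %lld\n",
			       (unsigned long long)buf[i].bs_ino,
			       (long long)buf[i].bs_size);
}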

> I'm building a translation
> layer for bees that does this--i.e. the main dedupe loop works only with
> raw data blocks, and the translation layer maps read(blocknr, length)
> and dedupe(block1, block2, length) requests onto the existing kernel
> read(fd, offset, length) and dedupe(fd1, off1, fd2, off2, length) calls.

That's FS_IOC_GETFSMAP. :P

FYI, in 2012 I came up with a plan for dedupe in XFS:

	a) GETFSMAP to query reverse map tree to find file
	   data blocks (don't try to dedupe unused space)
	b) read the underlying block device in ascending block
	   order and hash each block to build a collision tree.
	   Trivially parallelised.
	c) GETFSMAP to query rmap for the owners of colliding
	   blocks
	d) For all file data owner maps of a colliding block
		- bulkstat the inode numbers returned by GETFSMAP
		  and open_by_handle to get fd's
		- run dedupe syscall

We've finally got all the infrastructure we need to write this app
for XFS - we just need to write it and integrate it into xfs_fsr....
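
Step d) is where the existing dedupe ioctl slots straight in. Given
fds for two owners of a colliding block (the bulkstat/open_by_handle
plumbing to get those fds is elided here), the call is just this
untested sketch - and because the kernel compares the data itself
before sharing, a false hash collision simply comes back as
FILE_DEDUPE_RANGE_DIFFERS:

#include <sys/ioctl.h>
#include <linux/fs.h>		/* FIDEDUPERANGE */
#include <stdlib.h>

static int dedupe_one(int src_fd, __u64 src_off,
		      int dst_fd, __u64 dst_off, __u64 len)
{
	struct file_dedupe_range *req;
	int ret = -1;

	req = calloc(1, sizeof(*req) +
			sizeof(struct file_dedupe_range_info));
	req->src_offset = src_off;
	req->src_length = len;
	req->dest_count = 1;
	req->info[0].dest_fd = dst_fd;
	req->info[0].dest_offset = dst_off;

	if (ioctl(src_fd, FIDEDUPERANGE, req) == 0 &&
	    req->info[0].status == FILE_DEDUPE_RANGE_SAME &&
	    req->info[0].bytes_deduped == len)
		ret = 0;

	free(req);
	return ret;
}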

> If the end result of that development work meets my performance
> targets--I've built over a dozen dedupe apps, and only two were good
> enough to release so far--then I'd propose moving those primitives into
> the kernel.

dedupe has been around for a long time and there haven't been any
breakthroughs in IO-efficient OOB dedupe scanning algorithms for
years.  OOB dedupe is not actually that difficult to do efficiently,
it's just that most people trying to do it lack the knowledge and/or
insight to make the jump that it requires reverse block mapping, not
directory traversal.

You've taken a dozen or so failures to realise what was obvious to
me at the outset - your translation layer will essentially hold the
same information that XFS already maintains on disk in its rmap
btrees. IIRC btrfs has semi-equivalent reverse mapping functionality
on disk (back pointers?), so maybe your best bet would be to
implement FSMAP for btrfs and then you have a dedupe app that works
efficiently on both XFS and btrfs...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

