* sub-file dedup
@ 2019-05-30  1:13 Newbugreport
  2019-05-30 11:36 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 2+ messages in thread
From: Newbugreport @ 2019-05-30  1:13 UTC (permalink / raw)
  To: linux-btrfs

I'm experimenting with the rsync algorithm for btrfs deduplication. Every other deduplication tool I've seen works against whole files. I'm concerned about deduping chunks under 4k and about files with scattered extents.
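
For context, the core of what I'm experimenting with is the rsync-style
weak rolling checksum, roughly the sketch below (the window size and
constants here are arbitrary choices for illustration, not anything
btrfs-specific):

/* rsync-style weak rolling checksum over a sliding window.  Sliding
 * the window by one byte is O(1), which is what makes scanning a
 * whole file for shifted duplicate blocks affordable.  Candidate
 * matches still need to be confirmed with a strong hash. */
#include <stdint.h>
#include <stddef.h>

#define WINDOW 4096        /* candidate dedup block size (arbitrary) */

struct rollsum {
    uint32_t a;            /* plain sum of bytes in the window      */
    uint32_t b;            /* position-weighted sum of bytes        */
};

void rollsum_init(struct rollsum *rs, const uint8_t *buf)
{
    rs->a = rs->b = 0;
    for (size_t i = 0; i < WINDOW; i++) {
        rs->a += buf[i];
        rs->b += (uint32_t)(WINDOW - i) * buf[i];
    }
}

/* Slide the window one byte: drop 'out', take in 'in'. */
void rollsum_rotate(struct rollsum *rs, uint8_t out, uint8_t in)
{
    rs->a += in - out;
    rs->b += rs->a - (uint32_t)WINDOW * out;
}

/* 32-bit digest: low 16 bits of each running sum, as in rsync's
 * weak checksum. */
uint32_t rollsum_digest(const struct rollsum *rs)
{
    return (rs->a & 0xffff) | ((rs->b & 0xffff) << 16);
}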

Are there best practices for deduplication on btrfs?


* Re: sub-file dedup
  2019-05-30  1:13 sub-file dedup Newbugreport
@ 2019-05-30 11:36 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 2+ messages in thread
From: Austin S. Hemmelgarn @ 2019-05-30 11:36 UTC (permalink / raw)
  To: Newbugreport, linux-btrfs

On 2019-05-29 21:13, Newbugreport wrote:
> I'm experimenting with the rsync algorithm for btrfs deduplication. Every other deduplication tool I've seen works against whole files. I'm concerned about deduping chunks under 4k and about files with scattered extents.
AFAIK, regions smaller than the FS block size cannot be deduplicated, so 
you're limited to 4k in most cases, possibly larger on some systems.

Also, pretty much every tool I know of for deduplication on BTRFS 
operates not on whole files, but on blocks.  duperemove, for example, 
scans whole files at whatever chunk size you tell it to, figures out 
duplicated extents (that is, runs of sequential duplicate chunks), and 
then passes the resultant extents to the dedupe ioctl.  That approach 
even works to deduplicate data _within_ files.
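
To make that concrete, here's a minimal sketch of what a single dedupe
request looks like at the ioctl level.  The file names, offsets, and
the 128k length are made up for illustration, and error handling is
mostly trimmed:

/* Ask the kernel to dedupe one extent of 'a.img' against 'b.img'.
 * Needs Linux 4.5+ UAPI headers for FIDEDUPERANGE.  The kernel
 * verifies the ranges byte-for-byte itself and reports
 * FILE_DEDUPE_RANGE_DIFFERS if they don't actually match. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
    int src = open("a.img", O_RDONLY);   /* hypothetical file names */
    int dst = open("b.img", O_RDWR);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    struct file_dedupe_range *req =
        calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
    req->src_offset = 0;
    req->src_length = 128 * 1024;        /* one 128k chunk */
    req->dest_count = 1;
    req->info[0].dest_fd = dst;
    req->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, req) < 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }

    if (req->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n",
               (unsigned long long)req->info[0].bytes_deduped);
    else
        printf("ranges differ or dedupe failed (status %d)\n",
               req->info[0].status);

    free(req);
    return 0;
}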
> 
> Are there best practices for deduplication on btrfs?
General thoughts:

* Minimize the number of calls you make to the actual dedupe ioctl as 
much as possible: it does a bytewise comparison of all the regions 
passed in, and it has to freeze I/O to the files those regions belong 
to until it's done, so it's expensive in both time and processing 
power and can slow down the whole filesystem (there's a rough sketch 
of a batched call after this list).  The clone ioctl can be used 
instead (and is far faster), but it runs the risk of data loss if the 
files are in active use.

* A custom script or tool that actually understands the structure of 
your data will almost always find duplicate regions better and run 
much faster than one of the generic tools.  For example, I've got a 
couple of directories on one of my systems where, if two files have 
the same name and relative path under those directories, they 
_should_ be identical, so all I need to deduplicate that data is a 
simple path-matching tool that passes whole files to the dedupe ioctl.

* Deduplicating really small blocks within files is almost never worth 
it (which is part of why most dedupe tools default to operating on 
chunks of 128k or larger) because:
     - A single reflink for a 4k block actually takes up at least the 
same amount of space as just having the block there, and it might take 
up more depending on how the extents are split (if the existing 4k block 
is part of an extent, then it may not be freed when you replace it with 
the reflink).
     - Having huge numbers of reflinks can actually negatively impact 
filesystem performance.  Even ignoring the potential issues with stuff 
like qgroups, the fragmentation introduced by using lots of reflinks 
increases the overhead of reading files by a non-negligible amount (even 
on SSDs).
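
To illustrate the first point about minimizing ioctl calls: a single 
FIDEDUPERANGE request can carry several destination ranges, so N 
duplicates of one chunk cost one call (and one I/O freeze) instead of 
N.  A rough sketch, with made-up file names and an arbitrary 128k 
chunk size:

/* Dedupe the same 128k chunk of 'master' against several duplicate
 * files in one FIDEDUPERANGE call instead of one call per copy. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define NDEST 3   /* arbitrary: number of duplicate copies found */

int main(void)
{
    const char *dups[NDEST] = { "copy1", "copy2", "copy3" };
    int src = open("master", O_RDONLY);
    if (src < 0) { perror("master"); return 1; }

    struct file_dedupe_range *req =
        calloc(1, sizeof(*req) +
                  NDEST * sizeof(struct file_dedupe_range_info));
    req->src_offset = 0;
    req->src_length = 128 * 1024;
    req->dest_count = NDEST;

    for (int i = 0; i < NDEST; i++) {
        int fd = open(dups[i], O_RDWR);
        if (fd < 0) { perror(dups[i]); return 1; }
        req->info[i].dest_fd = fd;
        req->info[i].dest_offset = 0;
    }

    if (ioctl(src, FIDEDUPERANGE, req) < 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }

    for (int i = 0; i < NDEST; i++)
        printf("%s: status %d, deduped %llu bytes\n", dups[i],
               req->info[i].status,
               (unsigned long long)req->info[i].bytes_deduped);

    free(req);
    return 0;
}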

