* sub-file dedup
@ 2019-05-30 1:13 Newbugreport
From: Newbugreport @ 2019-05-30 1:13 UTC (permalink / raw)
To: linux-btrfs
I'm experimenting with the rsync algorithm for btrfs deduplication. Every other deduplication tool I've seen works against whole files. I'm concerned about deduping chunks under 4k and about files with scattered extents.
Are there best practices for deduplication on btrfs?
* Re: sub-file dedup
From: Austin S. Hemmelgarn @ 2019-05-30 11:36 UTC (permalink / raw)
To: Newbugreport, linux-btrfs
On 2019-05-29 21:13, Newbugreport wrote:
> I'm experimenting with the rsync algorithm for btrfs deduplication. Every other deduplication tool I've seen works against whole files. I'm concerned about deduping chunks under 4k and about files with scattered extents.
AFAIK, regions smaller than the FS block size cannot be deduplicated, so
you're limited to 4k in most cases, possibly larger on some systems.
Also, pretty much every tool I know of for deduplication on BTRFS
operates not on whole files, but on blocks. duperemove, for example,
scans whole files at whatever chunk size you tell it to, figures out
duplicated extents (that is, runs of sequential duplicate chunks), and
then passes the resultant extents to the dedupe ioctl. That approach
even works to deduplicate data _within_ files.
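That scan-then-merge approach can be sketched roughly as follows. This is an illustration of the idea, not duperemove's actual code; the chunk size and hash choice are arbitrary, and `find_duplicate_extents` is a hypothetical helper name:

```python
import hashlib
from collections import defaultdict

CHUNK = 128 * 1024  # duperemove's default block size

def chunk_hashes(path):
    """Yield (offset, digest) for each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            yield offset, hashlib.sha256(data).digest()
            offset += len(data)

def find_duplicate_extents(src, dst):
    """Return (src_off, dst_off, length) extents of identical chunks
    between two files, merging runs of sequential duplicate chunks.
    Assumes full-sized chunks; a real tool also tracks short tails."""
    dst_index = defaultdict(list)
    for off, digest in chunk_hashes(dst):
        dst_index[digest].append(off)
    matches = sorted(
        (s_off, d_off)
        for s_off, digest in chunk_hashes(src)
        for d_off in dst_index.get(digest, ())
    )
    extents = []
    for s_off, d_off in matches:
        last = extents[-1] if extents else None
        if last and last[0] + last[2] == s_off and last[1] + last[2] == d_off:
            last[2] += CHUNK  # extend the current run of duplicate chunks
        else:
            extents.append([s_off, d_off, CHUNK])
    return [tuple(e) for e in extents]
```

Each resulting extent is what a tool like duperemove would then hand to the dedupe ioctl in one go, rather than issuing one call per 128k chunk.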
>
> Are there best practices for deduplication on btrfs?
General thoughts:
* Minimize the number of calls you make to the actual dedupe ioctl.
It does a bytewise comparison of all the regions passed in and has to
freeze I/O to the files those regions belong to until it's done, so it's
expensive in both time and processing power, and it can slow down the
filesystem. The clone ioctl can be used instead (and is far faster),
but it skips the comparison, so it risks data loss if the files are in
active use.
* Writing a custom script or tool that actually understands the
structure of your data will almost always get you better deduplication
results, and run much faster, than one of the generic tools. For
example, I've got a couple of directories on one of my systems where,
if two files have the same name and relative path under those
directories, they _should_ be identical, so all I need to deduplicate
that data is a simple path-matching tool that passes whole files to
the dedupe ioctl.
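That kind of path-matching tool amounts to a directory walk. A minimal sketch (`matching_pairs` is an illustrative name, and the size check is just a cheap pre-filter; a real tool would hand each pair to the dedupe ioctl, which re-verifies the contents anyway):

```python
import os

def matching_pairs(dir_a, dir_b):
    """Yield (path_a, path_b) for files that share the same relative
    path under both directories; in the scheme described above, each
    pair is a whole-file dedupe candidate."""
    for root, _dirs, files in os.walk(dir_a):
        rel = os.path.relpath(root, dir_a)
        for name in files:
            a = os.path.join(root, name)
            b = os.path.join(dir_b, rel, name)
            # Same relative path and same size: worth passing to the
            # dedupe ioctl (which does its own bytewise comparison).
            if os.path.isfile(b) and os.path.getsize(a) == os.path.getsize(b):
                yield a, b
```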
* Deduplicating really small blocks within files is almost never worth
it (which is part of why most dedupe tools default to operating on
chunks of 128k or larger) because:
- A single reflink for a 4k block actually takes up at least the
same amount of space as just having the block there, and it might take
up more depending on how the extents are split (if the existing 4k block
is part of an extent, then it may not be freed when you replace it with
the reflink).
  - Having huge numbers of reflinks can actually negatively impact
filesystem performance. Even ignoring the potential issues with stuff
like qgroups, the fragmentation introduced by using lots of reflinks
increases the overhead of reading files by a non-negligible amount (even
on SSDs).
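On the first point above about minimizing ioctl calls: FIDEDUPERANGE already supports batching, since one source range can be checked against many destination ranges in a single call. A rough Python sketch of the argument layout (struct definitions from the kernel's include/uapi/linux/fs.h; `dedupe_range` is an illustrative helper, not an existing API, and the call only succeeds on filesystems that implement dedupe, so there's no error handling here):

```python
import fcntl
import struct

# struct file_dedupe_range header (include/uapi/linux/fs.h):
#   __u64 src_offset; __u64 src_length;
#   __u16 dest_count; __u16 reserved1; __u32 reserved2;
DEDUPE_HDR = struct.Struct("=QQHHI")   # 24 bytes
# struct file_dedupe_range_info (one per destination):
#   __s64 dest_fd; __u64 dest_offset;
#   __u64 bytes_deduped; __s32 status; __u32 reserved;
DEDUPE_INFO = struct.Struct("=qQQiI")  # 32 bytes
# FIDEDUPERANGE = _IOWR(0x94, 54, struct file_dedupe_range)
FIDEDUPERANGE = 0xC0189436

def dedupe_range(src_fd, src_off, length, dests):
    """Deduplicate one source range against many destination ranges
    in a single ioctl; dests is a list of (dest_fd, dest_offset)."""
    buf = bytearray(DEDUPE_HDR.pack(src_off, length, len(dests), 0, 0))
    for dest_fd, dest_off in dests:
        buf += DEDUPE_INFO.pack(dest_fd, dest_off, 0, 0, 0)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
    # The kernel writes bytes_deduped and status back for each dest.
    results = []
    for i in range(len(dests)):
        off = DEDUPE_HDR.size + i * DEDUPE_INFO.size
        _fd, _off, deduped, status, _r = DEDUPE_INFO.unpack_from(buf, off)
        results.append((deduped, status))
    return results
```

Packing many destinations per call amortizes the bytewise comparison of the source range, which is exactly why fewer, larger calls beat many small ones.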