* Trying to understand duperemove failure to deduplicate
From: Andy Smith @ 2022-03-09  6:55 UTC
  To: linux-btrfs

Hi,

I was hoping to use duperemove to dedupe a set of large backups on a
btrfs fs.

I did a test run and saw hardly any savings. I expected several
hundred GB to be found; duperemove actually reported about 98GB but
"df" only shows around 30GB. So I looked a bit harder.

FS mount options:

/dev/mapper/backupenc /data/backup btrfs rw,noatime,compress=zstd:15,space_cache,subvolid=5,subvol=/ 0 0

Kernel version 5.10.0-11-amd64, Debian 11.

Take for example these two files:

$ stat daily.{0,1}/cacti/var/lib/debconf_selections 
  File: daily.0/cacti/var/lib/debconf_selections
  Size: 94065           Blocks: 184        IO Block: 4096   regular file
Device: 26h/38d Inode: 136346107   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-01-26 01:45:24.281602018 +0000
Modify: 2019-11-12 08:25:03.528065556 +0000
Change: 2022-03-08 11:28:27.862447446 +0000
 Birth: 2022-03-08 11:28:27.834447672 +0000
  File: daily.1/cacti/var/lib/debconf_selections
  Size: 94065           Blocks: 184        IO Block: 4096   regular file
Device: 26h/38d Inode: 134478113   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-01-26 01:45:24.281602018 +0000
Modify: 2019-11-12 08:25:03.528065556 +0000
Change: 2022-03-07 20:37:22.993579274 +0000
 Birth: 2022-03-07 20:37:22.993579274 +0000

They have identical content:

$ md5sum daily.{0,1}/cacti/var/lib/debconf_selections 
c5633915f9d847394a6640c77c55f83a  daily.0/cacti/var/lib/debconf_selections
c5633915f9d847394a6640c77c55f83a  daily.1/cacti/var/lib/debconf_selections

They don't currently share extents:

$ filefrag -v daily.[01]/cacti/var/lib/debconf_selections
Filesystem type is: 9123683e
File size of daily.0/cacti/var/lib/debconf_selections is 94065 (23 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      22:  374427125.. 374427147:     23:             last,encoded,eof
daily.0/cacti/var/lib/debconf_selections: 1 extent found
File size of daily.1/cacti/var/lib/debconf_selections is 94065 (23 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      19:     399511..    399530:     20:             encoded,shared
   1:       20..      22:     306304..    306306:      3:     399531: last,encoded,shared,eof
daily.1/cacti/var/lib/debconf_selections: 2 extents found

So I would expect if I ran duperemove on these two files it would
work out that these 3 extents could be replaced by 1 or 2. But:

$ sudo /usr/local/sbin/duperemove -b 4096 -drhv daily.{0,1}/cacti/var/lib/debconf_selections
Increased open file limit from 1024 to 1048576.
Using 4K blocks
Using hash: murmur3
Using extent-based hashing
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /data/backup/daily.0/cacti/var/lib/debconf_selections
[2/2] (100.00%) csum: /data/backup/daily.1/cacti/var/lib/debconf_selections
Total files:  2
Total extent hashes: 3
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.

Am I misunderstanding something about how dedupe works in btrfs, or
duperemove itself?

Is it because this filesystem has compression enabled? Though after
reading the earlier really useful reply from Zygo about dedupe and
compression I had thought this wasn't going to be much of an issue
with duperemove.

I haven't yet tried bees to see if it sees things differently.

Thanks,
Andy


* Re: Trying to understand duperemove failure to deduplicate
From: Nikolay Borisov @ 2022-03-09  7:58 UTC
  To: Andy Smith, linux-btrfs



On 9.03.22 at 8:55, Andy Smith wrote:
> Hi,
> 
> I was hoping to use duperemove to dedupe a set of large backups on a
> btrfs fs.
> 
> I did a test run and saw hardly any savings. I expected several
> hundred GB to be found; duperemove actually reported about 98GB but
> "df" only shows around 30GB. So I looked a bit harder.
> 
> FS mount options:
> 
> /dev/mapper/backupenc /data/backup btrfs rw,noatime,compress=zstd:15,space_cache,subvolid=5,subvol=/ 0 0
> 
> Kernel version 5.10.0-11-amd64, Debian 11.
> 
> Take for example these two files:
> 
> $ stat daily.{0,1}/cacti/var/lib/debconf_selections
>    File: daily.0/cacti/var/lib/debconf_selections
>    Size: 94065           Blocks: 184        IO Block: 4096   regular file
> Device: 26h/38d Inode: 136346107   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-01-26 01:45:24.281602018 +0000
> Modify: 2019-11-12 08:25:03.528065556 +0000
> Change: 2022-03-08 11:28:27.862447446 +0000
>   Birth: 2022-03-08 11:28:27.834447672 +0000
>    File: daily.1/cacti/var/lib/debconf_selections
>    Size: 94065           Blocks: 184        IO Block: 4096   regular file
> Device: 26h/38d Inode: 134478113   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-01-26 01:45:24.281602018 +0000
> Modify: 2019-11-12 08:25:03.528065556 +0000
> Change: 2022-03-07 20:37:22.993579274 +0000
>   Birth: 2022-03-07 20:37:22.993579274 +0000
> 
> They have identical content:
> 
> $ md5sum daily.{0,1}/cacti/var/lib/debconf_selections
> c5633915f9d847394a6640c77c55f83a  daily.0/cacti/var/lib/debconf_selections
> c5633915f9d847394a6640c77c55f83a  daily.1/cacti/var/lib/debconf_selections
> 
> They don't currently share extents:
> 
> $ filefrag -v daily.[01]/cacti/var/lib/debconf_selections
> Filesystem type is: 9123683e
> File size of daily.0/cacti/var/lib/debconf_selections is 94065 (23 blocks of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected: flags:
>     0:        0..      22:  374427125.. 374427147:     23:             last,encoded,eof
> daily.0/cacti/var/lib/debconf_selections: 1 extent found
> File size of daily.1/cacti/var/lib/debconf_selections is 94065 (23 blocks of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected: flags:
>     0:        0..      19:     399511..    399530:     20:             encoded,shared
>     1:       20..      22:     306304..    306306:      3:     399531: last,encoded,shared,eof
> daily.1/cacti/var/lib/debconf_selections: 2 extents found

The problem is in duperemove, not btrfs. In its default mode of
operation duperemove works on whole extents. Those 2 files have
identical content, but their extent layout differs: 1 extent vs 2.
Unfortunately duperemove is not able to cope with this. If you want
to dedupe those files you should use the block-based dedupe mode.
This is explained in the FAQ in duperemove's man page:


.SS I got two identical files, why are they not deduped?

Duperemove by default works on extent granularity. What this means is if
there are two files which are logically identical (have the same content)
but are laid out on disk with different extent structure they won't be
deduped. For example if 2 files are 128k each and their content are
identical but one of them consists of a single 128k extent and the other
of 2 x 64k extents then they won't be deduped. This behavior is dependent
on the current implementation and is subject to change as duperemove is
being improved.
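
To make the FAQ entry concrete, here is a rough Python model of the two hashing strategies. It is only a sketch and not duperemove's code: SHA-256 stands in for murmur3, the file content is synthetic, and the extent layouts are hypothetical values chosen to mirror the filefrag output quoted above (one extent of 23 blocks versus extents of 20 and 3 blocks).

```python
import hashlib

BLOCK_SIZE = 4096  # matches the "-b 4096" used in the thread


def digest(chunk: bytes) -> str:
    # SHA-256 is a stand-in here; duperemove reported using murmur3.
    return hashlib.sha256(chunk).hexdigest()


def extent_hashes(data: bytes, extent_lengths: list[int]) -> set[str]:
    """Hash each extent as one unit (a model of extent-based hashing)."""
    hashes, offset = set(), 0
    for length in extent_lengths:
        hashes.add(digest(data[offset:offset + length]))
        offset += length
    return hashes


def block_hashes(data: bytes) -> list[str]:
    """Hash every fixed-size block (a model of block-based hashing)."""
    return [digest(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]


# Synthetic stand-in for the 94065-byte file; both copies are identical.
content = bytes(i % 251 for i in range(94065))

# Hypothetical extent layouts (in bytes) mirroring the filefrag output:
# one copy is a single extent, the other is 20 blocks plus the remainder.
layout_a = [len(content)]
layout_b = [20 * BLOCK_SIZE, len(content) - 20 * BLOCK_SIZE]

shared_extents = extent_hashes(content, layout_a) & extent_hashes(content, layout_b)
blocks_a, blocks_b = block_hashes(content), block_hashes(content)

print(len(shared_extents))            # extent hashes the two layouts have in common
print(len(blocks_a) + len(blocks_b))  # total number of block hashes
print(blocks_a == blocks_b)           # whether every block hash matches
```

In this model the two layouts produce three extent hashes with none in common, while the per-block hashes line up one for one. That is the same shape as the "Found 0 identical extents" result reported for the extent-based run.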

<snip>


> Thanks,
> Andy
> 


* Re: Trying to understand duperemove failure to deduplicate
From: Andy Smith @ 2022-03-09  8:23 UTC
  To: Nikolay Borisov; +Cc: linux-btrfs

Hi Nikolay,

On Wed, Mar 09, 2022 at 09:58:27AM +0200, Nikolay Borisov wrote:
> The problem is in duperemove, not btrfs. In its default mode of
> operation duperemove works on whole extents. Those 2 files have
> identical content, but their extent layout differs: 1 extent vs 2.

Ah okay, thanks.

> Unfortunately duperemove is not able to cope with this. If you want to
> dedupe those files you should use the block-based dedupe mode.

Is that a mode of duperemove or did you mean to use a different tool?

I saw duperemove's "--lookup-extents" option and tried with that:

$ sudo /usr/local/sbin/duperemove -b 4096 -drhv --lookup-extents=no test/daily.{0,1}/cacti/var/lib/debconf_selections
Increased open file limit from 1024 to 1048576.
Using 4K blocks
Using hash: murmur3
Using block-based hashing
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /data/backup/test/daily.0/cacti/var/lib/debconf_selections
[2/2] (100.00%) csum: /data/backup/test/daily.1/cacti/var/lib/debconf_selections
Total files:  2
Total extent hashes: 46
Loading only duplicated hashes from hashfile.
Hashing completed. Using 4 threads to calculate duplicate extents. This may take some time.
Process 2 files.
Compare files "/data/backup/test/daily.1/cacti/var/lib/debconf_selections" and "/data/backup/test/daily.0/cacti/var/lib/debconf_selections"
Compare files "/data/backup/test/daily.1/cacti/var/lib/debconf_selections" and "/data/backup/test/daily.0/cacti/var/lib/debconf_selections"
Process 2 files.
Removing overlapping extents
Simple read and compare of file data found 1 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 91.9KB with id ab62a5e4
Start           Filename
0.0B    "/data/backup/test/daily.1/cacti/var/lib/debconf_selections"
0.0B    "/data/backup/test/daily.0/cacti/var/lib/debconf_selections"
Using 8 threads for dedupe phase
[0x5597f1e7b1e0] (1/1) Try to dedupe extents with id ab62a5e4
[0x5597f1e7b1e0] Add extent for file "/data/backup/test/daily.1/cacti/var/lib/debconf_selections" at offset 0.0B (3)
[0x5597f1e7b1e0] Add extent for file "/data/backup/test/daily.0/cacti/var/lib/debconf_selections" at offset 0.0B (4)
[0x5597f1e7b1e0] Dedupe 1 extents (id: ab62a5e4) with target: (0.0B, 91.9KB), "/data/backup/test/daily.1/cacti/var/lib/debconf_selections"
Kernel processed data (excludes target files): 91.9KB
Comparison of extent info shows a net change in shared extents of: 207.7KB

This does seem to have now resulted in a deduplication:

$ filefrag -v test/daily.{0,1}/cacti/var/lib/debconf_selections
Filesystem type is: 9123683e
File size of test/daily.0/cacti/var/lib/debconf_selections is 94065 (23 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      19:     399511..    399530:     20:             encoded,shared
   1:       20..      22:     306304..    306306:      3:     399531: last,encoded,shared,eof
test/daily.0/cacti/var/lib/debconf_selections: 2 extents found
File size of test/daily.1/cacti/var/lib/debconf_selections is 94065 (23 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      19:     399511..    399530:     20:             encoded,shared
   1:       20..      22:     306304..    306306:      3:     399531: last,encoded,shared,eof
test/daily.1/cacti/var/lib/debconf_selections: 2 extents found

So now I've got 2 extents total instead of 3, and that seemed to
work, but perhaps there is a better tool.
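
For anyone wanting to script the same before/after check, here is a small Python sketch that pulls the physical ranges out of `filefrag -v` output and compares two files. The function names are made up for illustration, the sample text is copied from the output in this thread, and the regex assumes the column layout shown here. Treat it as a rough check only: for compressed ("encoded") extents the physical ranges filefrag prints may not reflect the real on-disk size.

```python
import re

# Matches an extent row such as:
#    0:        0..      19:     399511..    399530:     20:   encoded,shared
# and captures only the physical start and end block numbers.
EXTENT_RE = re.compile(r"^\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*\d+:")


def physical_ranges(filefrag_text: str) -> list[tuple[int, int]]:
    """Return (physical_start, physical_end) for each extent row."""
    ranges = []
    for line in filefrag_text.splitlines():
        match = EXTENT_RE.match(line)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges


def same_layout(text_a: str, text_b: str) -> bool:
    """True if both files map to exactly the same physical ranges."""
    return physical_ranges(text_a) == physical_ranges(text_b)


HEADER = " ext:     logical_offset:        physical_offset: length:   expected: flags:\n"

# daily.0 before the dedupe run (single extent).
single_extent = HEADER + (
    "   0:        0..      22:  374427125.. 374427147:     23:             last,encoded,eof\n"
)

# daily.1 before the run, and both files after it (two shared extents).
two_extents = HEADER + (
    "   0:        0..      19:     399511..    399530:     20:             encoded,shared\n"
    "   1:       20..      22:     306304..    306306:      3:     399531: last,encoded,shared,eof\n"
)

print(same_layout(single_extent, two_extents))  # layouts before the block-based run
print(same_layout(two_extents, two_extents))    # layouts after the block-based run
```

Comparing the full list of ranges is stricter than necessary, since two files can share some extents without sharing all of them, but it is enough to confirm the result shown above.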

> This is explained in duperemove's FAQ in the man page:
> 
> 
> .SS I got two identical files, why are they not deduped?

Ah right, I had missed that - the man page on duperemove's web site
is out of date but I see it in the locally-installed copy.

Thanks,
Andy


* Re: Trying to understand duperemove failure to deduplicate
From: Nikolay Borisov @ 2022-03-09  8:26 UTC
  To: Andy Smith; +Cc: linux-btrfs



On 9.03.22 at 10:23, Andy Smith wrote:
> Hi Nikolay,
> 
> On Wed, Mar 09, 2022 at 09:58:27AM +0200, Nikolay Borisov wrote:
>> The problem is in duperemove, not btrfs. In its default mode of
>> operation duperemove works on whole extents. Those 2 files have
>> identical content, but their extent layout differs: 1 extent vs 2.
> 
> Ah okay, thanks.
> 
>> Unfortunately duperemove is not able to cope with this. If you want to
>> dedupe those files you should use the block-based dedupe mode.
> 
> Is that a mode of duperemove or did you mean to use a different tool?
> 
> I saw duperemove's "--lookup-extents" option and tried with that:

Yes, --lookup-extents=no makes duperemove use its block-based dedupe
mode of operation.

<snip>
