* dduper - Offline btrfs deduplication tool
@ 2018-08-24  4:31 Lakshmipathi.G
  2018-09-05 16:00 ` Timofey Titovets
  0 siblings, 1 reply; 6+ messages in thread
From: Lakshmipathi.G @ 2018-08-24  4:31 UTC
  To: linux-btrfs

Hi -

dduper is an offline dedupe tool. Instead of reading whole file blocks and
computing checksums, it fetches the existing checksums from the BTRFS csum
tree. This greatly improves performance.

dduper works as follows:
	- Read the csums for the two given files.
	- Find the matching locations.
	- Pass those locations directly to ioctl_ficlonerange
	  instead of ioctl_fideduperange.
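
The "find matching location" step can be sketched as below. This is only an
illustration, not dduper's actual code: it assumes the per-block checksums of
both files are already available as Python lists (for example, parsed from the
dump-csum output, whose format is not assumed here), and the names
`matching_ranges` and `BLOCK_SIZE` are hypothetical.

```python
BLOCK_SIZE = 4096  # assumption: one checksum per 4 KiB block


def matching_ranges(src_csums, dst_csums, block_size=BLOCK_SIZE):
    """Return (src_offset, dst_offset, length) byte ranges whose checksums match.

    Adjacent matching blocks are merged into a single longer range.
    """
    # Remember the first source block that carries each checksum value.
    first_seen = {}
    for index, csum in enumerate(src_csums):
        first_seen.setdefault(csum, index)

    ranges = []
    for dst_index, csum in enumerate(dst_csums):
        src_index = first_seen.get(csum)
        if src_index is None:
            continue
        src_off = src_index * block_size
        dst_off = dst_index * block_size
        if ranges:
            last_src, last_dst, last_len = ranges[-1]
            # Extend the previous range when both sides continue contiguously.
            if last_src + last_len == src_off and last_dst + last_len == dst_off:
                ranges[-1] = (last_src, last_dst, last_len + block_size)
                continue
        ranges.append((src_off, dst_off, block_size))
    return ranges
```

A matching checksum only marks a candidate range; whether the data is really
identical still has to be verified before or during the dedupe call.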

By default, dduper adds a safety check to the above steps: it creates a
backup reflink file and compares md5sums after the dedupe.
If the backup file matches the new deduped file, the backup file is
removed. You can skip this check by passing the --skip option. Here is
sample CLI usage [1] and a quick demo [2].
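
The comparison part of that safety check could look roughly like the sketch
below. It is a minimal illustration, not the tool's own code: the function
names `file_md5` and `dedupe_preserved_content` are hypothetical, and creating
the reflinked backup itself is left out.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_preserved_content(backup_path, deduped_path):
    """True when the deduped file still hashes the same as its backup copy."""
    return file_md5(backup_path) == file_md5(deduped_path)
```

If the function returns True the backup can be removed; otherwise the backup
is the copy to restore from.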

Some performance numbers (with the --skip option):

Dedupe two 1GB files with the same content - 1.2 seconds
Dedupe two 5GB files with the same content - 8.2 seconds
Dedupe two 10GB files with the same content - 13.8 seconds

dduper requires the `btrfs inspect-internal dump-csum` command. You can use
this branch [3] or apply the patch yourself [4].

[1] https://gitlab.collabora.com/laks/btrfs-progs/blob/dump_csum/Documentation/dduper_usage.md
[2] http://giis.co.in/btrfs_dedupe.gif
[3] git clone https://gitlab.collabora.com/laks/btrfs-progs.git -b  dump_csum
[4] https://patchwork.kernel.org/patch/10540229/ 

Please remember this is version 0.1, so test it out first if you plan to use dduper on real data.
Let me know if you have suggestions, feedback, or bug reports :)

Cheers.
Lakshmipathi.G

* Re: dduper - Offline btrfs deduplication tool
  2018-08-24  4:31 dduper - Offline btrfs deduplication tool Lakshmipathi.G
@ 2018-09-05 16:00 ` Timofey Titovets
  2018-09-07  3:57   ` Lakshmipathi.G
  0 siblings, 1 reply; 6+ messages in thread
From: Timofey Titovets @ 2018-09-05 16:00 UTC
  To: lakshmipathi.g; +Cc: linux-btrfs

On Fri, 24 Aug 2018 at 7:41, Lakshmipathi.G <lakshmipathi.g@giis.co.in> wrote:
>
> Hi -
>
> dduper is an offline dedupe tool. Instead of reading whole file blocks and
> computing checksums, it fetches the existing checksums from the BTRFS csum
> tree. This greatly improves performance.
>
> dduper works as follows:
>         - Read the csums for the two given files.
>         - Find the matching locations.
>         - Pass those locations directly to ioctl_ficlonerange
>           instead of ioctl_fideduperange.
>
> By default, dduper adds a safety check to the above steps: it creates a
> backup reflink file and compares md5sums after the dedupe.
> If the backup file matches the new deduped file, the backup file is
> removed. You can skip this check by passing the --skip option. Here is
> sample CLI usage [1] and a quick demo [2].
>
> Some performance numbers (with the --skip option):
>
> Dedupe two 1GB files with the same content - 1.2 seconds
> Dedupe two 5GB files with the same content - 8.2 seconds
> Dedupe two 10GB files with the same content - 13.8 seconds
>
> dduper requires the `btrfs inspect-internal dump-csum` command. You can use
> this branch [3] or apply the patch yourself [4].
>
> [1] https://gitlab.collabora.com/laks/btrfs-progs/blob/dump_csum/Documentation/dduper_usage.md
> [2] http://giis.co.in/btrfs_dedupe.gif
> [3] git clone https://gitlab.collabora.com/laks/btrfs-progs.git -b  dump_csum
> [4] https://patchwork.kernel.org/patch/10540229/
>
> Please remember this is version 0.1, so test it out first if you plan to use dduper on real data.
> Let me know if you have suggestions, feedback, or bug reports :)
>
> Cheers.
> Lakshmipathi.G
>

One question:
Why not ioctl_fideduperange?
Without it, you lose the main benefit of that ioctl: atomicity.


-- 
Have a nice day,
Timofey.

* Re: dduper - Offline btrfs deduplication tool
  2018-09-05 16:00 ` Timofey Titovets
@ 2018-09-07  3:57   ` Lakshmipathi.G
  2018-09-07 14:31     ` Adam Borowski
  2018-09-07 23:32     ` Zygo Blaxell
  0 siblings, 2 replies; 6+ messages in thread
From: Lakshmipathi.G @ 2018-09-07  3:57 UTC
  To: Timofey Titovets; +Cc: linux-btrfs

> 
> One question:
> Why not ioctl_fideduperange?
> i.e. you kill most of benefits from that ioctl - atomicity.
> 
I plan to add fideduperange as an option too. Users will be able to
choose between the fideduperange and ficlonerange calls.

If I'm not wrong, with fideduperange the kernel performs a
comparison check before the dedupe, which increases the
time needed to dedupe files.

I believe the risk involved with ficlonerange is minimized
by having a backup (reflinked) file. We can revert to the
original file if we encounter problems.

> 
> -- 
> Have a nice day,
> Timofey.

Cheers.
Lakshmipathi.G

* Re: dduper - Offline btrfs deduplication tool
  2018-09-07  3:57   ` Lakshmipathi.G
@ 2018-09-07 14:31     ` Adam Borowski
  2018-10-02 16:05       ` Lakshmipathi.G
  2018-09-07 23:32     ` Zygo Blaxell
  1 sibling, 1 reply; 6+ messages in thread
From: Adam Borowski @ 2018-09-07 14:31 UTC
  To: Lakshmipathi.G; +Cc: Timofey Titovets, linux-btrfs

On Fri, Sep 07, 2018 at 09:27:28AM +0530, Lakshmipathi.G wrote:
> > One question:
> > Why not ioctl_fideduperange?
> > i.e. you kill most of benefits from that ioctl - atomicity.
> > 
> I plan to add fideduperange as an option too. User can
> choose between fideduperange and ficlonerange call.
> 
> If I'm not wrong, with fideduperange, kernel performs
> comparison check before dedupe. And it will increase
> time to dedupe files.

You already read the files to md5sum them, so you have no speed gain.
You get nasty data-losing races, and risk collisions as well.  md5sum is
safe against random occurrences (compare it, e.g., to the chance of lightning
hitting you today), but is exploitable by a hostile user.  On the other
hand, full bit-to-bit comparison is faster and 100% safe.

You can't skip verification -- the checksums are only 32-bit.  Two different
blocks have a 1:4G chance of sharing a checksum, which means you can expect
one false positive with 64K extents, rising quadratically as the number of
files grows.
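
As a rough check of that estimate, the expected number of false matches can be
computed with the usual birthday approximation. The sketch below assumes
uniformly distributed 32-bit checksums and counts every pair of blocks once;
the function name `expected_false_matches` is made up for illustration.

```python
def expected_false_matches(num_blocks, csum_bits=32):
    """Expected number of block pairs that share a checksum by chance.

    Birthday approximation: each of the n*(n-1)/2 pairs collides with
    probability 1 / 2**csum_bits, assuming uniformly random checksums.
    """
    pairs = num_blocks * (num_blocks - 1) // 2
    return pairs / 2 ** csum_bits


at_64k = expected_false_matches(64 * 1024)
at_128k = expected_false_matches(128 * 1024)
print(f"64K blocks:  {at_64k:.3f} expected false matches")
print(f"128K blocks: {at_128k:.3f} expected false matches")
```

Under these assumptions 64K blocks give about 0.5 expected false matches, the
same order of magnitude as the figure above, and doubling the block count
roughly quadruples the expectation, which is the quadratic growth mentioned.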


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄⠀⠀⠀⠀ preimage for double rot13!

* Re: dduper - Offline btrfs deduplication tool
  2018-09-07  3:57   ` Lakshmipathi.G
  2018-09-07 14:31     ` Adam Borowski
@ 2018-09-07 23:32     ` Zygo Blaxell
  1 sibling, 0 replies; 6+ messages in thread
From: Zygo Blaxell @ 2018-09-07 23:32 UTC
  To: Lakshmipathi.G; +Cc: Timofey Titovets, linux-btrfs

On Fri, Sep 07, 2018 at 09:27:28AM +0530, Lakshmipathi.G wrote:
> > 
> > One question:
> > Why not ioctl_fideduperange?
> > i.e. you kill most of benefits from that ioctl - atomicity.
> > 
> I plan to add fideduperange as an option too. User can
> choose between fideduperange and ficlonerange call.
> 
> If I'm not wrong, with fideduperange, kernel performs
> comparison check before dedupe. And it will increase
> time to dedupe files.

Creating the backup reflink file takes far more time than you will ever
save from fideduperange.

You don't need the md5sum either, unless you have a data set that is
full of crc32 collisions (e.g. a file format that puts a CRC32 at the
end of each 4K block).  The few people who have such a data set can
enable md5sums, everyone else can have md5sums disabled by default.

> I believe the risk involved with ficlonerange is  minimized 
> by having a backup file(reflinked). We can revert to older 
> original file, if we encounter some problems.

With fideduperange the risk is more than minimized--it's completely
eliminated.

If you don't use fideduperange you can't use the tool on a live data
set at all.

> > 
> > -- 
> > Have a nice day,
> > Timofey.
> 
> Cheers.
> Lakshmipathi.G

* Re: dduper - Offline btrfs deduplication tool
  2018-09-07 14:31     ` Adam Borowski
@ 2018-10-02 16:05       ` Lakshmipathi.G
  0 siblings, 0 replies; 6+ messages in thread
From: Lakshmipathi.G @ 2018-10-02 16:05 UTC
  To: Adam Borowski, ce3g8jdj; +Cc: Timofey Titovets, linux-btrfs

Apologies for the long delay in responding.  Thanks for the suggestions
and comments.

>On the other hand, full bit-to-bit comparison is faster and 100% safe

>With fideduperange the risk is more than minimized--it's completely
>eliminated.

Okay, got it. I will use fideduperange() as the default option for the tool
and keep ficlonerange() as a secondary option. I will make the code changes
and send a new patch soon. Thanks!

Cheers.
Lakshmipathi.G 
