Possible to dedpulicate read-only snapshots for space-efficient backups

All of lore.kernel.org
 help / color / mirror / Atom feed

* Possible to dedpulicate read-only snapshots for space-efficient backups
@ 2013-05-05 10:07 Kai Krakow
  2013-05-05 12:55 ` Gabriel de Perthuis
  2013-05-06  6:15 ` Possible to dedpulicate " Jan Schmidt
  0 siblings, 2 replies; 11+ messages in thread
From: Kai Krakow @ 2013-05-05 10:07 UTC (permalink / raw)
  To: linux-btrfs

Hey list,

I wonder if it is possible to deduplicate read-only snapshots.

Background:

I'm using an bash/rsync script[1] to backup my whole system on a nightly 
basis to an attached USB3 drive into a scratch area, then take a snapshot of 
this area. I'd like to have these snapshots immutable, so they should be 
read-only.

Since rsync won't discover moved files but instead place a new copy of that 
in the backup, I'm running the wonderful bedup application[2] to deduplicate 
my backup drive from time to time and it almost always gains back a good 
pile of gigabytes. The rest of storage space issues is taken care of by 
using rsync's inplace option (although this won't cover the case of files 
moved and changed between backup runs) and using compress-force=gzip.

Since bedup sets the immutable attribute during touching the files, I 
suspect the process will no longer work when I make the snapshots read-only.

I've read about ongoing work to integrate offline (and even online) 
deduplication into the kernel so that this process can be made atomic (and 
even block-based instead of file-based). This would - to my understandings - 
result in the immutable attribute no longer needed. So, given the fact above 
and for the case read-only snapshots cannot be used for this application 
currently, will these patches address the problem and read-only snapshots 
could be deduplicated? Or are read-only snapshots meant to be what the name 
suggests: Immutable, even for deduplication?

Regards,
Kai

[1]: https://gist.github.com/kakra/5520370
[2]: https://github.com/g2p/bedup

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-05 10:07 Possible to dedpulicate read-only snapshots for space-efficient backups Kai Krakow
@ 2013-05-05 12:55 ` Gabriel de Perthuis
  2013-05-05 17:22   ` Kai Krakow
  2013-05-06  6:15 ` Possible to dedpulicate " Jan Schmidt
  1 sibling, 1 reply; 11+ messages in thread
From: Gabriel de Perthuis @ 2013-05-05 12:55 UTC (permalink / raw)
  To: linux-btrfs

On Sun, 05 May 2013 12:07:17 +0200, Kai Krakow wrote:
> Hey list,
> 
> I wonder if it is possible to deduplicate read-only snapshots.
> 
> Background:
> 
> I'm using an bash/rsync script[1] to backup my whole system on a nightly 
> basis to an attached USB3 drive into a scratch area, then take a snapshot of 
> this area. I'd like to have these snapshots immutable, so they should be 
> read-only.
> 
> Since rsync won't discover moved files but instead place a new copy of that 
> in the backup, I'm running the wonderful bedup application[2] to deduplicate 
> my backup drive from time to time and it almost always gains back a good 
> pile of gigabytes. The rest of storage space issues is taken care of by 
> using rsync's inplace option (although this won't cover the case of files 
> moved and changed between backup runs) and using compress-force=gzip.

> I've read about ongoing work to integrate offline (and even online) 
> deduplication into the kernel so that this process can be made atomic (and 
> even block-based instead of file-based). This would - to my understandings - 
> result in the immutable attribute no longer needed. So, given the fact above 
> and for the case read-only snapshots cannot be used for this application 
> currently, will these patches address the problem and read-only snapshots 
> could be deduplicated? Or are read-only snapshots meant to be what the name 
> suggests: Immutable, even for deduplication?

There's no deep reason read-only snapshots should keep their storage
immutable, they can be affected by raid rebalancing for example.

The current bedup restriction comes from the clone call; Mark Fasheh's
dedup ioctl[3] appears to be fine with snapshots.  The bedup integration
(in a branch) is a work in progress at the moment.  I need to fix a scan
bug, tweak parameters for the latest kernel dedup patch, remove a lot of
logic that is now unnecessary, and figure out the compatibility story.

> Regards,
> Kai
> 
> [1]: https://gist.github.com/kakra/5520370
> [2]: https://github.com/g2p/bedup

[3]: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25062



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-05 12:55 ` Gabriel de Perthuis
@ 2013-05-05 17:22   ` Kai Krakow
  2013-05-07 22:07     ` Gabriel de Perthuis
  0 siblings, 1 reply; 11+ messages in thread
From: Kai Krakow @ 2013-05-05 17:22 UTC (permalink / raw)
  To: linux-btrfs

Gabriel de Perthuis <g2p.code@gmail.com> schrieb:

> There's no deep reason read-only snapshots should keep their storage
> immutable, they can be affected by raid rebalancing for example.

Sounds logical, and good...

> The current bedup restriction comes from the clone call; Mark Fasheh's
> dedup ioctl[3] appears to be fine with snapshots.  The bedup integration
> (in a branch) is a work in progress at the moment.  I need to fix a scan
> bug, tweak parameters for the latest kernel dedup patch, remove a lot of
> logic that is now unnecessary, and figure out the compatibility story.

I'd be eager to test as soon as the patches arrived in the official kernel 
distribution.

Do you plan to support deduplication on a finer grained basis than file 
level? As an example, in the end it could be interesting to deduplicate 1M 
blocks of huge files. Backups of VM images come to my mind as a good 
candidate. While my current backup script[1] takes care of this by using 
"rsync --inplace" it won't consider files moved between two backup cycles. 
This is the main purpose I'm using bedup for on my backup drive.

Maybe you could define another cutoff value to consider huge files for 
block-level deduplication?

Regards,
Kai

[1]: https://gist.github.com/kakra/5520370

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-05 10:07 Possible to dedpulicate read-only snapshots for space-efficient backups Kai Krakow
  2013-05-05 12:55 ` Gabriel de Perthuis
@ 2013-05-06  6:15 ` Jan Schmidt
  2013-05-06  7:44   ` Kai Krakow
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Schmidt @ 2013-05-06  6:15 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs

On Sun, May 05, 2013 at 12:07 (+0200), Kai Krakow wrote:
> I'm using an bash/rsync script[1] to backup my whole system on a nightly 
> basis to an attached USB3 drive into a scratch area, then take a snapshot of 
> this area. I'd like to have these snapshots immutable, so they should be 
> read-only.

Have you considered using btrfs send / receive for that purpose? You would just
save the dedup step.

-Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-06  6:15 ` Possible to dedpulicate " Jan Schmidt
@ 2013-05-06  7:44   ` Kai Krakow
  2013-05-06 14:35     ` james northrup
  0 siblings, 1 reply; 11+ messages in thread
From: Kai Krakow @ 2013-05-06  7:44 UTC (permalink / raw)
  To: linux-btrfs

Jan Schmidt <list.btrfs@jan-o-sch.net> schrieb:

>> I'm using an bash/rsync script[1] to backup my whole system on a nightly
>> basis to an attached USB3 drive into a scratch area, then take a snapshot
>> of this area. I'd like to have these snapshots immutable, so they should
>> be read-only.
> 
> Have you considered using btrfs send / receive for that purpose? You would
> just save the dedup step.

This is planned for later. In the first step I want to stay as file system 
agnostic for the source as possible. But I've put it on my todo list in the 
gist.

Regards,
Kai


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-06  7:44   ` Kai Krakow
@ 2013-05-06 14:35     ` james northrup
  2013-05-06 20:48       ` Kai Krakow
  0 siblings, 1 reply; 11+ messages in thread
From: james northrup @ 2013-05-06 14:35 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs

tried a git based backup? sounds spot-on as a compromise prior to
applying btrfs tweaks.  snapshotting the git binaries would have the
dedupe characteristics.

On Mon, May 6, 2013 at 12:44 AM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:
> Jan Schmidt <list.btrfs@jan-o-sch.net> schrieb:
>
>>> I'm using an bash/rsync script[1] to backup my whole system on a nightly
>>> basis to an attached USB3 drive into a scratch area, then take a snapshot
>>> of this area. I'd like to have these snapshots immutable, so they should
>>> be read-only.
>>
>> Have you considered using btrfs send / receive for that purpose? You would
>> just save the dedup step.
>
> This is planned for later. In the first step I want to stay as file system
> agnostic for the source as possible. But I've put it on my todo list in the
> gist.
>
> Regards,
> Kai
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-06 14:35     ` james northrup
@ 2013-05-06 20:48       ` Kai Krakow
  0 siblings, 0 replies; 11+ messages in thread
From: Kai Krakow @ 2013-05-06 20:48 UTC (permalink / raw)
  To: linux-btrfs

james northrup <northrup.james@gmail.com> schrieb:

> tried a git based backup? sounds spot-on as a compromise prior to
> applying btrfs tweaks.  snapshotting the git binaries would have the
> dedupe characteristics.

Git is efficient with space, yes. But if you have a lot of binary files, and 
a lot of them are big, git becomes really slow really fast. Checking out and 
in can be very slow and resource intensive then. And I don't think it would 
track ownership and permissions correctly.

Git is great, it's an everyday tool for me, but it is just not made for 
binary files.

Regards,
Kai

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-05 17:22   ` Kai Krakow
@ 2013-05-07 22:07     ` Gabriel de Perthuis
  2013-05-07 23:04       ` Kai Krakow
  0 siblings, 1 reply; 11+ messages in thread
From: Gabriel de Perthuis @ 2013-05-07 22:07 UTC (permalink / raw)
  To: linux-btrfs

> Do you plan to support deduplication on a finer grained basis than file 
> level? As an example, in the end it could be interesting to deduplicate 1M 
> blocks of huge files. Backups of VM images come to my mind as a good 
> candidate. While my current backup script[1] takes care of this by using 
> "rsync --inplace" it won't consider files moved between two backup cycles. 
> This is the main purpose I'm using bedup for on my backup drive.
> 
> Maybe you could define another cutoff value to consider huge files for 
> block-level deduplication?

I'm considering deduplicating aligned blocks of large files sharing the
same size (VMs with the same baseline.  Those would ideally come
pre-cowed, but rsync or scp could have broken that).

It sounds simple, and was sort-of prompted by the new syscall taking
short ranges, but it is tricky figuring out a sane heuristic (when to
hash, when to bail, when to submit without comparing, what should be the
source in the last case), and it's not something I have an immediate
need for.  It is also possible to use 9p (with standard cow and/or
small-file dedup) and trade a bit of configuration for much more
space-efficient VMs.

Finer-grained tracking of which ranges have changed, and maybe some
caching of range hashes, would be a good first step before doing any
crazy large-file heuristics.  The hash caching would actually benefit
all use cases.

> Regards,
> Kai
> 
> [1]: https://gist.github.com/kakra/5520370

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-07 22:07     ` Gabriel de Perthuis
@ 2013-05-07 23:04       ` Kai Krakow
  2013-05-07 23:22         ` Kai Krakow
  2013-05-07 23:35         ` Possible to deduplicate " Gabriel de Perthuis
  0 siblings, 2 replies; 11+ messages in thread
From: Kai Krakow @ 2013-05-07 23:04 UTC (permalink / raw)
  To: linux-btrfs

Gabriel de Perthuis <g2p.code@gmail.com> schrieb:

> It sounds simple, and was sort-of prompted by the new syscall taking
> short ranges, but it is tricky figuring out a sane heuristic (when to
> hash, when to bail, when to submit without comparing, what should be the
> source in the last case), and it's not something I have an immediate
> need for.  It is also possible to use 9p (with standard cow and/or
> small-file dedup) and trade a bit of configuration for much more
> space-efficient VMs.
> 
> Finer-grained tracking of which ranges have changed, and maybe some
> caching of range hashes, would be a good first step before doing any
> crazy large-file heuristics.  The hash caching would actually benefit
> all use cases.

Looking back to good old peer-2-peer days (I think we all got in touch with 
that the one or the other way), one title pops back into my mind: tiger-
tree-hash...

I'm not really into it, but would it be possible to use tiger-tree-hashes to 
find identical blocks? Even accross different sized files...

Regards,
Kai




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to dedpulicate read-only snapshots for space-efficient backups
  2013-05-07 23:04       ` Kai Krakow
@ 2013-05-07 23:22         ` Kai Krakow
  2013-05-07 23:35         ` Possible to deduplicate " Gabriel de Perthuis
  1 sibling, 0 replies; 11+ messages in thread
From: Kai Krakow @ 2013-05-07 23:22 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow <hurikhan77+btrfs@gmail.com> schrieb:

> Gabriel de Perthuis <g2p.code@gmail.com> schrieb:
> 
>> It sounds simple, and was sort-of prompted by the new syscall taking
>> short ranges, but it is tricky figuring out a sane heuristic (when to
>> hash, when to bail, when to submit without comparing, what should be the
>> source in the last case), and it's not something I have an immediate
>> need for.  It is also possible to use 9p (with standard cow and/or
>> small-file dedup) and trade a bit of configuration for much more
>> space-efficient VMs.
>> 
>> Finer-grained tracking of which ranges have changed, and maybe some
>> caching of range hashes, would be a good first step before doing any
>> crazy large-file heuristics.  The hash caching would actually benefit
>> all use cases.
> 
> Looking back to good old peer-2-peer days (I think we all got in touch
> with that the one or the other way), one title pops back into my mind:
> tiger- tree-hash...
> 
> I'm not really into it, but would it be possible to use tiger-tree-hashes
> to find identical blocks? Even accross different sized files...

While thinking about it: That hash was probably invented for the purpose of 
distributing the same content to multiple peers in as small deltas as 
possible. Well, deduplication is somehow the other way around: Coalescing 
all those wild distribution back into a single source of content. So some 
"inverse" of tiger-tree would probably work better / more efficient.

Regards,
Kai


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Possible to deduplicate read-only snapshots for space-efficient backups
  2013-05-07 23:04       ` Kai Krakow
  2013-05-07 23:22         ` Kai Krakow
@ 2013-05-07 23:35         ` Gabriel de Perthuis
  1 sibling, 0 replies; 11+ messages in thread
From: Gabriel de Perthuis @ 2013-05-07 23:35 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 08 May 2013 01:04:38 +0200, Kai Krakow wrote:
> Gabriel de Perthuis <g2p.code@gmail.com> schrieb:
>> It sounds simple, and was sort-of prompted by the new syscall taking
>> short ranges, but it is tricky figuring out a sane heuristic (when to
>> hash, when to bail, when to submit without comparing, what should be the
>> source in the last case), and it's not something I have an immediate
>> need for.  It is also possible to use 9p (with standard cow and/or
>> small-file dedup) and trade a bit of configuration for much more
>> space-efficient VMs.
>> 
>> Finer-grained tracking of which ranges have changed, and maybe some
>> caching of range hashes, would be a good first step before doing any
>> crazy large-file heuristics.  The hash caching would actually benefit
>> all use cases.
> 
> Looking back to good old peer-2-peer days (I think we all got in touch with 
> that the one or the other way), one title pops back into my mind: tiger-
> tree-hash...
> 
> I'm not really into it, but would it be possible to use tiger-tree-hashes to 
> find identical blocks? Even accross different sized files...

Possible, but bedup is all about doing as little io as it can get away
with, doing streaming reads only when it has sampled that the files are
likely duplicates and not spending a ton of disk space for indexing.

Hashing everything in the hope that there are identical blocks at
unrelated places on the disk is a much more resource-intensive approach;
Liu Bo is working on that, following ZFS's design choices.



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-05-07 23:35 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-05 10:07 Possible to dedpulicate read-only snapshots for space-efficient backups Kai Krakow
2013-05-05 12:55 ` Gabriel de Perthuis
2013-05-05 17:22   ` Kai Krakow
2013-05-07 22:07     ` Gabriel de Perthuis
2013-05-07 23:04       ` Kai Krakow
2013-05-07 23:22         ` Kai Krakow
2013-05-07 23:35         ` Possible to deduplicate " Gabriel de Perthuis
2013-05-06  6:15 ` Possible to dedpulicate " Jan Schmidt
2013-05-06  7:44   ` Kai Krakow
2013-05-06 14:35     ` james northrup
2013-05-06 20:48       ` Kai Krakow

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.