* Copy on write of unmodified data
@ 2016-05-25  8:58 H. Peter Anvin
  2016-05-25  9:29 ` Hugo Mills
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: H. Peter Anvin @ 2016-05-25  8:58 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I'm looking at using a btrfs with snapshots to implement a generational
backup capacity.  However, doing it the naïve way would have the side
effect that for a file that has been partially modified, after
snapshotting the file would be written with *mostly* the same data.  How
does btrfs' COW algorithm deal with that?  If necessary I might want to
write some smarter user space utilities for this.

	-hpa

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Copy on write of unmodified data
  2016-05-25  8:58 Copy on write of unmodified data H. Peter Anvin
@ 2016-05-25  9:29 ` Hugo Mills
  2016-05-25 11:00   ` H. Peter Anvin
  2016-05-25 13:06   ` Dmitry Katsubo
  2016-05-25 11:45 ` Austin S. Hemmelgarn
  2016-05-25 16:16 ` Henk Slager
  2 siblings, 2 replies; 9+ messages in thread
From: Hugo Mills @ 2016-05-25  9:29 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-btrfs

On Wed, May 25, 2016 at 01:58:15AM -0700, H. Peter Anvin wrote:
> Hi,
> 
> I'm looking at using a btrfs with snapshots to implement a generational
> backup capacity.  However, doing it the naïve way would have the side
> effect that for a file that has been partially modified, after
> snapshotting the file would be written with *mostly* the same data.  How
> does btrfs' COW algorithm deal with that?  If necessary I might want to
> write some smarter user space utilities for this.

   Sounds like it might be a job for one of the dedup tools
(duperemove, bedup), or, if you're writing your own, the safe
deduplication ioctl which underlies those tools.
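
For anyone writing their own tool, here is a minimal sketch of calling
that ioctl from Python. It is an illustration, not code from this
thread. It assumes the ioctl meant here is the one exposed as
FIDEDUPERANGE in <linux/fs.h> (BTRFS_IOC_FILE_EXTENT_SAME in the btrfs
headers), and that the request number below matches your architecture
(it is the x86-64/arm64 encoding). The helper names pack_dedupe_request
and dedupe_range are hypothetical.

```python
import fcntl
import struct

# Assumption: FIDEDUPERANGE as encoded on x86-64 and arm64; some other
# architectures encode ioctl numbers differently.
FIDEDUPERANGE = 0xC0189436

# struct file_dedupe_range: src_offset, src_length, dest_count,
# reserved1, reserved2
HEADER = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd, dest_offset, bytes_deduped,
# status, reserved
INFO = struct.Struct("=qQQiI")


def pack_dedupe_request(src_offset, length, dest_fd, dest_offset):
    """Build the argument buffer for a single-destination request."""
    header = HEADER.pack(src_offset, length, 1, 0, 0)
    info = INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    return bytearray(header + info)


def dedupe_range(src_fd, src_offset, length, dest_fd, dest_offset):
    """Ask the kernel to share one range between two open files.

    Returns (bytes_deduped, status). A status of 0 means the ranges
    were identical and are now shared, 1 means the data differed, and
    a negative value is an errno.
    """
    buf = pack_dedupe_request(src_offset, length, dest_fd, dest_offset)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)  # kernel fills in the results
    _, _, bytes_deduped, status, _ = INFO.unpack_from(buf, HEADER.size)
    return bytes_deduped, status
```

The call is "safe" in the sense that the kernel compares both ranges
itself and only shares them if the bytes match, so a racing writer
cannot cause data loss.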

   Hugo.

-- 
Hugo Mills             | Well, sir, the floor is yours. But remember, the
hugo@... carfax.org.uk | roof is ours!
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                             The Goons

* Re: Copy on write of unmodified data
  2016-05-25  9:29 ` Hugo Mills
@ 2016-05-25 11:00   ` H. Peter Anvin
  2016-05-25 11:07     ` Hugo Mills
  2016-05-25 13:06   ` Dmitry Katsubo
  1 sibling, 1 reply; 9+ messages in thread
From: H. Peter Anvin @ 2016-05-25 11:00 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

On 05/25/16 02:29, Hugo Mills wrote:
> On Wed, May 25, 2016 at 01:58:15AM -0700, H. Peter Anvin wrote:
>> Hi,
>>
>> I'm looking at using a btrfs with snapshots to implement a generational
>> backup capacity.  However, doing it the naïve way would have the side
>> effect that for a file that has been partially modified, after
>> snapshotting the file would be written with *mostly* the same data.  How
>> does btrfs' COW algorithm deal with that?  If necessary I might want to
>> write some smarter user space utilities for this.
> 
>    Sounds like it might be a job for one of the dedup tools
> (duperemove, bedup), or, if you're writing your own, the safe
> deduplication ioctl which underlies those tools.
> 

I guess I would prefer if data wasn't first duplicated and then
deduplicated if possible.  It sounds like I ought to write a "smart
copy-overwrite" tool for this.

	-hpa




* Re: Copy on write of unmodified data
  2016-05-25 11:00   ` H. Peter Anvin
@ 2016-05-25 11:07     ` Hugo Mills
  2016-05-25 11:32       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 9+ messages in thread
From: Hugo Mills @ 2016-05-25 11:07 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-btrfs

On Wed, May 25, 2016 at 04:00:00AM -0700, H. Peter Anvin wrote:
> On 05/25/16 02:29, Hugo Mills wrote:
> > On Wed, May 25, 2016 at 01:58:15AM -0700, H. Peter Anvin wrote:
> >> Hi,
> >>
> >> I'm looking at using a btrfs with snapshots to implement a generational
> >> backup capacity.  However, doing it the naïve way would have the side
> >> effect that for a file that has been partially modified, after
> >> snapshotting the file would be written with *mostly* the same data.  How
> >> does btrfs' COW algorithm deal with that?  If necessary I might want to
> >> write some smarter user space utilities for this.
> > 
> >    Sounds like it might be a job for one of the dedup tools
> > (duperemove, bedup), or, if you're writing your own, the safe
> > deduplication ioctl which underlies those tools.
> > 
> 
> I guess I would prefer if data wasn't first duplicated and then
> deduplicated if possible.  It sounds like I ought to write a "smart
> copy-overwrite" tool for this.

   I _think_ rsync --inplace may help here. IIRC, it'll only
overwrite the sections of files that have changed, rather than write
and replace the whole file. (I may be wrong about that, though. I
haven't tested it at that level).
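
As a rough illustration of the "smart copy-overwrite" idea, the sketch
below compares source and destination block by block and rewrites only
the blocks that differ, so unchanged blocks stay shared with earlier
snapshots. It is not rsync's algorithm and not code from this thread.
The function name overwrite_changed_blocks and the 4096-byte block size
are assumptions for the example. Note also that, per the rsync manual,
rsync defaults to whole-file transfers when both paths are local, so
the delta behaviour there additionally needs --no-whole-file.

```python
import os


def overwrite_changed_blocks(src_path, dst_path, block_size=4096):
    """Make dst identical to src, writing only the blocks that differ.

    Returns the number of blocks actually written.
    """
    written = 0
    # Create the destination if needed, without truncating existing data.
    dst_fd = os.open(dst_path, os.O_RDWR | os.O_CREAT, 0o644)
    with open(src_path, "rb") as src, os.fdopen(dst_fd, "r+b") as dst:
        offset = 0
        while True:
            new = src.read(block_size)
            if not new:
                break
            dst.seek(offset)
            old = dst.read(len(new))
            if old != new:
                # Only this block is rewritten (and hence COWed).
                dst.seek(offset)
                dst.write(new)
                written += 1
            offset += len(new)
        dst.truncate(offset)  # drop any tail left from a longer old file
        dst.flush()
        os.fsync(dst.fileno())
    return written
```

The price is that the destination is read in full on every run, and
the update is not atomic: a crash mid-run leaves a mix of old and new
blocks.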

   There's also the in-band dedup patches that have been on the
mailing list recently. That's probably going to need massive amounts
of RAM, though.

   Hugo.

-- 
Hugo Mills             | Putting U back in Honor, Valor, and Trth.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

* Re: Copy on write of unmodified data
  2016-05-25 11:07     ` Hugo Mills
@ 2016-05-25 11:32       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-25 11:32 UTC (permalink / raw)
  To: Hugo Mills, H. Peter Anvin, linux-btrfs

On 2016-05-25 07:07, Hugo Mills wrote:
> On Wed, May 25, 2016 at 04:00:00AM -0700, H. Peter Anvin wrote:
>> On 05/25/16 02:29, Hugo Mills wrote:
>>> On Wed, May 25, 2016 at 01:58:15AM -0700, H. Peter Anvin wrote:
>>>> Hi,
>>>>
>>>> I'm looking at using a btrfs with snapshots to implement a generational
>>>> backup capacity.  However, doing it the naïve way would have the side
>>>> effect that for a file that has been partially modified, after
>>>> snapshotting the file would be written with *mostly* the same data.  How
>>>> does btrfs' COW algorithm deal with that?  If necessary I might want to
>>>> write some smarter user space utilities for this.
>>>
>>>    Sounds like it might be a job for one of the dedup tools
>>> (duperemove, bedup), or, if you're writing your own, the safe
>>> deduplication ioctl which underlies those tools.
>>>
>>
>> I guess I would prefer if data wasn't first duplicated and then
>> deduplicated if possible.  It sounds like I ought to write a "smart
>> copy-overwrite" tool for this.
>
>    I _think_ rsync --inplace may help here. IIRC, it'll only
> overwrite the sections of files that have changed, rather than write
> and replace the whole file. (I may be wrong about that, though. I
> haven't tested it at that level).
This is absolutely correct, and I actually use rsync instead of cp on a 
regular basis partly for this reason.

* Re: Copy on write of unmodified data
  2016-05-25  8:58 Copy on write of unmodified data H. Peter Anvin
  2016-05-25  9:29 ` Hugo Mills
@ 2016-05-25 11:45 ` Austin S. Hemmelgarn
  2016-05-25 12:28   ` Hugo Mills
  2016-05-25 16:16 ` Henk Slager
  2 siblings, 1 reply; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-25 11:45 UTC (permalink / raw)
  To: H. Peter Anvin, linux-btrfs

On 2016-05-25 04:58, H. Peter Anvin wrote:
> Hi,
>
> I'm looking at using a btrfs with snapshots to implement a generational
> backup capacity.  However, doing it the naïve way would have the side
> effect that for a file that has been partially modified, after
> snapshotting the file would be written with *mostly* the same data.  How
> does btrfs' COW algorithm deal with that?  If necessary I might want to
> write some smarter user space utilities for this.
>
I might be completely incorrect about this, but here's what I believe 
happens in this case:
1. If the file is small enough that it gets stored in-line in the 
metadata, you can't avoid COW for the whole file.
2. If the file is less than the block size (16k is the current default 
in mkfs.btrfs for reasonably sized filesystems), then you also can't 
avoid COW for the whole file.
3. If the file is larger than the block size, COW will only happen 
per-block, and extents will get split at block boundaries to minimize 
the amount of duplication.
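
To put numbers on case 3, here is a small worked sketch. It assumes, as
the list above does, that any block touched by an in-place write is
copied whole and every other block stays shared with the snapshot. The
function name cow_blocks is hypothetical, and the block size is left as
a parameter because the real granularity depends on how the filesystem
was created (4096 below is only for illustration).

```python
def cow_blocks(file_size, writes, block_size=4096):
    """Return (rewritten_blocks, shared_blocks) for in-place writes.

    `writes` is a list of (offset, length) byte ranges modified after
    a snapshot; any block touched by a write counts as rewritten.
    """
    total = -(-file_size // block_size)  # ceiling division
    touched = set()
    for offset, length in writes:
        if length <= 0:
            continue
        first = offset // block_size
        last = (offset + length - 1) // block_size
        # Clamp to the last block of the file.
        touched.update(range(first, min(last, total - 1) + 1))
    return len(touched), total - len(touched)


# A 1 MiB file with a 100-byte write near the start and a 4-byte write
# straddling the boundary between blocks 1 and 2:
print(cow_blocks(1 << 20, [(10, 100), (8190, 4)]))  # (3, 253)
```

So 104 modified bytes cost three rewritten blocks here, while 253 of
the 256 blocks remain shared. A file smaller than one block is always
rewritten whole, which matches cases 1 and 2.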

This of course requires that the updates are done by partial re-writes 
instead of a replace-by-rename semantic which is particularly popular 
among various software tools.

FWIW, while I don't use BTRFS like this (I just use snapshots to get a 
consistent state to copy out for backups, usually doing the actual 
backup using SquashFS), one of my friends uses rsync together with BTRFS 
to do incremental backups of his personal systems.  He runs rsync with
--inplace on the system being backed up to copy things out to a
dedicated subvolume on his backup device, and then snapshots the
subvolume after each backup (and uses a snapshot thinning system similar
to that used by snapper).  While it's not quite as efficient as it could
be, it still works well.
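
A minimal sketch of that rsync-then-snapshot cycle, driven from Python.
This is a guess at the general shape, not the friend's actual setup.
The paths, the timestamp format, and the helper names backup_commands
and run_backup are all hypothetical. It uses rsync's in-place option
(spelled --inplace in the rsync manual) plus --no-whole-file, which
matters when source and destination are both local paths, and it takes
a read-only snapshot with btrfs subvolume snapshot -r.

```python
import subprocess
from datetime import datetime, timezone


def backup_commands(source, subvolume, snapshot_dir, now=None):
    """Return the two argv lists for one backup run."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H%M%SZ")
    rsync = [
        "rsync", "-a", "--delete", "--inplace", "--no-whole-file",
        source.rstrip("/") + "/", subvolume.rstrip("/") + "/",
    ]
    snapshot = [
        "btrfs", "subvolume", "snapshot", "-r",
        subvolume, f"{snapshot_dir.rstrip('/')}/{stamp}",
    ]
    return [rsync, snapshot]


def run_backup(source, subvolume, snapshot_dir):
    """Sync into the subvolume, then freeze the result as a snapshot."""
    for argv in backup_commands(source, subvolume, snapshot_dir):
        subprocess.run(argv, check=True)
```

Snapshot thinning (deleting older snapshots on a schedule) is left out
of the sketch.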

Alternatively, if you're backing up a BTRFS filesystem to another one, 
you can keep around the previous backup snapshot and do an incremental 
send against that, which will result in proper sharing of blocks.  I 
used to use this before I decided that I wanted better space efficiency 
for backups than BTRFS can currently offer.
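
The incremental send described above could be scripted roughly as
follows. This is a sketch under assumptions: both snapshots are
read-only, the parent snapshot already exists on the receiving side,
and the helper names send_argv and send_incremental are hypothetical.
It pipes btrfs send -p PARENT SNAPSHOT into btrfs receive.

```python
import subprocess


def send_argv(snapshot, parent=None):
    """Build the btrfs send command; with a parent it is incremental."""
    argv = ["btrfs", "send"]
    if parent is not None:
        argv += ["-p", parent]
    return argv + [snapshot]


def send_incremental(snapshot, parent, dest_dir):
    """Stream the difference between parent and snapshot to dest_dir."""
    sender = subprocess.Popen(send_argv(snapshot, parent),
                              stdout=subprocess.PIPE)
    receiver = subprocess.Popen(["btrfs", "receive", dest_dir],
                                stdin=sender.stdout)
    # Close our copy of the pipe so the sender sees a broken pipe if
    # the receiver exits early.
    sender.stdout.close()
    receive_rc = receiver.wait()
    send_rc = sender.wait()
    if receive_rc != 0 or send_rc != 0:
        raise RuntimeError("btrfs send/receive failed")
```

Only the previous snapshot needs to be kept on the source side to serve
as the parent for the next run.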

* Re: Copy on write of unmodified data
  2016-05-25 11:45 ` Austin S. Hemmelgarn
@ 2016-05-25 12:28   ` Hugo Mills
  0 siblings, 0 replies; 9+ messages in thread
From: Hugo Mills @ 2016-05-25 12:28 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: H. Peter Anvin, linux-btrfs

On Wed, May 25, 2016 at 07:45:23AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-25 04:58, H. Peter Anvin wrote:
> >Hi,
> >
> >I'm looking at using a btrfs with snapshots to implement a generational
> >backup capacity.  However, doing it the naïve way would have the side
> >effect that for a file that has been partially modified, after
> >snapshotting the file would be written with *mostly* the same data.  How
> >does btrfs' COW algorithm deal with that?  If necessary I might want to
> >write some smarter user space utilities for this.
> >
> I might be completely incorrect about this, but here's what I
> believe happens in this case:
> 1. If the file is small enough that it gets stored in-line in the
> metadata, you can't avoid COW for the whole file.
> 2. If the file is less than the block size (16k is the current
> default in mkfs.btrfs for reasonably sized filesystems), then you
> also can't avoid COW for the whole file.
> 3. If the file is larger than the block size, COW will only happen
> per-block, and extents will get split at block boundaries to
> minimize the amount of duplication.
> 
> This of course requires that the updates are done by partial
> re-writes instead of a replace-by-rename semantic which is
> particularly popular among various software tools.

   The reason it's popular is that it can be made atomic -- either the
updates all make it to the named file, or they don't (obviously, only
if it's done in the right way, which many applications don't). If you
overwrite in place, then it can't be an atomic update.

   You could get both effects (minimal replacement and atomic update)
if you reflink copy the file, update in place on the copy, and then
replace it atomically, but that of course needs the tool to support it
and fall back to a sane default if reflinks aren't available.
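
A minimal sketch of that reflink, patch, and rename sequence follows.
It is an illustration under assumptions rather than a finished tool. It
uses the FICLONE ioctl from <linux/fs.h> with the request number as
encoded on x86-64/arm64, and falls back to an ordinary copy when the
filesystem cannot reflink. The helper names clone_or_copy and
atomic_patch and the ".tmp" suffix are hypothetical, and file ownership
and permissions are not carried over.

```python
import fcntl
import os
import shutil

# Assumption: FICLONE as encoded on x86-64 and arm64.
FICLONE = 0x40049409


def clone_or_copy(src_path, tmp_path):
    """Reflink src to tmp if possible, else copy. True if reflinked."""
    with open(src_path, "rb") as src, open(tmp_path, "wb") as tmp:
        try:
            fcntl.ioctl(tmp.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # No reflink support here: fall back to a plain copy.
            shutil.copyfileobj(src, tmp)
            return False


def atomic_patch(path, offset, data):
    """Apply an in-place edit to a clone, then swap it in atomically."""
    tmp_path = path + ".tmp"  # hypothetical temporary name
    clone_or_copy(path, tmp_path)
    with open(tmp_path, "r+b") as tmp:
        tmp.seek(offset)
        tmp.write(data)  # only the touched blocks get new extents
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)  # readers see the old or the new file
```

For example, atomic_patch("/data/big.img", 5000, b"BBBB") would change
four bytes while the rest of the file keeps sharing extents with the
old version, provided the reflink succeeded.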

   Hugo.

> FWIW, while I don't use BTRFS like this (I just use snapshots to get
> a consistent state to copy out for backups, usually doing the actual
> backup using SquashFS), one of my friends uses rsync together with
> BTRFS to do incremental backups of his personal systems.  He runs
> rsync with --inplace on the system being backed up to copy things
> out to a dedicated subvolume on his backup device, and then
> snapshots the subvolume after each backup (and uses a snapshot
> thinning system similar to that used by snapper).  While it's not
> quite as efficient as it could be, it still works well.
> 
> Alternatively, if you're backing up a BTRFS filesystem to another
> one, you can keep around the previous backup snapshot and do an
> incremental send against that, which will result in proper sharing
> of blocks.  I used to use this before I decided that I wanted better
> space efficiency for backups than BTRFS can currently offer.

-- 
Hugo Mills             | A diverse working environment: Di longer you vork
hugo@... carfax.org.uk | here, di verse it gets
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

* Re: Copy on write of unmodified data
  2016-05-25  9:29 ` Hugo Mills
  2016-05-25 11:00   ` H. Peter Anvin
@ 2016-05-25 13:06   ` Dmitry Katsubo
  1 sibling, 0 replies; 9+ messages in thread
From: Dmitry Katsubo @ 2016-05-25 13:06 UTC (permalink / raw)
  To: linux-btrfs

On 2016-05-25 11:29, Hugo Mills wrote:
> On Wed, May 25, 2016 at 01:58:15AM -0700, H. Peter Anvin wrote:
>> Hi,
>> 
>> I'm looking at using a btrfs with snapshots to implement a generational
>> backup capacity.  However, doing it the naïve way would have the side
>> effect that for a file that has been partially modified, after
>> snapshotting the file would be written with *mostly* the same data.  How
>> does btrfs' COW algorithm deal with that?  If necessary I might want to
>> write some smarter user space utilities for this.
> 
> Sounds like it might be a job for one of the dedup tools
> (duperemove, bedup), or, if you're writing your own, the safe
> deduplication ioctl which underlies those tools.
> 
> Hugo.

Perhaps it really makes sense to delegate de-duplication to third-party
software like BackupPC [1]. I am not sure if btrfs can manage it more
effectively, as in order to find duplicates it would need to scan and
analyse all blocks, so at least it would take longer.

[1] https://sourceforge.net/projects/backuppc/

-- 
With best regards,
Dmitry

* Re: Copy on write of unmodified data
  2016-05-25  8:58 Copy on write of unmodified data H. Peter Anvin
  2016-05-25  9:29 ` Hugo Mills
  2016-05-25 11:45 ` Austin S. Hemmelgarn
@ 2016-05-25 16:16 ` Henk Slager
  2 siblings, 0 replies; 9+ messages in thread
From: Henk Slager @ 2016-05-25 16:16 UTC (permalink / raw)
  To: linux-btrfs

On Wed, May 25, 2016 at 10:58 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> Hi,
>
> I'm looking at using a btrfs with snapshots to implement a generational
> backup capacity.  However, doing it the naïve way would have the side
> effect that for a file that has been partially modified, after
> snapshotting the file would be written with *mostly* the same data.  How
> does btrfs' COW algorithm deal with that?  If necessary I might want to
> write some smarter user space utilities for this.

Assuming 'snapshots' plural refers to incremental snapshots of a
subvolume, you might want to use the send ioctl of the kernel.
From userspace, the output of btrfs-progs' btrfs send --no-data might
give some hints.
