* BackupPC, per-dir hard link limit, Debian packaging
From: Robert Collins @ 2010-03-02  2:29 UTC
  To: linux-btrfs

I realise that the hard link limit is in the queue to be fixed, and I have
read the recent thread as well as the older (October, I think) thread.

I just wanted to note that BackupPC *does* in fact run into the hard
link limit, and it's due to the dpkg configuration scripts.

BackupPC hard-links files with the same content together: it scans newly
transferred files and links identical ones together, whether or not they
started out as hard links on the source PCs.

It also builds a directory structure precisely matching the source
machine (basically it rsyncs across, then hardlinks aggressively).
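
For illustration, a minimal sketch of that scan-and-link step (this is not
BackupPC's actual code; the pool and backup paths are made up, and file
names containing newlines are not handled):

#!/bin/sh
# Illustrative sketch of post-transfer hash-and-hardlink deduplication,
# the way a BackupPC-style store pools identical files.
POOL=/backup/pool
BACKUP=/backup/host1/2010-03-02

find "$BACKUP" -type f | while read -r f; do
    hash=$(sha1sum "$f" | awk '{ print $1 }')
    if [ -e "$POOL/$hash" ]; then
        # Same content already pooled: replace this copy with a hard link,
        # so all backups of identical files share one inode.
        ln -f "$POOL/$hash" "$f"
    else
        # First time we see this content: add it to the pool.
        ln "$f" "$POOL/$hash"
    fi
done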

If you back up a Debian host, /var/lib/dpkg/info contains many identical
files because debhelper generates the same script in the common case:
ls /var/lib/dpkg/info/*.postinst | xargs -n1 sha1sum | awk '{ print $1 }' | sort -u | wc -l
862
ls /var/lib/dpkg/info/*.postinst | wc -l
1533

As I say, I realise this is queued to be addressed anyway, but it seems
like a realistic thing for people to do (use BackupPC on btrfs), even
if something better can still be written to replace the BackupPC store
in the future. I will note, though, that simple snapshots won't achieve
the deduplication level that BackupPC does, because the files don't
start out the same: they are only identified as identical post-backup.

Cheers,
Rob



* Re: BackupPC, per-dir hard link limit, Debian packaging
From: Hubert Kario @ 2010-03-02 13:09 UTC
  To: linux-btrfs; +Cc: Robert Collins

On Tuesday 02 March 2010 03:29:05 Robert Collins wrote:
> As I say, I realise this is queued to be addressed anyway, but it seems
> like a realistic thing for people to do (use BackupPC on btrfs), even
> if something better can still be written to replace the BackupPC store
> in the future. I will note, though, that simple snapshots won't achieve
> the deduplication level that BackupPC does, because the files don't
> start out the same: they are only identified as identical post-backup.

Isn't the main idea behind deduplication to merge identical parts of files
together using CoW? That way you could have many very similar virtual
machine images, run the deduplication process, and massively reduce the
space used while maintaining the differences between images.
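
For what it's worth, the file-level building block already exists: on
btrfs, cp --reflink creates a copy that shares all data extents with the
source until either file is modified. A small illustrative sketch (the
image names and the /mnt/btrfs mount point are made up):

# On a btrfs filesystem mounted at /mnt/btrfs (hypothetical paths).
# The reflinked copy shares every extent with the original, so it takes
# almost no extra data space until one of the files is written to.
cp --reflink=always /mnt/btrfs/vm-base.img /mnt/btrfs/vm-clone.img

# Overwriting a few blocks of the clone allocates new extents only for
# those blocks; everything else stays shared with vm-base.img.
dd if=/dev/urandom of=/mnt/btrfs/vm-clone.img bs=4K count=1 conv=notrunc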

If memory serves me right, the plan is to do it in userland, after the
fact, on an existing filesystem, not while the data is being saved. If
such a daemon or program were available, you would run it on the system
after rsyncing the workstations.

The question remains, though, which approach would reduce space usage more
in your use case. From my experience, hard links take less space on disk;
I don't know whether it would be possible to optimise the btrfs CoW
mechanism for files that are exactly the same.

>
> Cheers,
> Rob

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

Quality Management System
compliant with ISO 9001:2000

* Re: BackupPC, per-dir hard link limit, Debian packaging
From: jim owens @ 2010-03-02 23:22 UTC
  To: Hubert Kario; +Cc: linux-btrfs, Robert Collins

Hubert Kario wrote:
> On Tuesday 02 March 2010 03:29:05 Robert Collins wrote:
>> As I say, I realise this is queued to be addressed anyway, but it seems
>> like a realistic thing for people to do (use BackupPC on btrfs), even
>> if something better can still be written to replace the BackupPC store
>> in the future. I will note, though, that simple snapshots won't achieve
>> the deduplication level that BackupPC does, because the files don't
>> start out the same: they are only identified as identical post-backup.
> 
> Isn't the main idea behind deduplication to merge identical parts of
> files together using CoW? That way you could have many very similar
> virtual machine images, run the deduplication process, and massively
> reduce the space used while maintaining the differences between images.
> 
> If memory serves me right, the plan is to do it in userland, after the
> fact, on an existing filesystem, not while the data is being saved. If
> such a daemon or program were available, you would run it on the system
> after rsyncing the workstations.
> 
> The question remains, though, which approach would reduce space usage
> more in your use case. From my experience, hard links take less space
> on disk; I don't know whether it would be possible to optimise the
> btrfs CoW mechanism for files that are exactly the same.

Space use is not the key difference between these methods.
Btrfs CoW makes data sharing safe: a write to one file leaves the shared
copies untouched. With the hard link method, changing one file silently
changes (and so invalidates) the content of every other backup linked to it.

So a BackupPC store should be treated as read-only.
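
To make that concrete, a small sketch (hypothetical file names, run on a
btrfs filesystem):

# Contrast a hard link with a CoW (reflink) copy on btrfs.
echo "original" > a
ln a a.hardlink                  # same inode: both names see one content
cp --reflink=always a a.reflink  # new inode: shares extents, copied on write

echo "changed" > a               # rewrite the original

cat a.hardlink   # prints "changed"  - the edit shows through the hard link
cat a.reflink    # prints "original" - the CoW copy kept its own content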

jim


* Re: BackupPC, per-dir hard link limit, Debian packaging
From: Hubert Kario @ 2010-03-03  0:05 UTC
  To: jim owens; +Cc: linux-btrfs, Robert Collins

On Wednesday 03 March 2010 00:22:31 jim owens wrote:
> Hubert Kario wrote:
> > On Tuesday 02 March 2010 03:29:05 Robert Collins wrote:
> >> As I say, I realise this is queued to be addressed anyway, but it
> >> seems like a realistic thing for people to do (use BackupPC on btrfs),
> >> even if something better can still be written to replace the BackupPC
> >> store in the future. I will note, though, that simple snapshots won't
> >> achieve the deduplication level that BackupPC does, because the files
> >> don't start out the same: they are only identified as identical
> >> post-backup.
> >
> > Isn't the main idea behind deduplication to merge identical parts of
> > files together using CoW? That way you could have many very similar
> > virtual machine images, run the deduplication process, and massively
> > reduce the space used while maintaining the differences between images.
> >
> > If memory serves me right, the plan is to do it in userland, after the
> > fact, on an existing filesystem, not while the data is being saved. If
> > such a daemon or program were available, you would run it on the system
> > after rsyncing the workstations.
> >
> > The question remains, though, which approach would reduce space usage
> > more in your use case. From my experience, hard links take less space
> > on disk; I don't know whether it would be possible to optimise the
> > btrfs CoW mechanism for files that are exactly the same.
>
> Space use is not the key difference between these methods.
> Btrfs CoW makes data sharing safe: a write to one file leaves the shared
> copies untouched. With the hard link method, changing one file silently
> changes (and so invalidates) the content of every other backup linked to
> it.
>
> So a BackupPC store should be treated as read-only.

I know that, but if you're using "dumb" tools to replicate systems (say,
rsync), you don't want them to overwrite different versions of files, and
you still want to reclaim the disk space used by essentially the same data.

My idea of using btrfs as backup storage, with CoW rather than hard links
for duplicated files, comes from the need to keep archival copies
(something not really possible with hard links) in a way similar to
rdiff-backup.

For the first backup I just rsync from all workstations to the backup
server. On subsequent backups I first copy the last version to a
.snapshot/todays-date directory using CoW, then rsync from the
workstations again, and finally run the deduplication daemon.
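
A rough sketch of that cycle (illustrative only; the paths and host names
are placeholders, and "dedup-daemon" stands in for the post-backup
deduplication tool discussed above, which does not exist yet):

#!/bin/sh
# Illustrative backup cycle on a btrfs backup server.
TODAY=$(date +%F)

# Keep the previous state as an archival copy, sharing extents via CoW.
cp -a --reflink=always /backup/current "/backup/.snapshot/$TODAY"

# Refresh the live copy from the workstations (rsync rewrites only changed files).
for host in ws1 ws2 ws3; do
    rsync -aH --delete "root@$host:/" "/backup/current/$host/"
done

# Afterwards, a deduplication pass would merge identical data across
# /backup via CoW sharing (placeholder command).
# dedup-daemon /backup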

This way I get both reduced storage and old copies (handy for user home
directories...).

With such a use case, the ability to use CoW while needing a similar
amount of space as hard links would be at least useful, if not highly
desirable.

That's why I asked whether it's possible to optimise the btrfs CoW
mechanism for identical files.

From my testing (directory 584MiB in size, 17395 files, Arch kernel
2.6.32.9, coreutils 8.4, btrfs-progs 0.19, 10GiB partition, default mkfs
and mount options):

cp -al
free space decrease: 6176KiB

cp -a --reflink=always
free space decrease: 23296KiB

and in the second run:

cp -al
free space decrease: 6064KiB

cp -a --reflink=always
free space decrease: 23324KiB

That's nearly 4 times more!
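
One way to reproduce that kind of measurement (a sketch only; the mount
point and test directory are made up, and the thread does not say exactly
how the free-space figures were taken):

# Sketch: measure roughly how much free space each copy method consumes on btrfs.
# /mnt/btrfs is an assumed mount point and "data" an assumed ~584MiB test tree.
cd /mnt/btrfs

avail() { sync; df -k . | awk 'NR==2 { print $4 }'; }   # free KiB on this filesystem

before=$(avail)
cp -al data data.hardlinks                   # hard-link copy
echo "cp -al decrease: $((before - $(avail))) KiB"

before=$(avail)
cp -a --reflink=always data data.reflinks    # CoW (reflink) copy
echo "cp -a --reflink=always decrease: $((before - $(avail))) KiB"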
-- 
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50