* Data Deduplication with the help of an online filesystem check
@ 2009-04-27  3:33 Thomas Glanzmann
  2009-04-27 13:37 ` Chris Mason
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-27  3:33 UTC (permalink / raw)
  To: linux-btrfs

Hello,
I would like to know if it would be possible to implement the following
feature in btrfs:

Have an online filesystem check which accounts for possible duplicated
data blocks (maybe with the help of the already implemented checksums: are
these checksums for the whole file or block based?) and deduplicates
these blocks?

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-27  3:33 Data Deduplication with the help of an online filesystem check Thomas Glanzmann
@ 2009-04-27 13:37 ` Chris Mason
  2009-04-28  5:22   ` Thomas Glanzmann
  2009-04-28 15:59   ` Thomas Glanzmann
  0 siblings, 2 replies; 67+ messages in thread
From: Chris Mason @ 2009-04-27 13:37 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: linux-btrfs

On Mon, 2009-04-27 at 05:33 +0200, Thomas Glanzmann wrote:
> Hello,
> I would like to know if it would be possible to implement the following
> feature in btrfs:
> 
> Have an online filesystem check which accounts for possible duplicated
> data blocks (maybe with the help of the already implemented checksums: are
> these checksums for the whole file or block based?) and deduplicates
> these blocks?

There is a btrfs ioctl to clone individual files, and this could be used
to implement an online dedup.  But, since it is happening from userland,
you can't lock out all of the other users of a given file.

So, the dedup application would be responsible for making sure a given
file was not being changed while the dedup scan was running.
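
For illustration, a userland tool might drive the clone ioctl roughly like
this (a minimal sketch assuming the btrfs_ioctl_clone_range_args layout
from the btrfs ioctl header; error handling omitted, and nothing here locks
out other users of the file):

    #include <linux/types.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>  /* BTRFS_IOC_CLONE_RANGE, if your headers ship it */

    /* Share src's extents into dst -- the reflink primitive a dedup tool
     * would build on.  Returns 0 on success, -1 (with errno set) on error. */
    static int reflink_range(int dst_fd, int src_fd, __u64 src_off,
                             __u64 len, __u64 dst_off)
    {
            struct btrfs_ioctl_clone_range_args args = {
                    .src_fd      = src_fd,
                    .src_offset  = src_off,
                    .src_length  = len,
                    .dest_offset = dst_off,
            };
            return ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, &args);
    }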

-chris




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-27 13:37 ` Chris Mason
@ 2009-04-28  5:22   ` Thomas Glanzmann
  2009-04-28 10:02     ` Chris Mason
  2009-04-28 15:59   ` Thomas Glanzmann
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28  5:22 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hello Chris,

> There is a btrfs ioctl to clone individual files, and this could be used
> to implement an online dedup.  But, since it is happening from userland,
> you can't lock out all of the other users of a given file.

> So, the dedup application would be responsible for making sure a given
> file was not being changed while the dedup scan was running.

I see, does that mean that I cannot do "dedup" for files that are
currently opened by a userland program?

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28  5:22   ` Thomas Glanzmann
@ 2009-04-28 10:02     ` Chris Mason
  2009-04-28 13:49       ` Andrey Kuzmin
  2009-04-28 16:10       ` Anthony Roberts
  0 siblings, 2 replies; 67+ messages in thread
From: Chris Mason @ 2009-04-28 10:02 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: linux-btrfs

On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > There is a btrfs ioctl to clone individual files, and this could be used
> > to implement an online dedup.  But, since it is happening from userland,
> > you can't lock out all of the other users of a given file.
> 
> > So, the dedup application would be responsible for making sure a given
> > file was not being changed while the dedup scan was running.
> 
> I see, does that mean that I cannot do "dedup" for files that are
> currently opened by a userland program?

No, but it does mean the dedup done from userland is racy.  Picture
this:

process A:
    create some_file # some_file matches the contents of another file

dedup proc:
    check some_file
    decide to start block dedup

process A:
     modify some_file

dedup proc:
     progress through block dedup

So, this will happily replace blocks in some_file with the dedup blocks.
But there's no way to atomically swap them.

We could create new ioctls for this, basically a variant of the clone
file ioctl that makes sure a given set of pages has a given sum (or
strict memory contents) before doing the swap.

But they don't exist yet.
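
To make that concrete, the argument block of such a compare-and-swap clone
ioctl might look something like this (purely hypothetical, as said --
nothing like it exists yet):

    #include <linux/types.h>

    /* HYPOTHETICAL sketch only.  The kernel would re-check the checksum
     * (or the raw bytes) of the destination range under its own locks,
     * and only then swap in the shared extents. */
    struct btrfs_ioctl_clone_verify_args {
            __s64 src_fd;            /* file to take the shared blocks from */
            __u64 src_offset;        /* start of the range in the source */
            __u64 length;            /* number of bytes to dedup */
            __u64 dest_offset;       /* start of the range being replaced */
            __u8  expected_sum[32];  /* dest must still sum to this */
    };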

-chris




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 10:02     ` Chris Mason
@ 2009-04-28 13:49       ` Andrey Kuzmin
  2009-04-28 13:58         ` Chris Mason
  2009-04-28 13:58         ` jim owens
  2009-04-28 16:10       ` Anthony Roberts
  1 sibling, 2 replies; 67+ messages in thread
From: Andrey Kuzmin @ 2009-04-28 13:49 UTC (permalink / raw)
  To: Chris Mason; +Cc: Thomas Glanzmann, linux-btrfs

On Tue, Apr 28, 2009 at 2:02 PM, Chris Mason <chris.mason@oracle.com> wrote:
> On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote:
>> Hello Chris,
>>
>> > There is a btrfs ioctl to clone individual files, and this could be used
>> > to implement an online dedup.  But, since it is happening from userland,
>> > you can't lock out all of the other users of a given file.
>>
>> > So, the dedup application would be responsible for making sure a given
>> > file was not being changed while the dedup scan was running.
>>
>> I see, does that mean that I cannot do "dedup" for files that are
>> currently opened by a userland program?
>
> No, but it does mean the dedup done from userland is racy.  Picture

Race disappears if (background) dedupe is run against snapshot(s).

Regards,
Andrey

> this:
>
> process A:
>     create some_file # some_file matches the contents of another file
>
> dedup proc:
>     check some_file
>     decide to start block dedup
>
> process A:
>      modify some_file
>
> dedup proc:
>      progress through block dedup
>
> So, this will happily replace blocks in some_file with the dedup blocks.
> But there's no way to atomically swap them.
>
> We could create new ioctls for this, basically a variant of the clone
> file ioctl that makes sure a given set of pages has a given sum (or
> strict memory contents) before doing the swap.
>
> But they don't exist yet.
>
> -chris


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 13:49       ` Andrey Kuzmin
@ 2009-04-28 13:58         ` Chris Mason
  2009-04-28 14:04           ` Thomas Glanzmann
  2009-04-28 13:58         ` jim owens
  1 sibling, 1 reply; 67+ messages in thread
From: Chris Mason @ 2009-04-28 13:58 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: Thomas Glanzmann, linux-btrfs

On Tue, 2009-04-28 at 17:49 +0400, Andrey Kuzmin wrote:
> On Tue, Apr 28, 2009 at 2:02 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote:
> >> Hello Chris,
> >>
> >> > There is a btrfs ioctl to clone individual files, and this could be used
> >> > to implement an online dedup.  But, since it is happening from userland,
> >> > you can't lock out all of the other users of a given file.
> >>
> >> > So, the dedup application would be responsible for making sure a given
> >> > file was not being changed while the dedup scan was running.
> >>
> >> I see, does that mean that I cannot do "dedup" for files that are
> >> currently opened by a userland program?
> >
> > No, but it does mean the dedup done from userland is racy.  Picture
> 
> Race disappears if (background) dedupe is run against snapshot(s).
> 

True, but then you're only changing the blocks pointed to by the
snapshot.  The 'master' copy still points to the unduplicated blocks.

-chris




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 13:49       ` Andrey Kuzmin
  2009-04-28 13:58         ` Chris Mason
@ 2009-04-28 13:58         ` jim owens
  1 sibling, 0 replies; 67+ messages in thread
From: jim owens @ 2009-04-28 13:58 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: Chris Mason, Thomas Glanzmann, linux-btrfs

Andrey Kuzmin wrote:
> On Tue, Apr 28, 2009 at 2:02 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote:
>>> Hello Chris,
>>>
>>>> There is a btrfs ioctl to clone individual files, and this could be used
>>>> to implement an online dedup.  But, since it is happening from userland,
>>>> you can't lock out all of the other users of a given file.
>>>> So, the dedup application would be responsible for making sure a given
>>>> file was not being changed while the dedup scan was running.
>>> I see, does that mean that I cannot do "dedup" for files that are
>>> currently opened by a userland program?
>> No, but it does mean the dedup done from userland is racy.  Picture
> 
> Race disappears if (background) dedupe is run against snapshot(s).

only if the snapshots are made read-only.

btrfs has writable snapshots.

jim

> 
> Regards,
> Andrey
> 
>> this:
>>
>> process A:
>>    create some_file # some_file matches the contents of another file
>>
>> dedup proc:
>>    check some_file
>>    decide to start block dedup
>>
>> process A:
>>     modify some_file
>>
>> dedup proc:
>>     progress through block dedup
>>
>> So, this will happily replace blocks in some_file with the dedup blocks.
>> But there's no way to atomically swap them.
>>
>> We could create new ioctls for this, basically a variant of the clone
>> file ioctl that makes sure a given set of pages has a given sum (or
>> strict memory contents) before doing the swap.
>>
>> But they don't exist yet.
>>
>> -chris



* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 13:58         ` Chris Mason
@ 2009-04-28 14:04           ` Thomas Glanzmann
  2009-04-28 17:21             ` Chris Mason
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 14:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrey Kuzmin, linux-btrfs

Chris,
what blocksizes can I choose with btrfs? Do you think that it is
possible for an outsider like me to submit patches to btrfs which enable
dedup in three full-time days?

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-27 13:37 ` Chris Mason
  2009-04-28  5:22   ` Thomas Glanzmann
@ 2009-04-28 15:59   ` Thomas Glanzmann
  2009-04-28 16:04     ` Tomasz Chmielewski
  2009-04-28 17:23     ` Chris Mason
  1 sibling, 2 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 15:59 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hello,
I have a few more questions about this:

        - Is there a checksum for every block in btrfs?

        - Is it possible to retrieve these checksums from userland?

        - Is it possible to use a blocksize of 4 or 8 kbyte with btrfs?

To get a bit more specific: if it is relatively easy to identify and
deduplicate blocks, and if btrfs supports relatively small block sizes
like 4 / 8 kbyte, it is the perfect candidate for VMs. To give you some
data: I took 300 Gbyte of VM data (note this is the disk space actually
in use, not the provisioned space) from VMs running different operating
systems and used a perl script to identify how much data could be
deduped given a specific blocksize:

300 Gbyte of used storage from several production VMs running the
following operating systems:
\begin{itemize}
        \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5)
        \item SuSE Linux 32 and 64 Bit (SLES 9 and 10)
        \item Windows 2003 Std. Edition 32 Bit
        \item Windows 2003 Enterprise Edition 64 Bit
\end{itemize}
\begin{tabular}{r|r}
blocksize & Deduplicated Data \\
\hline
128k      &  29.9 G \\
 64k      &  41.3 G \\
 32k      &  59.2 G \\
 16k      &  82   G \\
  8k      & 112   G \\
\end{tabular}

Bottom line: with an 8k blocksize you can dedupe more than 33% of the
data on a production set of VMs.
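
For anyone who wants to repeat the measurement, it boils down to the
following (sketched here in C instead of the perl I used; it relies on
OpenSSL's MD5() and a naive linear table, so it is only suitable for small
experiments):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <openssl/md5.h>

    #define BLKSZ 8192   /* blocksize under test, e.g. 8k */

    int main(int argc, char **argv)
    {
        unsigned char (*seen)[MD5_DIGEST_LENGTH] = NULL;
        size_t nseen = 0, cap = 0;
        long long total = 0, dup = 0;

        for (int i = 1; i < argc; i++) {
            FILE *f = fopen(argv[i], "rb");
            unsigned char buf[BLKSZ], sum[MD5_DIGEST_LENGTH];
            size_t n, j;

            if (!f)
                continue;
            while ((n = fread(buf, 1, BLKSZ, f)) > 0) {
                MD5(buf, n, sum);   /* digest of this block */
                total++;
                for (j = 0; j < nseen; j++)
                    if (!memcmp(seen[j], sum, sizeof sum))
                        break;
                if (j < nseen) {    /* block seen before: dedup candidate */
                    dup++;
                    continue;
                }
                if (nseen == cap) { /* grow the table of known digests */
                    cap = cap ? cap * 2 : 1024;
                    seen = realloc(seen, cap * sizeof *seen);
                    if (!seen)
                        return 1;
                }
                memcpy(seen[nseen++], sum, sizeof sum);
            }
            fclose(f);
        }
        printf("%lld of %lld blocks (%.1f%%) are duplicates\n",
               dup, total, total ? 100.0 * dup / total : 0.0);
        return 0;
    }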

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 15:59   ` Thomas Glanzmann
@ 2009-04-28 16:04     ` Tomasz Chmielewski
  2009-04-28 17:29       ` Edward Shishkin
  2009-04-28 17:32       ` Thomas Glanzmann
  2009-04-28 17:23     ` Chris Mason
  1 sibling, 2 replies; 67+ messages in thread
From: Tomasz Chmielewski @ 2009-04-28 16:04 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: Chris Mason, linux-btrfs

Thomas Glanzmann schrieb:

> 300 Gbyte of used storage from several production VMs running the
> following operating systems:
> \begin{itemize}
>         \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5)
>         \item SuSE Linux 32 and 64 Bit (SLES 9 and 10)
>         \item Windows 2003 Std. Edition 32 Bit
>         \item Windows 2003 Enterprise Edition 64 Bit
> \end{itemize}
> \begin{tabular}{r|r}
> blocksize & Deduplicated Data \\
> \hline
> 128k      &  29.9 G \\
>  64k      &  41.3 G \\
>  32k      &  59.2 G \\
>  16k      &  82   G \\
>   8k      & 112   G \\
> \end{tabular}
> 
> Bottom line: with an 8k blocksize you can dedupe more than 33% of the
> data on a production set of VMs.

Did you just compare checksums, or did you also compare the data "bit 
after bit" if the checksums matched?


-- 
Tomasz Chmielewski


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 10:02     ` Chris Mason
  2009-04-28 13:49       ` Andrey Kuzmin
@ 2009-04-28 16:10       ` Anthony Roberts
  1 sibling, 0 replies; 67+ messages in thread
From: Anthony Roberts @ 2009-04-28 16:10 UTC (permalink / raw)
  To: Chad Clark; +Cc: linux-btrfs

I don't know if this strikes me as a 1.0 feature, but it does seem like
something that would need changes to the disk format. :)

Some ioctls occurred to me that might be helpful here, without unduly
loading up the kernel with dedup logic:

-A speculative dedup that you could open, do whatever verification you need
(it would be up to the application), and then commit. I believe there's
already an ioctl to retrieve checksums, or your app could compare the bytes
if you were paranoid; this approach would be agnostic to that. Your dedup
handle would be automatically invalidated if any changes happened in the
specified ranges of bytes in the specified files, so it would be safe.

-A way to subscribe to writes in a subvolume, something similar to the
inotify API. You'd get structs that would tell you the
file/offset/size/csum of writes, allowing you to dedup on the fly without
having to crawl the whole filesystem. I'm not sure about the feasibility of
dealing with the volume of traffic you'd get from something like that, but
if it's something where you can keep a hash table in memory it might be
doable.
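
Roughly what such an event record might look like (a hypothetical,
inotify-flavoured sketch; none of this exists):

    #include <linux/types.h>

    /* HYPOTHETICAL: one record per write in the subscribed subvolume. */
    struct subvol_write_event {
            __u64 ino;      /* inode that was written */
            __u64 offset;   /* byte offset of the write */
            __u64 len;      /* length of the write */
            __u32 csum;     /* checksum of the new data (crc32c today) */
    };

    /* A dedup daemon would drain these from a queue, look each csum up
     * in its in-memory hash table, and trigger a verified dedup on a hit. */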

-Anthony

On Tue, 28 Apr 2009 06:02:51 -0400, Chris Mason <chris.mason@oracle.com>
wrote:
> On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote:
>> Hello Chris,
>> 
>> > There is a btrfs ioctl to clone individual files, and this could be
>> > used
>> > to implement an online dedup.  But, since it is happening from
>> > userland,
>> > you can't lock out all of the other users of a given file.
>> 
>> > So, the dedup application would be responsible for making sure a given
>> > file was not being changed while the dedup scan was running.
>> 
>> I see, does that mean that I cannot do "dedup" for files that are
>> currently opened by a userland program?
> 
> No, but it does mean the dedup done from userland is racy.  Picture
> this:
> 
> process A:
>     create some_file # some_file matches the contents of another file
> 
> dedup proc:
>     check some_file
>     decide to start block dedup
> 
> process A:
>      modify some_file
> 
> dedup proc:
>      progress through block dedup
> 
> So, this will happily replace blocks in some_file with the dedup blocks.
> But there's no way to atomically swap them.
> 
> We could create new ioctls for this, basically a variant of the clone
> file ioctl that makes sure a given set of pages has a given sum (or
> strict memory contents) before doing the swap.
> 
> But they don't exist yet.
> 
> -chris


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 14:04           ` Thomas Glanzmann
@ 2009-04-28 17:21             ` Chris Mason
  2009-04-28 20:10               ` Thomas Glanzmann
  0 siblings, 1 reply; 67+ messages in thread
From: Chris Mason @ 2009-04-28 17:21 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: Andrey Kuzmin, linux-btrfs

On Tue, 2009-04-28 at 16:04 +0200, Thomas Glanzmann wrote:
> Chris,
> what blocksizes can I choose with btrfs? 

Right now the blocksize can only be the same as the page size.  For this
external dedup program you have in mind, you could use any multiple of
the page size.

> Do you think that it is
> possible for an outsider like me to submit patches to btrfs which enable
> dedup in three full-time days?

Three days is probably not quite enough ;)  I'd honestly prefer the
dedup happen entirely in the kernel in a setup similar to the
compression code.

But, that would use _lots_ of CPU, so an offline dedup is probably a
good feature even if we have transparent dedup.

You'd have to:

Wire up a userland database that stores checksums and points to
file,offset tuples

Make the ioctl to replace a given file extent if and only if the file
contents match a given checksum over a range of bytes.  The ioctl should
be able to optionally do a byte compare of the src and destination pages
to make 100% sure the data is really the same.

Make another ioctl to report on which parts of a file have changed since
a given transaction.  This will greatly reduce the time spent scanning
for new blocks.

It isn't painfully hard, but you're looking at about 3 weeks total time.
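
For the first step, the record the userland database keeps could be as
small as this (a sketch; the layout is an assumption, not an existing
btrfs structure):

    #include <linux/types.h>

    /* One record in the userland dedup index: a strong checksum keyed to
     * one known location of that block's contents. */
    struct dedup_db_entry {
            unsigned char csum[16];  /* e.g. md5 of the block */
            __u64 ino;               /* file holding a copy of the block */
            __u64 offset;            /* byte offset of the block there */
    };

    /* Keyed by csum; a second block producing the same csum yields a
     * candidate pair to hand to the verifying ioctl above. */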

-chris




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 15:59   ` Thomas Glanzmann
  2009-04-28 16:04     ` Tomasz Chmielewski
@ 2009-04-28 17:23     ` Chris Mason
  2009-04-28 17:37       ` Thomas Glanzmann
  2009-04-28 20:24       ` Thomas Glanzmann
  1 sibling, 2 replies; 67+ messages in thread
From: Chris Mason @ 2009-04-28 17:23 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: linux-btrfs

On Tue, 2009-04-28 at 17:59 +0200, Thomas Glanzmann wrote:
> Hello,
> I have a few more questions about this:
> 
>         - Is there a checksum for every block in btrfs?

Yes, but they are only crc32c.

> 
>         - Is it possible to retrieve these checksums from userland?
> 

Not today.  The sage developers sent a patch to make an ioctl for this,
but since it was hard coded to crc32c I haven't taken it yet.

>         - Is it possible to use a blocksize of 4 or 8 kbyte with btrfs?
> 

Yes, btrfs uses extents but for the purposes of dedup, 4k blocksizes are
fine.

> To get a bit more specific: if it is relatively easy to identify and
> deduplicate blocks, and if btrfs supports relatively small block sizes
> like 4 / 8 kbyte, it is the perfect candidate for VMs. To give you some
> data: I took 300 Gbyte of VM data (note this is the disk space actually
> in use, not the provisioned space) from VMs running different operating
> systems and used a perl script to identify how much data could be
> deduped given a specific blocksize:
> 

Virtual machines are the ideal dedup workload.  But, you do get a big
portion of the dedup benefits by just starting with a common image and
cloning it instead of doing copies of each vm.

-chris




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 16:04     ` Tomasz Chmielewski
@ 2009-04-28 17:29       ` Edward Shishkin
  2009-04-28 17:34         ` Thomas Glanzmann
  2009-04-28 17:32       ` Thomas Glanzmann
  1 sibling, 1 reply; 67+ messages in thread
From: Edward Shishkin @ 2009-04-28 17:29 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: Thomas Glanzmann, Chris Mason, linux-btrfs

Tomasz Chmielewski wrote:
> Thomas Glanzmann schrieb:
>
>> 300 Gbyte of used storage from several production VMs running the
>> following operating systems:
>> \begin{itemize}
>>         \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5)
>>         \item SuSE Linux 32 and 64 Bit (SLES 9 and 10)
>>         \item Windows 2003 Std. Edition 32 Bit
>>         \item Windows 2003 Enterprise Edition 64 Bit
>> \end{itemize}
>> \begin{tabular}{r|r}
>> blocksize & Deduplicated Data \\
>> \hline
>> 128k      &  29.9 G \\
>>  64k      &  41.3 G \\
>>  32k      &  59.2 G \\
>>  16k      &  82   G \\
>>   8k      & 112   G \\
>> \end{tabular}
>>
>> Bottom line: with an 8k blocksize you can dedupe more than 33% of the
>> data on a production set of VMs.
>
> Did you just compare checksums, 

I wouldn't rely on crc32: it is not a strong hash.
Such deduplication can lead to various problems,
including security ones.

> or did you also compare the data "bit after bit" if the checksums 
> matched?



* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 16:04     ` Tomasz Chmielewski
  2009-04-28 17:29       ` Edward Shishkin
@ 2009-04-28 17:32       ` Thomas Glanzmann
  2009-04-28 17:41         ` Michael Tharp
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 17:32 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: Chris Mason, linux-btrfs

Hello Tomasz,

> Did you just compare checksums, or did you also compare the data "bit
> after bit" if the checksums matched?

no, I just used the md5 checksum. And even if I hit a hash collision,
which is highly unlikely, it still gives a good ballpark figure.

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:29       ` Edward Shishkin
@ 2009-04-28 17:34         ` Thomas Glanzmann
  2009-04-28 17:38           ` Chris Mason
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 17:34 UTC (permalink / raw)
  To: Edward Shishkin; +Cc: Tomasz Chmielewski, Chris Mason, linux-btrfs

Hello,

> I wouldn't rely on crc32: it is not a strong hash.
> Such deduplication can lead to various problems,
> including security ones.

sure thing, did you think of replacing crc32 with sha1 or md5? Is this
even possible at the moment with btrfs (is there enough space reserved
so that the change can be done without changing the filesystem layout)?

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:23     ` Chris Mason
@ 2009-04-28 17:37       ` Thomas Glanzmann
  2009-04-28 17:43         ` Chris Mason
  2009-04-28 20:24       ` Thomas Glanzmann
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 17:37 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hello Chris,

> > Is there a checksum for every block in btrfs?

> Yes, but they are only crc32c.

I see, is it easily possible to exchange that with sha-1 or md5?

> > Is it possible to retrieve these checksums from userland?

> Not today.  The sage developers sent a patch to make an ioctl for
> this, but since it was hard coded to crc32c I haven't taken it yet.

I see.

> Yes, btrfs uses extents but for the purposes of dedup, 4k blocksizes
> are fine.

Does that mean that I can dedup 4k blocks even if you use extents?

> Virtual machines are the ideal dedup workload.  But, you do get a big
> portion of the dedup benefits by just starting with a common image and
> cloning it instead of doing copies of each vm.

True, the operating system can be almost completely deduped but as soon
as you start patching you lose the benefit.

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:34         ` Thomas Glanzmann
@ 2009-04-28 17:38           ` Chris Mason
  2009-04-28 17:43             ` Thomas Glanzmann
  2009-04-28 17:45             ` Heinz-Josef Claes
  0 siblings, 2 replies; 67+ messages in thread
From: Chris Mason @ 2009-04-28 17:38 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote:
> Hello,
> 
> > I wouldn't rely on crc32: it is not a strong hash.
> > Such deduplication can lead to various problems,
> > including security ones.
> 
> sure thing, did you think of replacing crc32 with sha1 or md5? Is this
> even possible at the moment with btrfs (is there enough space reserved
> so that the change can be done without changing the filesystem layout)?

It is possible, there's room in the metadata for about 4k of
checksum for each 4k of data.  The initial btrfs code used sha256, but
the real limiting factor is the CPU time used.

-chris




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:32       ` Thomas Glanzmann
@ 2009-04-28 17:41         ` Michael Tharp
  2009-04-28 20:14           ` Thomas Glanzmann
  2009-05-04 14:29           ` Ric Wheeler
  0 siblings, 2 replies; 67+ messages in thread
From: Michael Tharp @ 2009-04-28 17:41 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: Tomasz Chmielewski, Chris Mason, linux-btrfs

Thomas Glanzmann wrote:
> no, I just used the md5 checksum. And even if I hit a hash collision,
> which is highly unlikely, it still gives a good ballpark figure.

I'd start with a crc32 and/or MD5 to find candidate blocks, then do a 
bytewise comparison before actually merging them. Even the risk of an 
accidental collision is too high, and considering there are plenty of 
birthday-style MD5 attacks it would not be extraordinarily difficult to 
construct a block that collides with e.g. a system library.

Keep in mind that although digests do a fairly good job of making unique 
identifiers for larger chunks of data, they can only hold so many unique 
combinations. Considering you're comparing blocks of a few kibibytes in 
size it's best to just do a foolproof comparison. There's nothing wrong 
with using a checksum/digest as a screening mechanism though.
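
In code the rule is simply: a digest match nominates a candidate, the byte
compare gives the verdict. A sketch (names are illustrative, not an
existing API):

    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Return 1 only if the two ranges are byte-for-byte identical.
     * Candidates whose digests match are merged only when this says so. */
    static int blocks_really_equal(int fd_a, off_t off_a,
                                   int fd_b, off_t off_b, size_t len)
    {
            unsigned char a[4096], b[4096];

            while (len) {
                    size_t n = len < sizeof a ? len : sizeof a;

                    if (pread(fd_a, a, n, off_a) != (ssize_t)n ||
                        pread(fd_b, b, n, off_b) != (ssize_t)n)
                            return 0;   /* short read: play it safe */
                    if (memcmp(a, b, n))
                            return 0;   /* digest collision caught here */
                    off_a += n;
                    off_b += n;
                    len -= n;
            }
            return 1;
    }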

-- m. tharp


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:37       ` Thomas Glanzmann
@ 2009-04-28 17:43         ` Chris Mason
  2009-04-28 20:15           ` Thomas Glanzmann
  2009-04-28 21:19           ` Dmitri Nikulin
  0 siblings, 2 replies; 67+ messages in thread
From: Chris Mason @ 2009-04-28 17:43 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: linux-btrfs

On Tue, 2009-04-28 at 19:37 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > > Is there a checksum for every block in btrfs?
> 
> > Yes, but they are only crc32c.
> 
> I see, is it easily possible to exchange that with sha-1 or md5?
> 

Yes, but for the purposes of dedup, it's not exactly what you want.  You
want an index by checksum, and the current btrfs code indexes by logical
byte number in the disk.

So you need an extra index either way.  It makes sense to keep the
crc32c csums for fast verification of the data read from disk and only
use the expensive csums for dedup. 

> > > Is it possible to retrieve these checksums from userland?
> 
> > Not today.  The sage developers sent a patch to make an ioctl for
> > this, but since it was hard coded to crc32c I haven't taken it yet.
> 
> I see.
> 
> > Yes, btrfs uses extents but for the purposes of dedup, 4k blocksizes
> > are fine.
> 
> Does that mean that I can dedup 4k blocks even if you use extents?

Yes.

> 
> > Virtual machines are the ideal dedup workload.  But, you do get a big
> > portion of the dedup benefits by just starting with a common image and
> > cloning it instead of doing copies of each vm.
> 
> True, the operating system can be almost completely deduped but as soon
> as you start patching you lose the benefit.

-chris




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:38           ` Chris Mason
@ 2009-04-28 17:43             ` Thomas Glanzmann
  2009-04-28 17:45             ` Heinz-Josef Claes
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 17:43 UTC (permalink / raw)
  To: Chris Mason; +Cc: Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello,

> It is possible, there's room in the metadata for about 4k of
> checksum for each 4k of data.  The initial btrfs code used sha256, but
> the real limiting factor is the CPU time used.

I see. There are very efficient md5 implementations out there,
especially if the code is written in assembler.
        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:38           ` Chris Mason
  2009-04-28 17:43             ` Thomas Glanzmann
@ 2009-04-28 17:45             ` Heinz-Josef Claes
  2009-04-28 20:16               ` Thomas Glanzmann
  2009-05-06 15:16               ` Sander
  1 sibling, 2 replies; 67+ messages in thread
From: Heinz-Josef Claes @ 2009-04-28 17:45 UTC (permalink / raw)
  To: Chris Mason
  Cc: Thomas Glanzmann, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Am Dienstag, 28. April 2009 19:38:24 schrieb Chris Mason:
> On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote:
> > Hello,
> >
> > > I wouldn't rely on crc32: it is not a strong hash.
> > > Such deduplication can lead to various problems,
> > > including security ones.
> >
> > sure thing, did you think of replacing crc32 with sha1 or md5? Is this
> > even possible at the moment with btrfs (is there enough space reserved
> > so that the change can be done without changing the filesystem layout)?
>
> It is possible, there's room in the metadata for about 4k of
> checksum for each 4k of data.  The initial btrfs code used sha256, but
> the real limiting factor is the CPU time used.
>
> -chris
>
It's not only cpu time, it's also memory. You need 32 bytes for each 4k
block, and they need to be in RAM for performance reasons.
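
To put a number on it (assuming sha256-sized sums): 1 Tbyte / 4 kbyte =
268,435,456 blocks, and 268,435,456 * 32 bytes = 8 Gbyte of checksums per
terabyte of data that would have to stay resident.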

hjc




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:21             ` Chris Mason
@ 2009-04-28 20:10               ` Thomas Glanzmann
  2009-04-28 20:29                 ` Thomas Glanzmann
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 20:10 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrey Kuzmin, linux-btrfs

Hello Chris,

> Right now the blocksize can only be the same as the page size.  For
> this external dedup program you have in mind, you could use any
> multiple of the page size.

perfect. Exactly what I need.

> Three days is probably not quite enough ;)  I'd honestly prefer the
> dedup happen entirely in the kernel in a setup similar to the
> compression code.

I see. I think that it wouldn't scale, because then all the checksums
need to be stored in memory, or at least in an efficient B-tree. For a
1 Tbyte filesystem with 4 kbyte blocks that would mean more than 5 G (!)
of memory (assuming a 16 byte checksum and a 4 byte block identifier,
and leaving out the B-tree overhead for fast searching).

> But, that would use _lots_ of CPU, so an offline dedup is probably a
> good feature even if we have transparent dedup.

I think that is the right way to go.

> Wire up a userland database that stores checksums and points to
> file, offset tuples

exactly. And if there is a way to retrieve the already calculated
checksums from kernel land, then it would be possible to implement a
"system call" that gives the kernel a hint about a possible duplicated
block (like providing a list of lists of blocks to the kernel that might
be duplicated because they have the same checksum). Then the kernel code
could dedup the block after comparing it byte by byte.
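
The hint interface could be as simple as this (a purely hypothetical
sketch, nothing like it exists):

    #include <linux/types.h>

    /* HYPOTHETICAL: one candidate location of a possibly duplicated block. */
    struct dedup_candidate {
            __u64 ino;      /* file containing the candidate block */
            __u64 offset;   /* byte offset of the block */
    };

    /* One set of candidates that share a checksum; the kernel would
     * byte-compare them under its own locks and merge only real matches. */
    struct dedup_hint_args {
            __u64 count;
            struct dedup_candidate *candidates;
    };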

> Make the ioctl to replace a given file extent if and only if the file
> contents match a given checksum over a range of bytes.  The ioctl should
> be able to optionally do a byte compare of the src and destination pages
> to make 100% sure the data is really the same.

Exactly.

> Make another ioctl to report on which parts of a file have changed
> since a given transaction. This will greatly reduce the time spent
> scanning for new blocks.

That would be perfect. Even better would be a system call that reports
all the blocks that have been touched since a specific transaction, like
a bitmap that sets a "1" for every block that has been touched.

> It isn't painfully hard, but you're looking at about 3 weeks total
> time.

I see, so no quick hack to get it going.

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:41         ` Michael Tharp
@ 2009-04-28 20:14           ` Thomas Glanzmann
  2009-05-04 14:29           ` Ric Wheeler
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 20:14 UTC (permalink / raw)
  To: Michael Tharp; +Cc: Tomasz Chmielewski, Chris Mason, linux-btrfs

Hello Michael,

> I'd start with a crc32 and/or MD5 to find candidate blocks, then do a 
> bytewise comparison before actually merging them. Even the risk of an 
> accidental collision is too high, and considering there are plenty of 
> birthday-style MD5 attacks it would not be extraordinarily difficult to 
> construct a block that collides with e.g. a system library.

I agree. But using a crc32 to identify blocks might give too many false
positives; someone needs to try that in practice and run some statistics
on real data to tell whether that is the case.

> Keep in mind that although digests do a fairly good job of making
> unique identifiers for larger chunks of data, they can only hold so
> many unique combinations. Considering you're comparing blocks of a few
> kibibytes in size it's best to just do a foolproof comparison. There's
> nothing wrong with using a checksum/digest as a screening mechanism
> though.

Again, absolutely agreed.

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:43         ` Chris Mason
@ 2009-04-28 20:15           ` Thomas Glanzmann
  2009-04-28 21:19           ` Dmitri Nikulin
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 20:15 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hello Chris,

> Yes, but for the purposes of dedup, it's not exactly what you want.
> You want an index by checksum, and the current btrfs code indexes by
> logical byte number in the disk.

that would be good for online dedup, but in practice that is not going
to work, or at least I don't see how.

> So you need an extra index either way.  It makes sense to keep the
> crc32c csums for fast verification of the data read from disk and only
> use the expensive csums for dedup. 

I think that this should be part of a userland program. It can take all
the time it wants during weekends to find dedup blocks.

> > Does that mean that I can dedup 4k blocks even if you use extents?

> Yes.

Perfect. :-)

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:45             ` Heinz-Josef Claes
@ 2009-04-28 20:16               ` Thomas Glanzmann
  2009-04-28 20:36                 ` Heinz-Josef Claes
  2009-05-06 15:16               ` Sander
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 20:16 UTC (permalink / raw)
  To: Heinz-Josef Claes
  Cc: Chris Mason, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello Heinz,

> It's not only cpu time, it's also memory. You need 32 bytes for each 4k
> block, and they need to be in RAM for performance reasons.

Exactly, and that is not going to scale.

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:23     ` Chris Mason
  2009-04-28 17:37       ` Thomas Glanzmann
@ 2009-04-28 20:24       ` Thomas Glanzmann
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 20:24 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hello,

> Not today.  The sage developers sent a patch to make an ioctl for this,
> but since it was hard coded to crc32c I haven't taken it yet.

could you send me the patch, I would love to make it work for arbitrary
checksums and resubmit.

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 20:10               ` Thomas Glanzmann
@ 2009-04-28 20:29                 ` Thomas Glanzmann
  0 siblings, 0 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 20:29 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrey Kuzmin, linux-btrfs

Hello,

* Thomas Glanzmann <thomas@glanzmann.de> [090428 22:10]:
> exactly. And if there is a way to retrieve the already calculated
> checksums from kernel land, then it would be possible to implement a
> "system call" that gives the kernel a hint about a possible duplicated
> block (like providing a list of lists of blocks to the kernel that might
> be duplicated because they have the same checksum). Then the kernel code
> could dedup the block after comparing it byte by byte.

how hard would it be to implement a system call that gets a list of
blocks and checks whether it is possible to dedup them (especially with
the implemented RAID code, etc.)?

It would probably be smarter to report one set of blocks at a time
instead of a list of sets of blocks, because otherwise the system call
would take forever to finish. That is not really an issue in itself, but
I assume that the dedup system call needs to take a global lock that
most probably blocks any I/O operations?

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 20:16               ` Thomas Glanzmann
@ 2009-04-28 20:36                 ` Heinz-Josef Claes
  2009-04-28 20:52                   ` Thomas Glanzmann
  0 siblings, 1 reply; 67+ messages in thread
From: Heinz-Josef Claes @ 2009-04-28 20:36 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Chris Mason, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Am Dienstag, 28. April 2009 22:16:19 schrieb Thomas Glanzmann:
> Hello Heinz,
>
> > It's not only cpu time, it's also memory. You need 32 bytes for each 4k
> > block, and they need to be in RAM for performance reasons.
>
> Exactly, and that is not going to scale.
>
>         Thomas


Hi Thomas,

I wrote a backup tool which uses dedup, so I know a little bit about the 
problem and the performance impact if the checksums are not in memory 
(optionally in that tool).
http://savannah.gnu.org/projects/storebackup

Dedup really helps a lot - I think more than I could imagine before I
was engaged in this kind of backup. To give a simple example: you would
not believe how many identical files there are in a filesystem.

EMC has very big boxes for this with lots of RAM in them.
I think the first problem which has to be solved is the memory problem.
Perhaps something asynchronous that finds identical blocks and stores
the checksums on disk?

Heinz


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 20:36                 ` Heinz-Josef Claes
@ 2009-04-28 20:52                   ` Thomas Glanzmann
  2009-04-28 20:58                     ` Chris Mason
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 20:52 UTC (permalink / raw)
  To: Heinz-Josef Claes
  Cc: Chris Mason, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello Heinz,

> I wrote a backup tool which uses dedup, so I know a little bit about
> the problem and the performance impact if the checksums are not in
> memory (optionally in that tool).
> http://savannah.gnu.org/projects/storebackup

> Dedup really helps a lot - I think more than I could imagine before I
> was engaged in this kind of backup. To give a simple example: you would
> not believe how many identical files there are in a filesystem.

I saw it with my own eyes (see my previous e-mail).

> EMC has very big boxes for this with lots of RAM in them.  I think the
> first problem which has to be solved is the memory problem.  Perhaps
> something asynchronous that finds identical blocks and stores the
> checksums on disk?

I think we already have a very nice solution in order to solve that
issue:

        - Implement a system call that reports all checksums and unique
          block identifiers for all stored blocks.

        - Implement another system call that reports all checksums and
          unique identifiers for all stored blocks since the last
          report. This can be easily implemented:

          Use a block bitmap: for every block on the filesystem use one
          bit. If the block is modified, set the bit to one; when the
          bitmap is retrieved, simply zero it out:

        Assuming a 4 kbyte block size that would mean for a 1 Tbyte
        filesystem:

        1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
        be saved to disk from time to time and be restored on startup).

        - Write a userland program that identifies duplicated blocks
          (for example by counting the occurrence of a checksum using
          Tokyo Cabinet [1] as persistent storage)

        - Implement a system call that gets hints from userland about
          blocks that might be deduplicated, and dedup them after
          verifying that they in fact match on a byte-by-byte basis.

                Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 20:52                   ` Thomas Glanzmann
@ 2009-04-28 20:58                     ` Chris Mason
  2009-04-28 21:12                       ` Thomas Glanzmann
  2009-04-29  0:06                       ` Bron Gondwana
  0 siblings, 2 replies; 67+ messages in thread
From: Chris Mason @ 2009-04-28 20:58 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Tue, 2009-04-28 at 22:52 +0200, Thomas Glanzmann wrote:
> Hello Heinz,
> 
> > I wrote a backup tool which uses dedup, so I know a little bit about
> > the problem and the performance impact if the checksums are not in
> > memory (optionally in that tool).
> > http://savannah.gnu.org/projects/storebackup
> 
> > Dedup really helps a lot - I think more than I could imagine before I
> > was engaged in this kind of backup. To give a simple example: you would
> > not believe how many identical files there are in a filesystem.
> 
> I saw it with my own eyes (see my previous e-mail).
> 
> > EMC has very big boxes for this with lots of RAM in them.  I think the
> > first problem which has to be solved is the memory problem.  Perhaps
> > something asynchronous that finds identical blocks and stores the
> > checksums on disk?
> 
> I think we already have a very nice solution in order to solve that
> issue:
> 
>         - Implement a system call that reports all checksums and unique
>           block identifiers for all stored blocks.
> 

This would require storing the larger checksums in the filesystem.  It
is much better done in the dedup program.

>         - Implement another system call that reports all checksums and
>           unique identifiers for all stored blocks since the last
>           report. This can be easily implemented:

This is racy because there's no way to prevent new changes.

> 
>           Use a block bitmap: for every block on the filesystem use one
>           bit. If the block is modified, set the bit to one; when the
>           bitmap is retrieved, simply zero it out:

>         Assuming a 4 kbyte block size that would mean for a 1 Tbyte
>         filesystem:
> 
>         1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
>         be saved to disk from time to time and be restored on startup).
> 

Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across
the whole FS.  Btrfs has metadata that can quickly and easily tell you
which files and which blocks in which files have changed since a given
transaction id.  This is how you want to find new things.

But, the ioctl to actually do the dedup needs to be able to verify a
given block has the contents you expect it to.  The only place you can
lock down the pages in the file and prevent new changes is inside the
kernel.

-chris




* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 20:58                     ` Chris Mason
@ 2009-04-28 21:12                       ` Thomas Glanzmann
  2009-04-28 21:26                         ` Chris Mason
  2009-04-29  0:06                       ` Bron Gondwana
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 21:12 UTC (permalink / raw)
  To: Chris Mason
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello,

> >         - Implement a system call that reports all checksums and unique
> >           block identifiers for all stored blocks.

> This would require storing the larger checksums in the filesystem.  It
> is much better done in the dedup program.

I think I misunderstood something here. I thought the checksums per
block would already be stored somewhere in btrfs?

> >         - Implement another system call that reports all checksums and
> >           unique identifiers for all stored blocks since the last
> >           report. This can be easily implemented:

> > This is racy because there's no way to prevent new changes.

I got the point.

> >           Use a block bitmap: for every block on the filesystem use one
> >           bit. If the block is modified, set the bit to one; when the
> >           bitmap is retrieved, simply zero it out:

> >         Assuming a 4 kbyte block size that would mean for a 1 Tbyte
> >         filesystem:

> >         1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
> >         be saved to disk from time to time and be restored on startup).

> Sorry, a 1TB drive is teeny, I don't think a bitmap is practical
> across the whole FS.  Btrfs has metadata that can quickly and easily
> tell you which files and which blocks in which files have changed
> since a given transaction id.  This is how you want to find new
> things.

You're right, the bitmap would not scale. So what is missing is a
system call to report the changes to userland? (Is this feature used to
generate off-site backups as well?)

> But, the ioctl to actually do the dedup needs to be able to verify a
> given block has the contents you expect it to.  The only place you can
> lock down the pages in the file and prevent new changes is inside the
> kernel.

I totally agree with that. How much time would it take to implement
such a system call?

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:43         ` Chris Mason
  2009-04-28 20:15           ` Thomas Glanzmann
@ 2009-04-28 21:19           ` Dmitri Nikulin
  1 sibling, 0 replies; 67+ messages in thread
From: Dmitri Nikulin @ 2009-04-28 21:19 UTC (permalink / raw)
  To: linux-btrfs

On Wed, Apr 29, 2009 at 3:43 AM, Chris Mason <chris.mason@oracle.com> wrote:
> So you need an extra index either way.  It makes sense to keep the
> crc32c csums for fast verification of the data read from disk and only
> use the expensive csums for dedup.

What about self-healing? With only a CRC32 to distinguish a good block
from a bad one, statistically you're likely to get an incorrectly
healed block every few billion blocks. And that may not be
your machine, but it'll be somebody's, since the probability is way
too high for it not to happen to somebody. Even just a 64 bit checksum
would drop the probability plenty, but I'd really only start with 128
bits. NetApp does 64 for 4k of data, ZFS does 256 bits per block, and
this traces back to the root like a highly dynamic Merkle tree.

In the CRC case the only safe redundancy is one that has 3+ copies of
the block, to compare the raw data itself, at which point you may as
well have just been using healing RAID1 without checksums.

--
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 21:12                       ` Thomas Glanzmann
@ 2009-04-28 21:26                         ` Chris Mason
  2009-04-28 22:14                           ` Thomas Glanzmann
  0 siblings, 1 reply; 67+ messages in thread
From: Chris Mason @ 2009-04-28 21:26 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Tue, 2009-04-28 at 23:12 +0200, Thomas Glanzmann wrote:
> Hello,
> 
> > >         - Implement a system call that reports all checksums and unique
> > >           block identifiers for all stored blocks.
> 
> > This would require storing the larger checksums in the filesystem.  It
> > is much better done in the dedup program.
> 
> I think I misunderstood something here. I thought the checksums per
> block would already be stored somewhere in btrfs?

They are, but only the crc32c are stored today.

> 
> > >         - Implement another system call that reports all checksums and
> > >           unique identifiers for all stored blocks since the last
> > >           report. This can be easily implemented:
> 
> > This is racy because there's no way to prevent new changes.
> 
> I got the point.
> 
> > >           Use a block bitmap: for every block on the filesystem use one
> > >           bit. If the block is modified, set the bit to one; when the
> > >           bitmap is retrieved, simply zero it out:
> 
> > >         Assuming a 4 kbyte block size that would mean for a 1 Tbyte
> > >         filesystem:
> 
> > >         1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
> > >         be saved to disk from time to time and be restored on startup).
> 
> > Sorry, a 1TB drive is teeny, I don't think a bitmap is practical
> > across the whole FS.  Btrfs has metadata that can quickly and easily
> > tell you which files and which blocks in which files have changed
> > since a given transaction id.  This is how you want to find new
> > things.
> 
> You're right, the bitmap would not scale. So what is missing is a
> system call to report the changes to userland? (Is this feature used to
> generate off-site backups as well?)

Yes, that's the idea.  An ioctl to walk the tree and report on changes,
but this doesn't have to be done with version 1 of the dedup code, you
can just scan the file based on mtime/ctime.

> 
> > But, the ioctl to actually do the dedup needs to be able to verify a
> > given block has the contents you expect it to.  The only place you can
> > lock down the pages in the file and prevent new changes is inside the
> > kernel.
> 
> I totally agree with that. How much time would it take to implement
> such a system call?

It is probably a 3 week to one month effort.

-chris





* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 21:26                         ` Chris Mason
@ 2009-04-28 22:14                           ` Thomas Glanzmann
  2009-04-28 23:18                             ` Chris Mason
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-28 22:14 UTC (permalink / raw)
  To: Chris Mason
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello Chris,

> They are, but only the crc32c are stored today.

maybe crc32c is good enough to identify duplicated blocks; we only need
a hint, since the dedup ioctl does the double checking. I will write a
perl script tomorrow, compare the results to the one that uses md5, and
report back.
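
Back of the envelope for how noisy that hint is (treating crc32c as an
ideal 32-bit hash): a 1 Tbyte filesystem with 4 kbyte blocks has n = 2^28
blocks, hence about n^2/2 = 2^55 block pairs, of which roughly 2^55 / 2^32
= 2^23 (about 8 million) collide by chance alone. The byte compare in the
ioctl catches every one of them, but each false candidate costs extra
reads, so crc32c mainly saves memory over md5, not I/O.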

> Yes, that's the idea.  An ioctl to walk the tree and report on
> changes, but this doesn't have to be done with version 1 of the dedup
> code, you can just scan the file based on mtime/ctime.

Good point.

> > > But, the ioctl to actually do the dedup needs to be able to verify a
> > > given block has the contents you expect it to.  The only place you can
> > > lock down the pages in the file and prevent new changes is inside the
> > > kernel.

> > I totally agree with that. How much time would it take to implement
> > such a system call?

> It is probably a 3 week to one month effort.

I'm taking on the challenge. Is there a document I can read that
introduces me to the structures used in btrfs, or can someone walk me
through them on the phone to get a quick start?

I also would like to retrieve the checksums and identify the potential
blocks first; once that work is done (even in a very preliminary state,
as long as someone can work with it), I would like to move on to the
dedup ioctl.

        Thomas


* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 22:14                           ` Thomas Glanzmann
@ 2009-04-28 23:18                             ` Chris Mason
  2009-04-29 12:03                               ` Thomas Glanzmann
  0 siblings, 1 reply; 67+ messages in thread
From: Chris Mason @ 2009-04-28 23:18 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Wed, 2009-04-29 at 00:14 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > They are, but only the crc32c are stored today.
> 
> maybe crc32c is good enough to identify duplicated blocks; we only need
> a hint, since the dedup ioctl does the double checking. I will write a
> perl script tomorrow, compare the results to the one that uses md5, and
> report back.

It's a start at least.

> 
> > Yes, that's the idea.  An ioctl to walk the tree and report on
> > changes, but this doesn't have to be done with version 1 of the dedup
> > code, you can just scan the file based on mtime/ctime.
> 
> Good point.
> 
> > > > But, the ioctl to actually do the dedup needs to be able to verify a
> > > > given block has the contents you expect it to.  The only place you can
> > > > lock down the pages in the file and prevent new changes is inside the
> > > > kernel.
> 
> > > I totally agree with that. How much time would it take to implement
> > > such a system call?
> 
> > It is probably a three-week to one-month effort.
> 
> I'm taking the challenge. Is there a document that I can read that
> introduces me to the structures used in btrfs or can someone walk me
> through on the phone to get a quick start?
> 

Great to hear.  It's an ambitious project, but I'll definitely help
explain things.

You can start with the code documentation section on
http://btrfs.wiki.kernel.org

I'll write up my ideas on how userspace controlled dedup should work.

> I would also like to retrieve the checksums and identify the potential
> blocks first; after that work is done (even in a very preliminary state)
> in a way that someone can work with it, I would like to move on to the
> dedup ioctl.

Sounds fair, I'll forward the original patch.

-chris



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 20:58                     ` Chris Mason
  2009-04-28 21:12                       ` Thomas Glanzmann
@ 2009-04-29  0:06                       ` Bron Gondwana
  1 sibling, 0 replies; 67+ messages in thread
From: Bron Gondwana @ 2009-04-29  0:06 UTC (permalink / raw)
  To: Chris Mason
  Cc: Thomas Glanzmann, Heinz-Josef Claes, Edward Shishkin,
	Tomasz Chmielewski, linux-btrfs

On Tue, Apr 28, 2009 at 04:58:15PM -0400, Chris Mason wrote:
> >         Assuming a 4 kbyte block size that would mean for a 1 Tbyte
> >         filesystem:
> > 
> >         1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
> >         be saved to disk from time to time and be restored on startup).
> > 
> 
> Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across
> the whole FS.

32Mbyte of memory is teeny too.  You can pick up a server with 32Gb for
about $5k these days.  Using 10% of that gives you room for 100Tb of
storage, which is getting to be a reasonable size.  More than that and
you would be thinking of bigger machines anyway to have enough memory
for a decent cache.

(32Gb seems to be the nice price point, because 4Gb dimms are cheap as
chips, and 8Gb are way expensive.  We're buying 32Gb machines at the
moment)

Bron ( ok, so I don't work with giant enterprise systems.  I can see
       that there are cases where this might suck, but memory seems
       to be growing as fast as disk in the low end )

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 23:18                             ` Chris Mason
@ 2009-04-29 12:03                               ` Thomas Glanzmann
  2009-04-29 13:11                                 ` Michael Tharp
  2009-04-29 13:14                                 ` Chris Mason
  0 siblings, 2 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-29 12:03 UTC (permalink / raw)
  To: Chris Mason
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello Chris,

> You can start with the code documentation section on
> http://btrfs.wiki.kernel.org

I read through this, and at the moment one question comes to mind:

http://btrfs.wiki.kernel.org/images-btrfs/7/72/Chunks-overview.png

Looking at this picture: when I implement the dedup code, do I also
have to take care of spreading the blocks across the different devices,
or is there already infrastructure in place that automates that process?

> I'll write up my ideas on how userspace controlled dedup should work.

I'm looking forward to this and will use it as an implementation road
map.

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-29 12:03                               ` Thomas Glanzmann
@ 2009-04-29 13:11                                 ` Michael Tharp
  2009-04-29 13:14                                 ` Chris Mason
  1 sibling, 0 replies; 67+ messages in thread
From: Michael Tharp @ 2009-04-29 13:11 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: Chris Mason, Heinz-Josef Claes, linux-btrfs

Thomas Glanzmann wrote:
> Looking at this picture: when I implement the dedup code, do I also
> have to take care of spreading the blocks across the different devices,
> or is there already infrastructure in place that automates that process?

If you somehow had blocks duplicated exactly across two devices, such
that the deduplicator discarded all the blocks from one disk and kept
all the blocks from the other disk, then there would be a problem. The
best way to solve this would be to always keep the block on the
least-full device and discard the rest, which shouldn't be too difficult
(especially since this would be happening in kernelspace). But the
cheapest solution is to randomize the choice, which would be sufficient
to prevent any further imbalance from developing.

-- m. tharp

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-29 12:03                               ` Thomas Glanzmann
  2009-04-29 13:11                                 ` Michael Tharp
@ 2009-04-29 13:14                                 ` Chris Mason
  2009-04-29 13:58                                   ` Thomas Glanzmann
  1 sibling, 1 reply; 67+ messages in thread
From: Chris Mason @ 2009-04-29 13:14 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Wed, 2009-04-29 at 14:03 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > You can start with the code documentation section on
> > http://btrfs.wiki.kernel.org
> 
> I read through this, and at the moment one question comes to mind:
> 
> http://btrfs.wiki.kernel.org/images-btrfs/7/72/Chunks-overview.png
> 
> Looking at this picture: when I implement the dedup code, do I also
> have to take care of spreading the blocks across the different devices,
> or is there already infrastructure in place that automates that process?

The layering inside of btrfs means that you don't need to worry about
chunks or multiple devices.

But, in your ioctls you want to deal with [file, offset, len], not
directly with block numbers.  COW means that blocks can move around
without you knowing, and some of the btrfs internals will COW files in
order to relocate storage.

So, what you want is a dedup file (or files) where your DB knows a given
offset in the file has a given csum.
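
To make that concrete, here is a sketch of what the ioctl arguments
could look like (all names and numbers are hypothetical, not an existing
btrfs interface):

/* Hypothetical sketch only.  Userspace names a source range and a
 * destination range; the kernel locks the pages, verifies both
 * ranges still contain identical bytes, and only then makes them
 * share one extent. */
#include <linux/types.h>
#include <linux/ioctl.h>

struct btrfs_ioctl_dedup_range {
	__s64 src_fd;      /* fd of the file holding the source range */
	__u64 src_offset;  /* byte offset of the source range */
	__u64 dest_offset; /* byte offset in the file the ioctl is issued on */
	__u64 len;         /* length of the range, in bytes */
};

/* 0x94 is the btrfs ioctl magic; the command number is made up here. */
#define BTRFS_IOC_DEDUP_RANGE \
	_IOW(0x94, 50, struct btrfs_ioctl_dedup_range)

The dedup scan would issue this for every candidate pair its database
reports, and treat a "contents changed" error as "rescan this file".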

-chris



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-29 13:14                                 ` Chris Mason
@ 2009-04-29 13:58                                   ` Thomas Glanzmann
  2009-04-29 14:31                                     ` Chris Mason
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-29 13:58 UTC (permalink / raw)
  To: Chris Mason
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello Chris,

> But, in your ioctls you want to deal with [file, offset, len], not
> directly with block numbers.  COW means that blocks can move around
> without you knowing, and some of the btrfs internals will COW files in
> order to relocate storage.

> So, what you want is a dedup file (or files) where your DB knows a given
> offset in the file has a given csum.

how do I track if a certain block has already been deduplicated?

So, if I only have file, offset, len and not the block number, is there
a way from userland to tell if two blocks already point to the same
block?

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-29 13:58                                   ` Thomas Glanzmann
@ 2009-04-29 14:31                                     ` Chris Mason
  2009-04-29 15:26                                       ` Thomas Glanzmann
  0 siblings, 1 reply; 67+ messages in thread
From: Chris Mason @ 2009-04-29 14:31 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Wed, 2009-04-29 at 15:58 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > But, in your ioctls you want to deal with [file, offset, len], not
> > directly with block numbers.  COW means that blocks can move around
> > without you knowing, and some of the btrfs internals will COW files in
> > order to relocate storage.
> 
> > So, what you want is a dedup file (or files) where your DB knows a given
> > offset in the file has a given csum.
> 
> how do I track if a certain block has already been deduplicated?
> 

Your database should know, and the ioctl could check to see if the
source and destination already point to the same thing before doing
anything expensive.

> So, if I only have file, offset, len and not the block number, is there
> a way from userland to tell if two blocks already point to the same
> block?

You can use the fiemap ioctl.
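
For example, a rough userspace sketch (error handling trimmed; assuming
the first mapped extent is enough to answer the question for a small
range):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

/* Return the physical address backing a byte range, or 0 for a hole
 * or an error. */
static uint64_t physical_of(int fd, uint64_t offset, uint64_t len)
{
	struct {
		struct fiemap fm;
		struct fiemap_extent fe;  /* room for one extent */
	} buf;

	memset(&buf, 0, sizeof(buf));
	buf.fm.fm_start = offset;
	buf.fm.fm_length = len;
	buf.fm.fm_flags = FIEMAP_FLAG_SYNC;  /* flush delalloc first */
	buf.fm.fm_extent_count = 1;

	if (ioctl(fd, FS_IOC_FIEMAP, &buf.fm) < 0 ||
	    buf.fm.fm_mapped_extents == 0)
		return 0;
	return buf.fm.fm_extents[0].fe_physical;
}

/* Two ranges already share storage if they map to the same physical
 * address, e.g.:
 *   physical_of(fd1, off1, 4096) == physical_of(fd2, off2, 4096)  */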

-chris



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-29 14:31                                     ` Chris Mason
@ 2009-04-29 15:26                                       ` Thomas Glanzmann
  2009-04-29 15:45                                         ` Chris Mason
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-04-29 15:26 UTC (permalink / raw)
  To: Chris Mason
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello Chris,

> Your database should know, and the ioctl could check to see if the
> source and destination already point to the same thing before doing
> anything expensive.

I see.

> > So, if I only have file, offset, len and not the block number, is there
> > a way from userland to tell if two blocks already point to the same
> > block?

> You can use the fiemap ioctl.

I see.

One more question: in my dedup test VMs, I had a few blocks that were
referenced very often; these are the top ones. The last one alone saves
3 GByte of data. My question is now: how often can a block in btrfs be referenced?

top 4 reference counts for 8 KByte blocks:

32055
59514
63606
415252

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-29 15:26                                       ` Thomas Glanzmann
@ 2009-04-29 15:45                                         ` Chris Mason
  2009-06-04  8:49                                           ` Thomas Glanzmann
  0 siblings, 1 reply; 67+ messages in thread
From: Chris Mason @ 2009-04-29 15:45 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Wed, 2009-04-29 at 17:26 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > Your database should know, and the ioctl could check to see if the
> > source and destination already point to the same thing before doing
> > anything expensive.
> 
> I see.
> 
> > > So, if I only have file, offset, len and not the block number, is there
> > > a way from userland to tell if two blocks already point to the same
> > > block?
> 
> > You can use the fiemap ioctl.
> 
> I see.
> 
> One more question: in my dedup test VMs, I had a few blocks that were
> referenced very often; these are the top ones. The last one alone saves
> 3 GByte of data. My question is now: how often can a block in btrfs be referenced?
> 

The exact answer depends on if we are referencing it from a single file
or from multiple files.  But either way it is roughly 2^32.

-chris



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:41         ` Michael Tharp
  2009-04-28 20:14           ` Thomas Glanzmann
@ 2009-05-04 14:29           ` Ric Wheeler
  2009-05-04 14:39             ` Tomasz Chmielewski
  1 sibling, 1 reply; 67+ messages in thread
From: Ric Wheeler @ 2009-05-04 14:29 UTC (permalink / raw)
  To: Michael Tharp
  Cc: Thomas Glanzmann, Tomasz Chmielewski, Chris Mason, linux-btrfs

On 04/28/2009 01:41 PM, Michael Tharp wrote:
> Thomas Glanzmann wrote:
>> no, I just used the md5 checksum. And even if I have a hash collision,
>> which is highly unlikely, it still gives a good ballpark figure.
>
> I'd start with a crc32 and/or MD5 to find candidate blocks, then do a 
> bytewise comparison before actually merging them. Even the risk of an 
> accidental collision is too high, and considering there are plenty of 
> birthday-style MD5 attacks it would not be extraordinarily difficult 
> to construct a block that collides with e.g. a system library.
>
> Keep in mind that although digests do a fairly good job of making 
> unique identifiers for larger chunks of data, they can only hold so 
> many unique combinations. Considering you're comparing blocks of a few 
> kibibytes in size it's best to just do a foolproof comparison. There's 
> nothing wrong with using a checksum/digest as a screening mechanism 
> though.
>
> -- m. tharp

One thing in the above scheme that would be really interesting for all 
possible hash functions is maintaining good stats on hash collisions, 
effectiveness of the hash, etc. There has been a lot of press about MD5 
hash collisions for example - it would be really neat to be able to 
track real-world data on those.

Ric


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 14:29           ` Ric Wheeler
@ 2009-05-04 14:39             ` Tomasz Chmielewski
  2009-05-04 14:45               ` Ric Wheeler
  0 siblings, 1 reply; 67+ messages in thread
From: Tomasz Chmielewski @ 2009-05-04 14:39 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Michael Tharp, Thomas Glanzmann, Chris Mason, linux-btrfs

Ric Wheeler wrote:

> One thing in the above scheme that would be really interesting for all 
> possible hash functions is maintaining good stats on hash collisions, 
> effectiveness of the hash, etc. There has been a lot of press about MD5 
> hash collisions for example - it would be really neat to be able to 
> track real-world data on those.

See here ("The hashing function"):

http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues

It's not "real world data", but it gives an overview that applies here.


-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 14:39             ` Tomasz Chmielewski
@ 2009-05-04 14:45               ` Ric Wheeler
  2009-05-04 15:15                 ` Thomas Glanzmann
  0 siblings, 1 reply; 67+ messages in thread
From: Ric Wheeler @ 2009-05-04 14:45 UTC (permalink / raw)
  To: Tomasz Chmielewski
  Cc: Ric Wheeler, Michael Tharp, Thomas Glanzmann, Chris Mason, linux-btrfs

On 05/04/2009 10:39 AM, Tomasz Chmielewski wrote:
> Ric Wheeler wrote:
>
>> One thing in the above scheme that would be really interesting for 
>> all possible hash functions is maintaining good stats on hash 
>> collisions, effectiveness of the hash, etc. There has been a lot of 
>> press about MD5 hash collisions for example - it would be really neat 
>> to be able to track real-world data on those.
>
> See here ("The hashing function"):
>
> http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues
>
> It's not "real world data", but it gives an overview that applies
> here.
>
>

I have lots of first hand "real world" data from 5 years at EMC in the 
Centera group where we used various types of hashes to do single 
instancing at the file level, but those are unfortunately not public :-) 
Note that other parts of EMC do dedup at the block level (Avamar most 
notably).

The key to doing dedup at scale is answering a bunch of questions:

     (1) Block level or file level dedup?
     (2) Inband dedup (during a write) or background dedup?
     (3) How reliably can you protect the pool of blocks? How reliably 
can you protect the database that maps hashes to blocks?
     (4) Can you give users who are somewhat jaded confidence in your 
solution (this is where stats come in very handy!)
     (5) In the end, dedup is basically a data compression trick - how 
effective is it (including the costs of the metadata, etc) compared to 
less complex schemes? What does it cost in CPU, DRAM, and impact on the 
foreground workload?

Dedup is a very active area in the storage world, but you need to be 
extremely robust in the face of failure since a single block failure 
could (worst case!) lose all stored data in the pool :-)

Regards,

Ric


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 14:45               ` Ric Wheeler
@ 2009-05-04 15:15                 ` Thomas Glanzmann
  2009-05-04 16:03                   ` Ric Wheeler
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-05-04 15:15 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Tomasz Chmielewski, Michael Tharp, Chris Mason, linux-btrfs

Hello Ric,

> (1) Block level or file level dedup?

what is the difference between the two?

> (2) Inband dedup (during a write) or background dedup?

I think inband dedup is way too intensive on resources (memory) and also
would kill every performance benchmark. So I think offline dedup is
the right way to go.

> (3) How reliably can you protect the pool of blocks? How reliably can
> you protect the database that maps hashes to blocks?

You have to lock down the i/o requests for the blocks in question and
compare them byte by byte anyway, just to make sure.

> (4) Can you give users who are somewhat jaded confidence in your
> solution (this is where stats come in very handy!)

For virtual machines you can reduce the used data by 1/3. Of course it
can blow up in your face when you don't watch your physical resources
closely.

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 15:15                 ` Thomas Glanzmann
@ 2009-05-04 16:03                   ` Ric Wheeler
  2009-05-04 16:16                     ` Andrey Kuzmin
  2009-05-04 16:26                     ` Thomas Glanzmann
  0 siblings, 2 replies; 67+ messages in thread
From: Ric Wheeler @ 2009-05-04 16:03 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Tomasz Chmielewski, Michael Tharp, Chris Mason, linux-btrfs

Thomas Glanzmann wrote:
> Hello Ric,
> 
>> (1) Block level or file level dedup?
> 
> what is the difference between the two?
> 
>> (2) Inband dedup (during a write) or background dedup?
> 
> I think inband dedup is way too intensive on resources (memory) and also
> would kill every performance benchmark. So I think offline dedup is
> the right way to go.

I would not categorize it as offline, but just not as inband (i.e., you can run 
a low priority background process to handle dedup).  Offline windows are 
extremely rare in production sites these days and it could take a very long time 
to do dedup at the block level over a large file system :-)

> 
>> (3) How reliably can you protect the pool of blocks? How reliably can
>> you protect the database that maps hashes to blocks?
> 
> You have to lock down the i/o requests for the blocks in question and
> compare them byte by byte anyway, just to make sure.

Yes, one advantage we had in Centera was that the objects were read-only and 
whole files, so this was not an issue for us.

> 
>> (4) Can you give users who are somewhat jaded confidence in your
>> solution (this is where stats come in very handy!)
> 
> For virtual machines you can reduce the used data by 1/3. Of course it
> can blow up in your face when you don't watch your physical resources
> closely.
> 
>         Thomas

1/3 is not sufficient for dedup in my opinion - you can get that with normal 
compression at the block level.

Put another way, if the baseline is bzip2 levels of compression for the block 
device, when is the complexity of dedup going to pay off?

I think that Chris already mentioned that you can (for virt OS images) also 
imagine using copy on write snapshots (of something that is mostly read-only 
like the OS system partitions).  Make 50 copy-on-write snapshots of "/" 
(excluding /home) and you have an effective compression of 98% until someone 
rudely starts writing at least :-)

ric



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 16:03                   ` Ric Wheeler
@ 2009-05-04 16:16                     ` Andrey Kuzmin
  2009-05-04 16:24                       ` Thomas Glanzmann
  2009-05-04 16:26                     ` Thomas Glanzmann
  1 sibling, 1 reply; 67+ messages in thread
From: Andrey Kuzmin @ 2009-05-04 16:16 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Thomas Glanzmann, Tomasz Chmielewski, Michael Tharp, Chris Mason,
	linux-btrfs

On Mon, May 4, 2009 at 8:03 PM, Ric Wheeler <rwheeler@redhat.com> wrote:
> Thomas Glanzmann wrote:
>
> I think that Chris already mentioned that you can (for virt OS images) also
> imagine using copy on write snapshots (of something that is mostly read-only
> like the OS system partitions).  Make 50 copy-on-write snapshots of "/"
> (excluding /home) and you have an effective compression of 98% until someone
> rudely starts writing at least :-)

As far as I understand, VMware already ships this "gold image" feature
(as they call it) for Windows environments and claims it to be very
efficient.

Regards,
Andrey

> ric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 16:16                     ` Andrey Kuzmin
@ 2009-05-04 16:24                       ` Thomas Glanzmann
  2009-05-04 18:06                         ` Jan-Frode Myklebust
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-05-04 16:24 UTC (permalink / raw)
  To: Andrey Kuzmin
  Cc: Ric Wheeler, Tomasz Chmielewski, Michael Tharp, Chris Mason, linux-btrfs

Hello Andrey,

> As far as I understand, VMware already ships this "gold image" feature
> (as they call it) for Windows environments and claims it to be very
> efficient.

they call it ,,thin or shallow clones'' and ship it with desktop
virtualization (one VM per thin-client user) and with VMware Lab Manager.
Looking at the website content also revealed that VMware will have a
similar feature for their workhorse ,,esx server'' in the upcoming
release. However, my point still stands: ship out a service pack for
Windows and you have 1.5 GByte of modified data that is not deduped,
because cloning from the gold image is a once-in-a-VM's-life opportunity.

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 16:03                   ` Ric Wheeler
  2009-05-04 16:16                     ` Andrey Kuzmin
@ 2009-05-04 16:26                     ` Thomas Glanzmann
  2009-05-04 19:11                       ` Heinz-Josef Claes
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-05-04 16:26 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Tomasz Chmielewski, Michael Tharp, Chris Mason, linux-btrfs

Ric,

> I would not categorize it as offline, but just not as inband (i.e., you can 
> run a low priority background process to handle dedup).

> Offline windows are extremely rare in production sites these days and
> it could take a very long time to do dedup at the block level over a
> large file system :-)

let me rephrase, by offline I meant asynchronous during off hours.

> 1/3 is not sufficient for dedup in my opinion - you can get that with 
> normal compression at the block level.

1/3 is what real data from a production environment with a mixed VM
setup gives me, without compression.

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 16:24                       ` Thomas Glanzmann
@ 2009-05-04 18:06                         ` Jan-Frode Myklebust
  2009-05-04 19:16                           ` Andrey Kuzmin
  2009-05-05  8:02                           ` Thomas Glanzmann
  0 siblings, 2 replies; 67+ messages in thread
From: Jan-Frode Myklebust @ 2009-05-04 18:06 UTC (permalink / raw)
  To: linux-btrfs

On 2009-05-04, Thomas Glanzmann <thomas@glanzmann.de> wrote:
>
>> As far as I understand, VMware already ships this "gold image" feature
>> (as they call it) for Windows environments and claims it to be very
>> efficient.
>
> they call it ,,thin or shallow clones''

"thin or shallow clones" sounds more like sparse images. I believe
"linked clones" is the word for running multiple virtual machines off a
single gold image. Ref, the "VMware View Composer" section of:

	http://www.vmware.com/products/view/whatsincluded.html

> Looking at the website content also revealed that VMware will have a
> similar feature for their workhorse ,,esx server'' in the upcoming
> release. However, my point still stands: ship out a service pack for
> Windows and you have 1.5 GByte of modified data that is not deduped.

"All desktops that are linked to a master image can be patched or updated
 simply by updating the master image, without affecting users' settings,
 data or applications."

  -jf

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 16:26                     ` Thomas Glanzmann
@ 2009-05-04 19:11                       ` Heinz-Josef Claes
  2009-05-04 21:29                         ` Dmitri Nikulin
  2009-05-24  7:27                         ` Thomas Glanzmann
  0 siblings, 2 replies; 67+ messages in thread
From: Heinz-Josef Claes @ 2009-05-04 19:11 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Ric Wheeler, Tomasz Chmielewski, Michael Tharp, Chris Mason, linux-btrfs

Thomas Glanzmann wrote:
> Ric,
>
>   
>> I would not categorize it as offline, but just not as inband (i.e., you can 
>> run a low priority background process to handle dedup).
>>     
>
>   
>> Offline windows are extremely rare in production sites these days and
>> it could take a very long time to do dedup at the block level over a
>> large file system :-)
>>     
>
> let me rephrase, by offline I meant asynchronous during off hours.
>
>   
Hi, during the last half year I thought a little bit about doing dedup
for my backup program: not only with fixed blocks (which is
implemented), but with moving blocks (at every offset in a file: 1
byte, 2 byte, ...). That means I have to do *lots* of comparisons
(size of file - blocksize). Even though it's not the same thing, it
must be very fast, and that's the same problem as the one discussed here.

My solution (not yet implemented) is as follows (hopefully I remember well):

I calculate a checksum of 24 bits (it could be another size).

This means I can have 2^24 different checksums.

Therefore, I hold a bit vector of 2^24 bits (2 MByte) in memory: one
bit for each possible checksum. This vector is initialized with zeros.

For each calculated checksum of a block, I set the corresponding bit in
the bit vector.

It is very fast to check whether a block with a given checksum exists in
the filesystem (the backup, in my case) by checking the appropriate bit
in the bit vector.

If it doesn't exist, it's a new block.

If it exists, there needs to be a separate 'real' check whether it is
really the same block (which is slow, but that happens <<1% of the time).
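
A minimal sketch of that bit vector in C (my own names, assuming the
24-bit checksum from above):

#include <stdint.h>
#include <stdlib.h>

#define CSUM_BITS 24
#define NCSUMS    (1u << CSUM_BITS)   /* 2^24 possible checksums */

static uint8_t *seen;                 /* 2^24 bits = 2 MByte */

static void seen_init(void)
{
	seen = calloc(NCSUMS / 8, 1);   /* all bits start as zero */
}

/* Record the checksum and report whether it was seen before.  A hit
 * is only a hint: the caller still has to compare the blocks for
 * real, e.g. with md5 or byte by byte. */
static int seen_test_and_set(uint32_t csum)
{
	uint32_t c = csum & (NCSUMS - 1);
	int hit = seen[c >> 3] & (1u << (c & 7));

	seen[c >> 3] |= 1u << (c & 7);
	return hit != 0;
}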

I hope it is possible to understand my thoughts. I'm in a hotel and I 
possibly cannot track the emails in this list in the next hours or days.

Regards, HJC
>> 1/3 is not sufficient for dedup in my opinion - you can get that with 
>> normal compression at the block level.
>>     
>
> 1/3 is what real data from a production environment with a mixed VM
> setup gives me, without compression.
>
>         Thomas


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 18:06                         ` Jan-Frode Myklebust
@ 2009-05-04 19:16                           ` Andrey Kuzmin
  2009-05-05  8:02                           ` Thomas Glanzmann
  1 sibling, 0 replies; 67+ messages in thread
From: Andrey Kuzmin @ 2009-05-04 19:16 UTC (permalink / raw)
  To: Jan-Frode Myklebust; +Cc: linux-btrfs

On Mon, May 4, 2009 at 10:06 PM, Jan-Frode Myklebust <janfrode@tanso.net> wrote:
>> Looking at the website content also revealed that VMware will have a
>> similar feature for their workhorse ,,esx server'' in the upcoming
>> release. However, my point still stands: ship out a service pack for
>> Windows and you have 1.5 GByte of modified data that is not deduped.
>
> "All desktops that are linked to a master image can be patched or updated
>  simply by updating the master image, without affecting users' settings,
>  data or applications."

Writable clone support in the file-system coupled with a hierarchical
settings/data/apps layout, and you have what's described above.

Regards,
Andrey

>  -jf

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 19:11                       ` Heinz-Josef Claes
@ 2009-05-04 21:29                         ` Dmitri Nikulin
  2009-05-05  7:18                           ` Heinz-Josef Claes
  2009-05-24  7:27                         ` Thomas Glanzmann
  1 sibling, 1 reply; 67+ messages in thread
From: Dmitri Nikulin @ 2009-05-04 21:29 UTC (permalink / raw)
  To: linux-btrfs

On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes <hjclaes@web.de> wrote:
> Hi, during the last half year I thought a little bit about doing dedup
> for my backup program: not only with fixed blocks (which is
> implemented), but with moving blocks (at every offset in a file: 1
> byte, 2 byte, ...). That means I have to do *lots* of comparisons
> (size of file - blocksize). Even though it's not the same thing, it
> must be very fast, and that's the same problem as the one discussed here.
>
> My solution (not yet implemented) is as follows (hopefully I remember well):
>
> I calculate a checksum of 24 bits (it could be another size).
>
> This means I can have 2^24 different checksums.
>
> Therefore, I hold a bit vector of 2^24 bits (2 MByte) in memory: one
> bit for each possible checksum. This vector is initialized with zeros.
>
> For each calculated checksum of a block, I set the corresponding bit in
> the bit vector.
>
> It is very fast to check whether a block with a given checksum exists in
> the filesystem (the backup, in my case) by checking the appropriate bit
> in the bit vector.
>
> If it doesn't exist, it's a new block.
>
> If it exists, there needs to be a separate 'real' check whether it is
> really the same block (which is slow, but that happens <<1% of the time).

Which means you have to refer to each block in some unique way from
the bit vector, making it a block pointer vector instead. That's only
64 times more expensive for a 64 bit offset...

Since the overwhelming majority of combinations will never appear in
practice, you are much better served with a self-sizing data structure
like a hash map or even a binary tree, or a hash map with each bucket
being a binary tree, etc... You can use any sized hash and it won't
affect the number of nodes you have to store. You can trade off to CPU
or RAM easily as required, just by selecting an appropriate data
structure. A bit vector and especially a pointer vector have extremely
bad "any" case RAM requirements because even if you're deduping a mere
10 blocks you're still allocating and initialising 2^24 offsets. The
least you could do is adaptively switch to a more efficient data
structure if you see the number of blocks is low enough.
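
A toy sketch of what I mean (a grow-on-demand open-addressing set in C,
doubling at 3/4 load, so memory follows the number of blocks actually
seen; initialize with struct csum_set s = {0}, calloc errors not handled):

#include <stdint.h>
#include <stdlib.h>

/* Open-addressed set of 64-bit checksums; slot value 0 means empty,
 * so checksum 0 is remapped to 1 (a harmless approximation here). */
struct csum_set {
	uint64_t *slots;
	size_t cap;   /* always a power of two, 0 before first insert */
	size_t used;
};

static void set_grow(struct csum_set *s);

/* Returns 1 if the checksum was already present (dedup candidate),
 * 0 if it was new; inserts it either way. */
static int set_insert(struct csum_set *s, uint64_t csum)
{
	size_t i;

	if (csum == 0)
		csum = 1;
	if ((s->used + 1) * 4 >= s->cap * 3)
		set_grow(s);
	i = (csum * 0x9e3779b97f4a7c15ULL) & (s->cap - 1);
	while (s->slots[i]) {
		if (s->slots[i] == csum)
			return 1;
		i = (i + 1) & (s->cap - 1);  /* linear probing */
	}
	s->slots[i] = csum;
	s->used++;
	return 0;
}

static void set_grow(struct csum_set *s)
{
	uint64_t *old = s->slots;
	size_t old_cap = s->cap, i;

	s->cap = s->cap ? s->cap * 2 : 1024;
	s->slots = calloc(s->cap, sizeof(*s->slots));
	s->used = 0;
	for (i = 0; i < old_cap; i++)        /* rehash the old entries */
		if (old[i])
			set_insert(s, old[i]);
	free(old);
}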

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 21:29                         ` Dmitri Nikulin
@ 2009-05-05  7:18                           ` Heinz-Josef Claes
  0 siblings, 0 replies; 67+ messages in thread
From: Heinz-Josef Claes @ 2009-05-05  7:18 UTC (permalink / raw)
  To: Dmitri Nikulin; +Cc: linux-btrfs

On Tue, 5 May 2009 07:29:45 +1000
Dmitri Nikulin <dnikulin@gmail.com> wrote:

> On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes <hjclaes@web.de> wrote:
> > Hi, during the last half year I thought a little bit about doing dedup
> > for my backup program: not only with fixed blocks (which is
> > implemented), but with moving blocks (at every offset in a file: 1
> > byte, 2 byte, ...). That means I have to do *lots* of comparisons
> > (size of file - blocksize). Even though it's not the same thing, it
> > must be very fast, and that's the same problem as the one discussed here.
> >
> > My solution (not yet implemented) is as follows (hopefully I remember well):
> >
> > I calculate a checksum of 24 bits (it could be another size).
> >
> > This means I can have 2^24 different checksums.
> >
> > Therefore, I hold a bit vector of 2^24 bits (2 MByte) in memory: one
> > bit for each possible checksum. This vector is initialized with zeros.
> >
> > For each calculated checksum of a block, I set the corresponding bit in
> > the bit vector.
> >
> > It is very fast to check whether a block with a given checksum exists in
> > the filesystem (the backup, in my case) by checking the appropriate bit
> > in the bit vector.
> >
> > If it doesn't exist, it's a new block.
> >
> > If it exists, there needs to be a separate 'real' check whether it is
> > really the same block (which is slow, but that happens <<1% of the time).
> 
> Which means you have to refer to each block in some unique way from
> the bit vector, making it a block pointer vector instead. That's only
> 64 times more expensive for a 64 bit offset...
> 
It was not the idea to have a pointer vector, only a bit vector. A pointer
vector would be too big to hold in RAM. Therefore, after a bit-vector hit I
need to go to disk and use the more exact md5sum (which I wanted to use
anyway). The bit vector is only needed to get a very quick decision in most
cases (a speedup).
But I have no idea whether it fits this use case. I'm not a filesystem developer ;-)

> Since the overwhelming majority of combinations will never appear in
> practice, you are much better served with a self-sizing data structure
> like a hash map or even a binary tree, or a hash map with each bucket
> being a binary tree, etc... You can use any sized hash and it won't
> affect the number of nodes you have to store. You can trade off to CPU
> or RAM easily as required, just by selecting an appropriate data
> structure. A bit vector and especially a pointer vector have extremely
> bad "any" case RAM requirements because even if you're deduping a mere
> 10 blocks you're still allocating and initialising 2^24 offsets. The
> least you could do is adaptively switch to a more efficient data
> structure if you see the number of blocks is low enough.
> 
> -- 
> Dmitri Nikulin
> 
> Centre for Synchrotron Science
> Monash University
> Victoria 3800, Australia


-- 
Heinz-Josef Claes <hjclaes@web.de>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 18:06                         ` Jan-Frode Myklebust
  2009-05-04 19:16                           ` Andrey Kuzmin
@ 2009-05-05  8:02                           ` Thomas Glanzmann
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-05-05  8:02 UTC (permalink / raw)
  To: Jan-Frode Myklebust; +Cc: linux-btrfs

Hello Jan,

* Jan-Frode Myklebust <janfrode@tanso.net> [090504 20:20]:
> "thin or shallow clones" sounds more like sparse images. I believe
> "linked clones" is the word for running multiple virtual machines off
> a single gold image. Ref, the "VMware View Composer" section of:

not exactly. VMware has one golden image and then accounts for the blocks
that differ from the golden image by tracking the differences in a
snapshot file on a block level basis. So yes, it is a ,,sparse file'',
but implemented in userland.

> "All desktops that are linked to a master image can be patched or updated
>  simply by updating the master image, without affecting users' settings,
>  data or applications."

True for their desktops, because each desktop consists of a virtual
machine which is created when a user connects and destroyed when he
disconnects. Bottom line: the VM for the thin client only exists while
the user is connected. That of course makes it easy to update the master
image. But for VMs that are not desktops and do not get created/destroyed
at least once a day, this doesn't scale.

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-28 17:45             ` Heinz-Josef Claes
  2009-04-28 20:16               ` Thomas Glanzmann
@ 2009-05-06 15:16               ` Sander
  1 sibling, 0 replies; 67+ messages in thread
From: Sander @ 2009-05-06 15:16 UTC (permalink / raw)
  To: Heinz-Josef Claes
  Cc: Chris Mason, Thomas Glanzmann, Edward Shishkin,
	Tomasz Chmielewski, linux-btrfs

Heinz-Josef Claes wrote (ao):
> On Tuesday, 28 April 2009 19:38:24, Chris Mason wrote:
> > On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote:
> > > Hello,
> > >
> > > > I wouldn't rely on crc32: it is not a strong hash,
> > > > Such deduplication can lead to various problems,
> > > > including security ones.
> > >
> > > sure thing, did you think of replacing crc32 with sha1 or md5, is this
> > > even possible (is there enough space reserved so that the change can be
> > > done without changing the filesystem layout) at the moment with btrfs?
> >
> > It is possible, there's room in the metadata for about about 4k of
> > checksum for each 4k of data.  The initial btrfs code used sha256, but
> > the real limiting factor is the CPU time used.
> >
> > -chris
> >
> It's not only CPU time, it's also memory. You need 32 bytes for each 4k block. 
> It needs to be in RAM for performance reasons.

Less so with SSD I would assume.

-- 
Humilis IT Services and Solutions
http://www.humilis.net

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-05-04 19:11                       ` Heinz-Josef Claes
  2009-05-04 21:29                         ` Dmitri Nikulin
@ 2009-05-24  7:27                         ` Thomas Glanzmann
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Glanzmann @ 2009-05-24  7:27 UTC (permalink / raw)
  To: Heinz-Josef Claes
  Cc: Ric Wheeler, Tomasz Chmielewski, Michael Tharp, Chris Mason, linux-btrfs

Hello Heinz,

> Hi, during the last half year I thought a little bit about doing dedup
> for my backup program: not only with fixed blocks (which is
> implemented), but with moving blocks (at every offset in a file: 1
> byte, 2 byte, ...). That means I have to do *lots* of comparisons
> (size of file - blocksize). Even though it's not the same thing, it
> must be very fast, and that's the same problem as the one discussed here.

because I just stumbled across it, I wanted to let you know about an
interesting approach that NetApp is using for its Virtual Tape Library:

http://www.netapp.com/us/communities/tech-ontap/vtl-dedupe.html

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-04-29 15:45                                         ` Chris Mason
@ 2009-06-04  8:49                                           ` Thomas Glanzmann
  2009-06-04 11:43                                             ` Chris Mason
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-06-04  8:49 UTC (permalink / raw)
  To: Chris Mason
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

Hello Chris,

> >  My question is now: how often can a block in btrfs be referenced?

> The exact answer depends on if we are referencing it from a single
> file or from multiple files.  But either way it is roughly 2^32.

could you please explain to me what underlying data structure is used to
track whether a block is still referenced or already free? Is a counter
used, a bitmap (but that can't be if the limit is 2^32), or some sort of
list? I assume that a counter is used. If this is the case, I assume that
when a snapshot, for example, is deleted, the reference counter of every
block that was referenced by the snapshot is decremented by one. Is this
correct, or am I missing something here?

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-06-04  8:49                                           ` Thomas Glanzmann
@ 2009-06-04 11:43                                             ` Chris Mason
  2009-06-04 12:03                                               ` Thomas Glanzmann
  2009-06-05 12:20                                               ` Tomasz Chmielewski
  0 siblings, 2 replies; 67+ messages in thread
From: Chris Mason @ 2009-06-04 11:43 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > >  My question is now: how often can a block in btrfs be referenced?
> 
> > The exact answer depends on if we are referencing it from a single
> > file or from multiple files.  But either way it is roughly 2^32.
> 
> could you please explain to me what underlying data structure is used to
> track whether a block is still referenced or already free? Is a counter
> used, a bitmap (but that can't be if the limit is 2^32), or some sort of
> list? I assume that a counter is used. If this is the case, I assume that
> when a snapshot, for example, is deleted, the reference counter of every
> block that was referenced by the snapshot is decremented by one. Is this
> correct, or am I missing something here?

It is a counter and a back reference.  With Yan Zheng's new format work,
the limit is now 2^64.

When a snapshot is deleted, the btree is walked to efficiently drop the
references on the blocks it referenced.

From a dedup point of view, we'll want the dedup file to hold a
reference on the file extents.  The kernel ioctl side of things will
take care of that part.

-chris


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-06-04 11:43                                             ` Chris Mason
@ 2009-06-04 12:03                                               ` Thomas Glanzmann
  2009-06-04 12:43                                                 ` Chris Mason
  2009-06-05 12:20                                               ` Tomasz Chmielewski
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Glanzmann @ 2009-06-04 12:03 UTC (permalink / raw)
  To: Chris Mason, Heinz-Josef Claes, Edward Shishkin,
	Tomasz Chmielewski, linux-btrfs

Chris,

> It is a counter and a back reference.  With Yan Zheng's new format
> work, the limit is now 2^64.

Does that mean that there is one back reference for every use of the block?
Where is this back reference stored? (I'm asking because if one back
reference is stored for every copy, it obviously cannot be allocated
statically.)

        Thomas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-06-04 12:03                                               ` Thomas Glanzmann
@ 2009-06-04 12:43                                                 ` Chris Mason
  0 siblings, 0 replies; 67+ messages in thread
From: Chris Mason @ 2009-06-04 12:43 UTC (permalink / raw)
  To: Thomas Glanzmann
  Cc: Heinz-Josef Claes, Edward Shishkin, Tomasz Chmielewski, linux-btrfs

On Thu, Jun 04, 2009 at 02:03:50PM +0200, Thomas Glanzmann wrote:
> Chris,
> 
> > It is a counter and a back reference.  With Yan Zheng's new format
> > work, the limit is now 2^64.
> 
> Does that mean that there is one back reference for every use of the block?
> Where is this back reference stored? (I'm asking because if one back
> reference is stored for every copy, it obviously cannot be allocated
> statically.)

These are all stored in the extent allocation tree.  There isn't exactly
a 1:1 mapping but it is effectively that.

-chris

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-06-04 11:43                                             ` Chris Mason
  2009-06-04 12:03                                               ` Thomas Glanzmann
@ 2009-06-05 12:20                                               ` Tomasz Chmielewski
  2009-06-05 12:50                                                 ` Chris Mason
  1 sibling, 1 reply; 67+ messages in thread
From: Tomasz Chmielewski @ 2009-06-05 12:20 UTC (permalink / raw)
  To: Chris Mason, Thomas Glanzmann, Heinz-Josef Claes,
	Edward Shishkin, Tomasz Chmielewski

Chris Mason wrote:
> On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:
>> Hello Chris,
>>
>>>>  My question is now: how often can a block in btrfs be referenced?
>>> The exact answer depends on if we are referencing it from a single
>>> file or from multiple files.  But either way it is roughly 2^32.
>> could you please explain to me what underlying data structure is used to
>> track whether a block is still referenced or already free? Is a counter
>> used, a bitmap (but that can't be if the limit is 2^32), or some sort of
>> list? I assume that a counter is used. If this is the case, I assume that
>> when a snapshot, for example, is deleted, the reference counter of every
>> block that was referenced by the snapshot is decremented by one. Is this
>> correct, or am I missing something here?
> 
> It is a counter and a back reference.  With Yan Zheng's new format work,
> the limit is now 2^64.
> 
> When a snapshot is deleted, the btree is walked to efficiently drop the
> references on the blocks it referenced.
> 
> From a dedup point of view, we'll want the dedup file to hold a
> reference on the file extents.  The kernel ioctl side of things will
> take care of that part.

I wonder how well deduplication would work with defragmentation? One
excludes the other to some extent.


-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-06-05 12:20                                               ` Tomasz Chmielewski
@ 2009-06-05 12:50                                                 ` Chris Mason
  2009-06-05 15:35                                                   ` Tomasz Chmielewski
  0 siblings, 1 reply; 67+ messages in thread
From: Chris Mason @ 2009-06-05 12:50 UTC (permalink / raw)
  To: Tomasz Chmielewski
  Cc: Thomas Glanzmann, Heinz-Josef Claes, Edward Shishkin, linux-btrfs

On Fri, Jun 05, 2009 at 02:20:48PM +0200, Tomasz Chmielewski wrote:
> Chris Mason wrote:
>> On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:
>>> Hello Chris,
>>>
>>>>>  My question is now: how often can a block in btrfs be referenced?
>>>> The exact answer depends on if we are referencing it from a single
>>>> file or from multiple files.  But either way it is roughly 2^32.
>>> could you please explain to me what underlying data structure is used to
>>> track whether a block is still referenced or already free? Is a counter
>>> used, a bitmap (but that can't be if the limit is 2^32), or some sort of
>>> list? I assume that a counter is used. If this is the case, I assume that
>>> when a snapshot, for example, is deleted, the reference counter of every
>>> block that was referenced by the snapshot is decremented by one. Is this
>>> correct, or am I missing something here?
>>
>> It is a counter and a back reference.  With Yan Zheng's new format work,
>> the limit is now 2^64.
>>
>> When a snapshot is deleted, the btree is walked to efficiently drop the
>> references on the blocks it referenced.
>>
>> From a dedup point of view, we'll want the dedup file to hold a
>> reference on the file extents.  The kernel ioctl side of things will
>> take care of that part.
>
> I wonder how well deduplication would work with defragmentation? One
> excludes the other to some extent.

Very much so ;)  Ideally we end up doing dedup in large extents, but it
will definitely increase the overall fragmentation of the FS.

-chris


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Data Deduplication with the help of an online filesystem check
  2009-06-05 12:50                                                 ` Chris Mason
@ 2009-06-05 15:35                                                   ` Tomasz Chmielewski
  0 siblings, 0 replies; 67+ messages in thread
From: Tomasz Chmielewski @ 2009-06-05 15:35 UTC (permalink / raw)
  To: Chris Mason, Tomasz Chmielewski, Thomas Glanzmann,
	Heinz-Josef Claes, Edward Shishkin

Chris Mason wrote:

>> I wonder how well deduplication would work with defragmentation? One
>> excludes the other to some extent.
> 
> Very much so ;)  Ideally we end up doing dedup in large extents, but it
> will definitely increase the overall fragmentation of the FS.

Defragmentation could lead to interesting problems if it's not aware of
deduplication.


I can imagine "freeing up space" (i.e., as seen by userspace) where
duplicated blocks are found, but keeping track of the duplicated blocks
internally (and not allowing any duplicated block to be overwritten).

Only when free space is really needed, "deduplicate" the blocks which
have more copies, and allow them to be overwritten with new data.

Perhaps complicated though.


-- 
Tomasz Chmielewski

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2009-06-05 15:35 UTC | newest]

Thread overview: 67+ messages
2009-04-27  3:33 Data Deduplication with the help of an online filesystem check Thomas Glanzmann
2009-04-27 13:37 ` Chris Mason
2009-04-28  5:22   ` Thomas Glanzmann
2009-04-28 10:02     ` Chris Mason
2009-04-28 13:49       ` Andrey Kuzmin
2009-04-28 13:58         ` Chris Mason
2009-04-28 14:04           ` Thomas Glanzmann
2009-04-28 17:21             ` Chris Mason
2009-04-28 20:10               ` Thomas Glanzmann
2009-04-28 20:29                 ` Thomas Glanzmann
2009-04-28 13:58         ` jim owens
2009-04-28 16:10       ` Anthony Roberts
2009-04-28 15:59   ` Thomas Glanzmann
2009-04-28 16:04     ` Tomasz Chmielewski
2009-04-28 17:29       ` Edward Shishkin
2009-04-28 17:34         ` Thomas Glanzmann
2009-04-28 17:38           ` Chris Mason
2009-04-28 17:43             ` Thomas Glanzmann
2009-04-28 17:45             ` Heinz-Josef Claes
2009-04-28 20:16               ` Thomas Glanzmann
2009-04-28 20:36                 ` Heinz-Josef Claes
2009-04-28 20:52                   ` Thomas Glanzmann
2009-04-28 20:58                     ` Chris Mason
2009-04-28 21:12                       ` Thomas Glanzmann
2009-04-28 21:26                         ` Chris Mason
2009-04-28 22:14                           ` Thomas Glanzmann
2009-04-28 23:18                             ` Chris Mason
2009-04-29 12:03                               ` Thomas Glanzmann
2009-04-29 13:11                                 ` Michael Tharp
2009-04-29 13:14                                 ` Chris Mason
2009-04-29 13:58                                   ` Thomas Glanzmann
2009-04-29 14:31                                     ` Chris Mason
2009-04-29 15:26                                       ` Thomas Glanzmann
2009-04-29 15:45                                         ` Chris Mason
2009-06-04  8:49                                           ` Thomas Glanzmann
2009-06-04 11:43                                             ` Chris Mason
2009-06-04 12:03                                               ` Thomas Glanzmann
2009-06-04 12:43                                                 ` Chris Mason
2009-06-05 12:20                                               ` Tomasz Chmielewski
2009-06-05 12:50                                                 ` Chris Mason
2009-06-05 15:35                                                   ` Tomasz Chmielewski
2009-04-29  0:06                       ` Bron Gondwana
2009-05-06 15:16               ` Sander
2009-04-28 17:32       ` Thomas Glanzmann
2009-04-28 17:41         ` Michael Tharp
2009-04-28 20:14           ` Thomas Glanzmann
2009-05-04 14:29           ` Ric Wheeler
2009-05-04 14:39             ` Tomasz Chmielewski
2009-05-04 14:45               ` Ric Wheeler
2009-05-04 15:15                 ` Thomas Glanzmann
2009-05-04 16:03                   ` Ric Wheeler
2009-05-04 16:16                     ` Andrey Kuzmin
2009-05-04 16:24                       ` Thomas Glanzmann
2009-05-04 18:06                         ` Jan-Frode Myklebust
2009-05-04 19:16                           ` Andrey Kuzmin
2009-05-05  8:02                           ` Thomas Glanzmann
2009-05-04 16:26                     ` Thomas Glanzmann
2009-05-04 19:11                       ` Heinz-Josef Claes
2009-05-04 21:29                         ` Dmitri Nikulin
2009-05-05  7:18                           ` Heinz-Josef Claes
2009-05-24  7:27                         ` Thomas Glanzmann
2009-04-28 17:23     ` Chris Mason
2009-04-28 17:37       ` Thomas Glanzmann
2009-04-28 17:43         ` Chris Mason
2009-04-28 20:15           ` Thomas Glanzmann
2009-04-28 21:19           ` Dmitri Nikulin
2009-04-28 20:24       ` Thomas Glanzmann
