* [markfasheh/duperemove] Why blocksize is limit to 1MB?
@ 2016-12-30 20:28 Peter Becker
  2017-01-01  4:38 ` Xin Zhou
  2017-01-03 12:40 ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Becker @ 2016-12-30 20:28 UTC (permalink / raw)
  To: linux-btrfs

Hello, I have an 8 TB volume with multiple files of hundreds of GB each.
I am trying to dedupe this because the first hundred GB of many files are identical.
With a 128KB blocksize and the nofiemap and lookup-extents=no options, it will
take more than a week (dedupe only, already hashed). So I tried -b
100M, but it returned an error: "Blocksize is bounded ...".

The reason is that the blocksize is limited to

#define MAX_BLOCKSIZE (1024U*1024)

But I can't find any explanation of why.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2016-12-30 20:28 [markfasheh/duperemove] Why blocksize is limit to 1MB? Peter Becker
@ 2017-01-01  4:38 ` Xin Zhou
  2017-01-02 12:32   ` Peter Becker
  2017-01-03 12:40 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 19+ messages in thread
From: Xin Zhou @ 2017-01-01  4:38 UTC (permalink / raw)
  To: Peter Becker; +Cc: linux-btrfs

Hi,

In general, the larger the block/chunk size is, the less deduplication can be achieved.
1M is already a little bit too big.

Thanks,
Xin

 
 

Sent: Friday, December 30, 2016 at 12:28 PM
From: "Peter Becker" <floyd.net@gmail.com>
To: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: [markfasheh/duperemove] Why blocksize is limit to 1MB?
Hello, i have a 8 TB volume with multiple files with hundreds of GB each.
I try to dedupe this because the first hundred GB of many files are identical.
With 128KB blocksize with nofiemap and lookup-extends=no option, will
take more then a week (only dedupe, previously hashed). So i tryed -b
100M but this returned me an error: "Blocksize is bounded ...".

The reason is that the blocksize is limit to

#define MAX_BLOCKSIZE (1024U*1024)

But i can't found any description why.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-01  4:38 ` Xin Zhou
@ 2017-01-02 12:32   ` Peter Becker
  2017-01-02 19:36     ` Xin Zhou
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Becker @ 2017-01-02 12:32 UTC (permalink / raw)
  To: Xin Zhou; +Cc: linux-btrfs

> 1M is already a little bit too big in size.

Not in my use case :)

Is it right that this isn't a limit in btrfs? If so, I can patch it and try 100M.
The reason is that I must dedupe the whole 8 TB in less than a day,
but with a 128K or 1M blocksize it will take a week.
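
Roughly, the patch I have in mind is just raising the cap (untested sketch;
the file holding the define depends on the duperemove version):

/* untested sketch -- only the value changes */
/* was: #define MAX_BLOCKSIZE (1024U*1024) */
#define MAX_BLOCKSIZE (100U*1024*1024)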

I don't know why adding extents takes so long.
I/O while adding extents is less than 4MB/s, and CPU (dual core) and
memory (8 GB) usage are below 20%, on bare metal.

2017-01-01 5:38 GMT+01:00 Xin Zhou <xin.zhou@gmx.com>:
> Hi,
>
> In general, the larger the block / chunk size is, the less dedup can be achieved.
> 1M is already a little bit too big in size.
>
> Thanks,
> Xin
>
>
>
>
> Sent: Friday, December 30, 2016 at 12:28 PM
> From: "Peter Becker" <floyd.net@gmail.com>
> To: linux-btrfs <linux-btrfs@vger.kernel.org>
> Subject: [markfasheh/duperemove] Why blocksize is limit to 1MB?
> Hello, i have a 8 TB volume with multiple files with hundreds of GB each.
> I try to dedupe this because the first hundred GB of many files are identical.
> With 128KB blocksize with nofiemap and lookup-extends=no option, will
> take more then a week (only dedupe, previously hashed). So i tryed -b
> 100M but this returned me an error: "Blocksize is bounded ...".
>
> The reason is that the blocksize is limit to
>
> #define MAX_BLOCKSIZE (1024U*1024)
>
> But i can't found any description why.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-02 12:32   ` Peter Becker
@ 2017-01-02 19:36     ` Xin Zhou
  0 siblings, 0 replies; 19+ messages in thread
From: Xin Zhou @ 2017-01-02 19:36 UTC (permalink / raw)
  To: Peter Becker; +Cc: linux-btrfs

Hi,

Before doing that, one way to think about it could be:
what is the probability that two 100M blocks generate the same hash and are treated as identical?
Thanks,
Xin
 
 

Sent: Monday, January 02, 2017 at 4:32 AM
From: "Peter Becker" <floyd.net@gmail.com>
To: "Xin Zhou" <xin.zhou@gmx.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
> 1M is already a little bit too big in size.

Not in my usecase :)

Is it right the this isn't an limit in btrfs? So i can patch this and try 100M.
The reason is, that i must dedupe the whole 8 TB in less then a day
but with 128K and 1M blocksize it will take a week.

I don't know why adding extends take so long.
I/O during adding extends is less then 4MB/s, and CPU (dual core) and
memory (8 GB) usage are less then 20%, on bare metal.

2017-01-01 5:38 GMT+01:00 Xin Zhou <xin.zhou@gmx.com>:
> Hi,
>
> In general, the larger the block / chunk size is, the less dedup can be achieved.
> 1M is already a little bit too big in size.
>
> Thanks,
> Xin
>
>
>
>
> Sent: Friday, December 30, 2016 at 12:28 PM
> From: "Peter Becker" <floyd.net@gmail.com>
> To: linux-btrfs <linux-btrfs@vger.kernel.org>
> Subject: [markfasheh/duperemove] Why blocksize is limit to 1MB?
> Hello, i have a 8 TB volume with multiple files with hundreds of GB each.
> I try to dedupe this because the first hundred GB of many files are identical.
> With 128KB blocksize with nofiemap and lookup-extends=no option, will
> take more then a week (only dedupe, previously hashed). So i tryed -b
> 100M but this returned me an error: "Blocksize is bounded ...".
>
> The reason is that the blocksize is limit to
>
> #define MAX_BLOCKSIZE (1024U*1024)
>
> But i can't found any description why.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2016-12-30 20:28 [markfasheh/duperemove] Why blocksize is limit to 1MB? Peter Becker
  2017-01-01  4:38 ` Xin Zhou
@ 2017-01-03 12:40 ` Austin S. Hemmelgarn
  2017-01-03 19:24   ` Peter Becker
       [not found]   ` <CAEtw4r3mUA_4vcS-dbxagQn3NPRh8Cxcz0iF0L7jHwv5c9Ui+g@mail.gmail.com>
  1 sibling, 2 replies; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-03 12:40 UTC (permalink / raw)
  To: Peter Becker, linux-btrfs

On 2016-12-30 15:28, Peter Becker wrote:
> Hello, i have a 8 TB volume with multiple files with hundreds of GB each.
> I try to dedupe this because the first hundred GB of many files are identical.
> With 128KB blocksize with nofiemap and lookup-extends=no option, will
> take more then a week (only dedupe, previously hashed). So i tryed -b
> 100M but this returned me an error: "Blocksize is bounded ...".
>
> The reason is that the blocksize is limit to
>
> #define MAX_BLOCKSIZE (1024U*1024)
>
> But i can't found any description why.
Beyond what Xin mentioned (namely that 1MB is a much larger block than 
will be duplicated in most data-sets), there are a couple of other reasons:
1. Smaller blocks will actually get you better deduplication on average 
because they're more likely to match.  As an example, assume you have 2 
files with the same 8 4k blocks in different orders:
   FileA: 1 2 3 4 5 6 7 8
   FileB: 7 8 5 6 3 4 1 2
In such a case, deduplicating at any block size above 8k would result in 
zero deduplication between these files, while 8k or less would 
completely deduplicate them.  This is of course a highly specific and 
somewhat contrived example (in most cases it will be scattered duplicate 
blocks over dozens of files), but it does convey this specific point.
2. The kernel will do a byte-wise comparison of all ranges you pass into 
the ioctl at the same time.  Larger block sizes here mean that:
	a) The extents will be locked longer, which will prevent any I/O to the 
files being deduplicated for the duration of the comparison, which may 
in turn cause other issues on the system.
	b) The deduplication process will be stuck in uninterruptible sleep 
longer, which on many systems will trigger hung task detection, which 
will in turn either spam the system log or panic the system depending on 
how it's configured.
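
To make that call pattern concrete, here is a minimal sketch of a single
request through the generic FIDEDUPERANGE ioctl (the VFS-level counterpart
of BTRFS_IOC_FILE_EXTENT_SAME).  The paths and the 1MB length are only
illustrative, and error handling is kept to the bare minimum:

/* Minimal sketch: one dedupe request for a single 1MB range pair. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	int src = open("/mnt/old/file", O_RDONLY);
	int dst = open("/mnt/new/file", O_RDWR);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* One request, one destination range: offer 1MB at offset 0 of the
	 * source for deduplication against offset 0 of the destination. */
	struct file_dedupe_range *req =
		calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
	if (!req)
		return 1;
	req->src_offset = 0;
	req->src_length = 1024 * 1024;
	req->dest_count = 1;
	req->info[0].dest_fd = dst;
	req->info[0].dest_offset = 0;

	/* The kernel locks both ranges, compares them byte by byte and only
	 * then reflinks them -- this is the per-call cost described above. */
	if (ioctl(src, FIDEDUPERANGE, req) < 0) {
		perror("FIDEDUPERANGE");
		return 1;
	}

	if (req->info[0].status == FILE_DEDUPE_RANGE_SAME)
		printf("deduped %llu bytes\n",
		       (unsigned long long)req->info[0].bytes_deduped);
	else
		printf("not deduped (status %d)\n", req->info[0].status);

	free(req);
	return 0;
}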


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-03 12:40 ` Austin S. Hemmelgarn
@ 2017-01-03 19:24   ` Peter Becker
  2017-01-03 23:07     ` Hans van Kranenburg
       [not found]   ` <CAEtw4r3mUA_4vcS-dbxagQn3NPRh8Cxcz0iF0L7jHwv5c9Ui+g@mail.gmail.com>
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Becker @ 2017-01-03 19:24 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

All of those points are valid, but not relevant in (offline) backup
and archive scenarios.

For example, you have multiple versions of append-only log files or
append-only DB files (each more than 100GB in size), like this:

> Snapshot_01_01_2017
-> file1.log .. 201 GB

> Snapshot_02_01_2017
-> file1.log .. 205 GB

> Snapshot_03_01_2017
-> file1.log .. 221 GB

The first 201 GB would be the same every time.
The files are copied at night from Windows, Linux or BSD systems and
snapshotted after the copy.

So a fast way to dedupe this is needed. Using 128KB blocks would
result in 1,646,592 extents per snapshot; a 1MB blocksize results in
205,824 extents (not bad, but still a terrible speed).
I will test it tonight with a patched version of duperemove and a
100MB blocksize, but I have little hope that the throughput will
increase.

For backup and archive scenarios, the checksum feature and the
dup-data/metadata feature of btrfs are really nice, particularly if one
considers the legally prescribed 7-year retention period.

2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2016-12-30 15:28, Peter Becker wrote:
>>
>> Hello, i have a 8 TB volume with multiple files with hundreds of GB each.
>> I try to dedupe this because the first hundred GB of many files are
>> identical.
>> With 128KB blocksize with nofiemap and lookup-extends=no option, will
>> take more then a week (only dedupe, previously hashed). So i tryed -b
>> 100M but this returned me an error: "Blocksize is bounded ...".
>>
>> The reason is that the blocksize is limit to
>>
>> #define MAX_BLOCKSIZE (1024U*1024)
>>
>> But i can't found any description why.
>
> Beyond what Xin mentioned (namely that 1MB is a much larger block than will
> be duplicated in most data-sets), there are a couple of other reasons:
> 1. Smaller blocks will actually get you better deduplication on average
> because they're more likely to match.  As an example, assume you have 2
> files with the same 8 4k blocks in different orders:
>   FileA: 1 2 3 4 5 6 7 8
>   FileB: 7 8 5 6 3 4 1 2
> In such a case, deduplicating at any block size above 8k would result in
> zero deduplication between these files, while 8k or less would completely
> deduplicate them.  This is of course a highly specific and somewhat
> contrived example (in most cases it will be scattered duplicate blocks over
> dozens of files), but it does convey this specific point.
> 2. The kernel will do a byte-wise comparison of all ranges you pass into the
> ioctl at the same time.  Larger block sizes here mean that:
>         a) The extents will be locked longer, which will prevent any I/O to
> the files being deduplicated for the duration of the comparison, which may
> in turn cause other issues on the system.
>         b) The deduplication process will be stuck in uninterruptible sleep
> longer, which on many systems will trigger hung task detection, which will
> in turn either spam the system log or panic the system depending on how it's
> configured.
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
       [not found]     ` <7b0c897f-844c-e7f4-0ce7-c9f888b95983@gmail.com>
@ 2017-01-03 20:20       ` Peter Becker
  2017-01-03 20:40         ` Austin S. Hemmelgarn
  2017-01-03 20:21       ` Fwd: " Peter Becker
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Becker @ 2017-01-03 20:20 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs

I think I understand. The resulting key question is how I can improve
the performance of the extent_same ioctl.
I tested it with the following results:

Environment:
2 files, each called "file", each 100GB in size, duperemove nofiemap
option set, 1MB extent size.

duperemove output:
[0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
[0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
[0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
[0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
1.0M), "/mnt/new/file"

iostat output for a 30 sec. sample

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22,31    0,00   13,83   33,81    0,00   30,05

Device:  rrqm/s  wrqm/s      r/s    w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdd        0,00    1,70  1149,93   0,73  4600,53  139,60     8,24     0,23   0,20    0,19   13,64   0,19  21,84
sde        0,00    0,00  1149,33   0,53  4597,33   23,87     8,04     0,20   0,18    0,18    1,75   0,18  20,47
sdf        0,00    1,70  1149,60   0,63  4598,40  139,60     8,24     0,21   0,18    0,18    4,63   0,18  20,63
sdh        0,00    0,00  1149,33   0,53  4597,33   23,87     8,04     0,21   0,18    0,18    4,25   0,18  20,85
This results in less than 18MB/s read. Really slow.

Question 1: why so slow?
Question 2a: would a higher extent size perform better?
Question 2b: or did I misunderstand something?

2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2017-01-03 14:21, Peter Becker wrote:
>>
>> All invocations are justified, but not relevant in (offline) backup
>> and archive scenarios.
>>
>> For example you have multiple version of append-only log-files or
>> append-only db-files (each more then 100GB in size), like this:
>>
>>> Snapshot_01_01_2017
>>
>> -> file1.log .. 201 GB
>>
>>> Snapshot_02_01_2017
>>
>> -> file1.log .. 205 GB
>>
>>> Snapshot_03_01_2017
>>
>> -> file1.log .. 221 GB
>>
>> The first 201 GB would be every time the same.
>> Files a copied at night from windows, linux or bsd systems and
>> snapshoted after copy.
>>
>> So a fast way to dedupe this is needed. Using 128KB blocks would
>> result in 1646592 extends per Snapshot. 1MB blocksize results in
>> 205.824 extends (not bad, but still terrible speed).
>> I will test it at night with a patched version of duperemove with
>> 100MB blocksize, but I have no hope that the throughput increases
>> thereby.
>
> Deduplication is not a general purpose thing (usually you have very
> specifically structured data), but duperemove is supposed to be a general
> purpose tool.  It works fine for two of the most common cases (deduplicating
> large numbers of small files or small numbers of large files), but it is
> sub-optimal for those cases, and will be for almost any other case.  This is
> a canonical example though of a case where you can use a custom script or
> program to figure out what's duplicated and then have that just call the
> ioctl as appropriate itself.  Most cases where you are storing some kind of
> well structured data fall into this category.  In fact, both of the cases
> where I use deduplication myself fall into such a category.  One case
> involves multiple directories that are partial copies of a larger tree which
> are kept in sync with the larger tree and each other.  In that particular
> case, I care about whole file deduplication, so I have a script that just
> matches on path relative to the roots of each copy and the master copy,
> verifies that the mtime and size are the same, and if so calls the ioctl for
> deduplication (with some fancy processing to fit within the max size
> supported by the ioctl and prevent tiny tail extents).  The other case is a
> set of archives with a pretty solid fixed structure to them.  In that case,
> I have a different script that knows enough about the file structure to know
> where to look for duplicate blocks, thus avoiding having to hash the whole
> files.
>
> The append-only log/database case fits this type of thing perfectly, for
> each subsequent file, you know already that (most of) the file up to the
> length of the previous file is duplicated, so you can just split that
> however you want into chunks and pass those to the dedupe ioctl and free up
> most of the space that would be used by the new file.  You can then run
> duperemove with a hash-file to process any new blocks beyond the point you
> deduplicated up to to reclaim any excess space (currently this will process
> the whole file, but it should see that most of it is deduplicated already).
>
>>
>> For backup and archive scenarios the checksum-feature and the
>> dub-data/metadata-feature of btrfs is realy nice. In particular if one
>> considers the 7 years legally prescribed storage time.
>>
>> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>
>>> On 2016-12-30 15:28, Peter Becker wrote:
>>>>
>>>>
>>>> Hello, i have a 8 TB volume with multiple files with hundreds of GB
>>>> each.
>>>> I try to dedupe this because the first hundred GB of many files are
>>>> identical.
>>>> With 128KB blocksize with nofiemap and lookup-extends=no option, will
>>>> take more then a week (only dedupe, previously hashed). So i tryed -b
>>>> 100M but this returned me an error: "Blocksize is bounded ...".
>>>>
>>>> The reason is that the blocksize is limit to
>>>>
>>>> #define MAX_BLOCKSIZE (1024U*1024)
>>>>
>>>> But i can't found any description why.
>>>
>>>
>>> Beyond what Xin mentioned (namely that 1MB is a much larger block than
>>> will
>>> be duplicated in most data-sets), there are a couple of other reasons:
>>> 1. Smaller blocks will actually get you better deduplication on average
>>> because they're more likely to match.  As an example, assume you have 2
>>> files with the same 8 4k blocks in different orders:
>>>   FileA: 1 2 3 4 5 6 7 8
>>>   FileB: 7 8 5 6 3 4 1 2
>>> In such a case, deduplicating at any block size above 8k would result in
>>> zero deduplication between these files, while 8k or less would completely
>>> deduplicate them.  This is of course a highly specific and somewhat
>>> contrived example (in most cases it will be scattered duplicate blocks
>>> over
>>> dozens of files), but it does convey this specific point.
>>> 2. The kernel will do a byte-wise comparison of all ranges you pass into
>>> the
>>> ioctl at the same time.  Larger block sizes here mean that:
>>>         a) The extents will be locked longer, which will prevent any I/O
>>> to
>>> the files being deduplicated for the duration of the comparison, which
>>> may
>>> in turn cause other issues on the system.
>>>         b) The deduplication process will be stuck in uninterruptible
>>> sleep
>>> longer, which on many systems will trigger hung task detection, which
>>> will
>>> in turn either spam the system log or panic the system depending on how
>>> it's
>>> configured.
>>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Fwd: [markfasheh/duperemove] Why blocksize is limit to 1MB?
       [not found]     ` <7b0c897f-844c-e7f4-0ce7-c9f888b95983@gmail.com>
  2017-01-03 20:20       ` Peter Becker
@ 2017-01-03 20:21       ` Peter Becker
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Becker @ 2017-01-03 20:21 UTC (permalink / raw)
  To: linux-btrfs

---------- Forwarded message ----------
From: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Date: 2017-01-03 20:37 GMT+01:00
Subject: Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
To: Peter Becker <floyd.net@gmail.com>


On 2017-01-03 14:21, Peter Becker wrote:
>
> All invocations are justified, but not relevant in (offline) backup
> and archive scenarios.
>
> For example you have multiple version of append-only log-files or
> append-only db-files (each more then 100GB in size), like this:
>
>> Snapshot_01_01_2017
>
> -> file1.log .. 201 GB
>
>> Snapshot_02_01_2017
>
> -> file1.log .. 205 GB
>
>> Snapshot_03_01_2017
>
> -> file1.log .. 221 GB
>
> The first 201 GB would be every time the same.
> Files a copied at night from windows, linux or bsd systems and
> snapshoted after copy.
>
> So a fast way to dedupe this is needed. Using 128KB blocks would
> result in 1646592 extends per Snapshot. 1MB blocksize results in
> 205.824 extends (not bad, but still terrible speed).
> I will test it at night with a patched version of duperemove with
> 100MB blocksize, but I have no hope that the throughput increases
> thereby.

Deduplication is not a general purpose thing (usually you have very
specifically structured data), but duperemove is supposed to be a
general purpose tool.  It works fine for two of the most common cases
(deduplicating large numbers of small files or small numbers of large
files), but it is sub-optimal for those cases, and will be for almost
any other case.  This is a canonical example though of a case where
you can use a custom script or program to figure out what's duplicated
and then have that just call the ioctl as appropriate itself.  Most
cases where you are storing some kind of well structured data fall
into this category.  In fact, both of the cases where I use
deduplication myself fall into such a category.  One case involves
multiple directories that are partial copies of a larger tree which
are kept in sync with the larger tree and each other.  In that
particular case, I care about whole file deduplication, so I have a
script that just matches on path relative to the roots of each copy
and the master copy, verifies that the mtime and size are the same,
and if so calls the ioctl for deduplication (with some fancy
processing to fit within the max size supported by the ioctl and
prevent tiny tail extents).  The other case is a set of archives with
a pretty solid fixed structure to them.  In that case, I have a
different script that knows enough about the file structure to know
where to look for duplicate blocks, thus avoiding having to hash the
whole files.

The append-only log/database case fits this type of thing perfectly,
for each subsequent file, you know already that (most of) the file up
to the length of the previous file is duplicated, so you can just
split that however you want into chunks and pass those to the dedupe
ioctl and free up most of the space that would be used by the new
file.  You can then run duperemove with a hash-file to process any new
blocks beyond the point you deduplicated up to to reclaim any excess
space (currently this will process the whole file, but it should see
that most of it is deduplicated already).
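
As a rough illustration of that approach, a chunked prefix dedupe could look
something like the sketch below (hypothetical paths, 16MB chunks -- which, as
far as I know, stays within the per-call length the btrfs ioctl accepts -- and
only minimal error handling):

/* Sketch of the append-only case: the first st_size bytes of the older
 * file are assumed to also be the prefix of the newer file, and are
 * offered to the dedupe ioctl chunk by chunk. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>

static int dedupe_prefix(const char *old_path, const char *new_path)
{
	const __u64 chunk = 16ULL * 1024 * 1024;
	int src = open(old_path, O_RDONLY);
	int dst = open(new_path, O_RDWR);
	struct stat st;

	if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
		perror("open/fstat");
		return -1;
	}

	struct file_dedupe_range *req =
		calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
	if (!req)
		return -1;
	req->dest_count = 1;
	req->info[0].dest_fd = dst;

	/* Walk the shared prefix chunk by chunk. The kernel still verifies
	 * every chunk byte for byte before reflinking, so a chunk that is
	 * not actually identical is simply skipped, not corrupted. */
	for (__u64 off = 0; off < (__u64)st.st_size; off += chunk) {
		req->src_offset = off;
		req->src_length = chunk;
		if (off + req->src_length > (__u64)st.st_size)
			req->src_length = (__u64)st.st_size - off;
		req->info[0].dest_offset = off;
		req->info[0].bytes_deduped = 0;
		req->info[0].status = 0;
		if (ioctl(src, FIDEDUPERANGE, req) < 0) {
			perror("FIDEDUPERANGE");
			break;
		}
	}

	free(req);
	close(src);
	close(dst);
	return 0;
}

int main(void)
{
	return dedupe_prefix("/mnt/Snapshot_01_01_2017/file1.log",
			     "/mnt/Snapshot_02_01_2017/file1.log");
}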

>
> For backup and archive scenarios the checksum-feature and the
> dub-data/metadata-feature of btrfs is realy nice. In particular if one
> considers the 7 years legally prescribed storage time.
>
> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>
>> On 2016-12-30 15:28, Peter Becker wrote:
>>>
>>>
>>> Hello, i have a 8 TB volume with multiple files with hundreds of GB each.
>>> I try to dedupe this because the first hundred GB of many files are
>>> identical.
>>> With 128KB blocksize with nofiemap and lookup-extends=no option, will
>>> take more then a week (only dedupe, previously hashed). So i tryed -b
>>> 100M but this returned me an error: "Blocksize is bounded ...".
>>>
>>> The reason is that the blocksize is limit to
>>>
>>> #define MAX_BLOCKSIZE (1024U*1024)
>>>
>>> But i can't found any description why.
>>
>>
>> Beyond what Xin mentioned (namely that 1MB is a much larger block than will
>> be duplicated in most data-sets), there are a couple of other reasons:
>> 1. Smaller blocks will actually get you better deduplication on average
>> because they're more likely to match.  As an example, assume you have 2
>> files with the same 8 4k blocks in different orders:
>>   FileA: 1 2 3 4 5 6 7 8
>>   FileB: 7 8 5 6 3 4 1 2
>> In such a case, deduplicating at any block size above 8k would result in
>> zero deduplication between these files, while 8k or less would completely
>> deduplicate them.  This is of course a highly specific and somewhat
>> contrived example (in most cases it will be scattered duplicate blocks over
>> dozens of files), but it does convey this specific point.
>> 2. The kernel will do a byte-wise comparison of all ranges you pass into the
>> ioctl at the same time.  Larger block sizes here mean that:
>>         a) The extents will be locked longer, which will prevent any I/O to
>> the files being deduplicated for the duration of the comparison, which may
>> in turn cause other issues on the system.
>>         b) The deduplication process will be stuck in uninterruptible sleep
>> longer, which on many systems will trigger hung task detection, which will
>> in turn either spam the system log or panic the system depending on how it's
>> configured.
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-03 20:20       ` Peter Becker
@ 2017-01-03 20:40         ` Austin S. Hemmelgarn
  2017-01-03 21:35           ` Peter Becker
  0 siblings, 1 reply; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-03 20:40 UTC (permalink / raw)
  To: Peter Becker, linux-btrfs

On 2017-01-03 15:20, Peter Becker wrote:
> I think i understand. The resulting keyquestion is, how i can improve
> the performance of extend_same ioctl.
> I tested it with following results:
>
> enviorment:
> 2 files, called "file", size each 100GB, duperemove nofiemap-options
> set, 1MB extend size.
>
> duperemove output:
> [0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
> [0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
> [0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
> [0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
> 1.0M), "/mnt/new/file"
>
> iotop output for a 30 sec. sample
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                22,31    0,00   13,83        33,81    0,00       30,05
>
> Device:         rrqm/s   wrqm/s     r/s            w/s    rkB/s
> wkB/s    avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdd               0,00     1,70          1149,93    0,73  4600,53
> 139,60    8,24       0,23          0,20    0,19   13,64      0,19
> 21,84
> sde               0,00     0,00          1149,33    0,53  4597,33
> 23,87     8,04       0,20          0,18    0,18    1,75       0,18
> 20,47
> sdf                0,00     1,70          1149,60    0,63  4598,40
> 139,60    8,24       0,21          0,18    0,18    4,63       0,18
> 20,63
> sdh               0,00     0,00          1149,33    0,53  4597,33
> 23,87     8,04       0,21          0,18    0,18    4,25       0,18
> 20,85
>
> resulting in less then 18MB/s read. realy slow.
>
> Querstion 1: why, so slow?
For a couple of reasons.  First, you have to understand that duperemove 
itself actually does a pretty large amount of processing outside of the 
call to the ioctl.  It first hashes the blocks for quicker comparison 
and matching, then figures out which blocks match, and finally calls the 
ioctl on the resulting matches.  The reason for this behavior is that 
the ioctl is insanely slow.  It first locks the ranges passed in (so 
they don't get changed by anything else during the deduplication 
process), then does a byte-by-byte comparison to make sure they all 
actually do match (data safety, I've said at least once before that I 
think there should be a flag for the ioctl (or a separate ioctl) to skip 
this and assume that userspace really knows what it's doing), then 
finally sets up the reflinks, and unlocks the new extent.

All of this ties into why I keep telling people that efficient 
deduplication requires a tool that understands how the data being 
deduplicated is structured.  By avoiding the need to hash and compare 
every block of data, you can significantly improve the time that part 
takes, and quite often this will mitigate the impact of getting a few 
false positives passed into the ioctl.
> Questiont 2a: would be a higher extend-size perform better?
> Querstion 2b: or did i understand something wrong?
No, a larger extent would probably not help much, and that's actually a 
really good performance sample you've created.

The block size does end up being somewhat of a trade-off.  Ideally, you 
want it matched to the smallest possible chunk of duplicate data greater 
than or equal to the filesystem block size for maximal space efficiency. 
  Doing this however makes the extra processing done by duperemove take 
exponentially longer because it has to calculate hashes for more blocks 
(this has very low impact until you get to very small block sizes), and 
has to make exponentially more comparisons (this has a very big impact 
as you shrink the block size, just halving the block size will roughly 
quadruple the time it takes to make the comparisons).
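
To put rough numbers on that for an 8 TB data set, here is a sketch of the
naive all-pairs count only; duperemove's hashing and bucketing prune almost
all of these pairs, so treat it purely as an illustration of how the
worst-case work grows as the block size shrinks:

#include <stdio.h>

int main(void)
{
	const unsigned long long data = 8ULL << 40;                /* 8 TB  */
	const unsigned long long bsize[] = { 1 << 20, 128 << 10 }; /* 1M, 128K */

	for (int i = 0; i < 2; i++) {
		unsigned long long n = data / bsize[i];
		/* n blocks -> n*(n-1)/2 candidate pairs if nothing is pre-filtered */
		printf("blocksize %7llu: %9llu blocks, ~%.2e candidate pairs\n",
		       bsize[i], n, (double)n * (n - 1) / 2.0);
	}
	return 0;
}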
>
> 2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> On 2017-01-03 14:21, Peter Becker wrote:
>>>
>>> All invocations are justified, but not relevant in (offline) backup
>>> and archive scenarios.
>>>
>>> For example you have multiple version of append-only log-files or
>>> append-only db-files (each more then 100GB in size), like this:
>>>
>>>> Snapshot_01_01_2017
>>>
>>> -> file1.log .. 201 GB
>>>
>>>> Snapshot_02_01_2017
>>>
>>> -> file1.log .. 205 GB
>>>
>>>> Snapshot_03_01_2017
>>>
>>> -> file1.log .. 221 GB
>>>
>>> The first 201 GB would be every time the same.
>>> Files a copied at night from windows, linux or bsd systems and
>>> snapshoted after copy.
>>>
>>> So a fast way to dedupe this is needed. Using 128KB blocks would
>>> result in 1646592 extends per Snapshot. 1MB blocksize results in
>>> 205.824 extends (not bad, but still terrible speed).
>>> I will test it at night with a patched version of duperemove with
>>> 100MB blocksize, but I have no hope that the throughput increases
>>> thereby.
>>
>> Deduplication is not a general purpose thing (usually you have very
>> specifically structured data), but duperemove is supposed to be a general
>> purpose tool.  It works fine for two of the most common cases (deduplicating
>> large numbers of small files or small numbers of large files), but it is
>> sub-optimal for those cases, and will be for almost any other case.  This is
>> a canonical example though of a case where you can use a custom script or
>> program to figure out what's duplicated and then have that just call the
>> ioctl as appropriate itself.  Most cases where you are storing some kind of
>> well structured data fall into this category.  In fact, both of the cases
>> where I use deduplication myself fall into such a category.  One case
>> involves multiple directories that are partial copies of a larger tree which
>> are kept in sync with the larger tree and each other.  In that particular
>> case, I care about whole file deduplication, so I have a script that just
>> matches on path relative to the roots of each copy and the master copy,
>> verifies that the mtime and size are the same, and if so calls the ioctl for
>> deduplication (with some fancy processing to fit within the max size
>> supported by the ioctl and prevent tiny tail extents).  The other case is a
>> set of archives with a pretty solid fixed structure to them.  In that case,
>> I have a different script that knows enough about the file structure to know
>> where to look for duplicate blocks, thus avoiding having to hash the whole
>> files.
>>
>> The append-only log/database case fits this type of thing perfectly, for
>> each subsequent file, you know already that (most of) the file up to the
>> length of the previous file is duplicated, so you can just split that
>> however you want into chunks and pass those to the dedupe ioctl and free up
>> most of the space that would be used by the new file.  You can then run
>> duperemove with a hash-file to process any new blocks beyond the point you
>> deduplicated up to to reclaim any excess space (currently this will process
>> the whole file, but it should see that most of it is deduplicated already).
>>
>>>
>>> For backup and archive scenarios the checksum-feature and the
>>> dub-data/metadata-feature of btrfs is realy nice. In particular if one
>>> considers the 7 years legally prescribed storage time.
>>>
>>> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>>
>>>> On 2016-12-30 15:28, Peter Becker wrote:
>>>>>
>>>>>
>>>>> Hello, i have a 8 TB volume with multiple files with hundreds of GB
>>>>> each.
>>>>> I try to dedupe this because the first hundred GB of many files are
>>>>> identical.
>>>>> With 128KB blocksize with nofiemap and lookup-extends=no option, will
>>>>> take more then a week (only dedupe, previously hashed). So i tryed -b
>>>>> 100M but this returned me an error: "Blocksize is bounded ...".
>>>>>
>>>>> The reason is that the blocksize is limit to
>>>>>
>>>>> #define MAX_BLOCKSIZE (1024U*1024)
>>>>>
>>>>> But i can't found any description why.
>>>>
>>>>
>>>> Beyond what Xin mentioned (namely that 1MB is a much larger block than
>>>> will
>>>> be duplicated in most data-sets), there are a couple of other reasons:
>>>> 1. Smaller blocks will actually get you better deduplication on average
>>>> because they're more likely to match.  As an example, assume you have 2
>>>> files with the same 8 4k blocks in different orders:
>>>>   FileA: 1 2 3 4 5 6 7 8
>>>>   FileB: 7 8 5 6 3 4 1 2
>>>> In such a case, deduplicating at any block size above 8k would result in
>>>> zero deduplication between these files, while 8k or less would completely
>>>> deduplicate them.  This is of course a highly specific and somewhat
>>>> contrived example (in most cases it will be scattered duplicate blocks
>>>> over
>>>> dozens of files), but it does convey this specific point.
>>>> 2. The kernel will do a byte-wise comparison of all ranges you pass into
>>>> the
>>>> ioctl at the same time.  Larger block sizes here mean that:
>>>>         a) The extents will be locked longer, which will prevent any I/O
>>>> to
>>>> the files being deduplicated for the duration of the comparison, which
>>>> may
>>>> in turn cause other issues on the system.
>>>>         b) The deduplication process will be stuck in uninterruptible
>>>> sleep
>>>> longer, which on many systems will trigger hung task detection, which
>>>> will
>>>> in turn either spam the system log or panic the system depending on how
>>>> it's
>>>> configured.
>>>>
>>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-03 20:40         ` Austin S. Hemmelgarn
@ 2017-01-03 21:35           ` Peter Becker
  2017-01-04 12:58             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Becker @ 2017-01-03 21:35 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Qu Wenruo; +Cc: linux-btrfs

If I understand the duperemove source code correctly (I have been working
on / trying to improve this code for 5 or 6 weeks in multiple areas),
duperemove does the hashing and calculation before it calls extent_same.
Duperemove stores everything in a hashfile and reads from it. After all
files are hashed and duplicates are detected, everything proceeds in order
without reading new data from disk or the hashfile, so the byte-by-byte
comparison in the extent_same ioctl should be able to consume the full
available bandwidth of the disks.

1. dbfile_load_hashes
2. find_all_dupes
3. dedupe_results
-> call the following in N threads:
> dedupe_extent_list
>> list_for_each_entry
>>> add_extent_to_dedupe #produce a simple list/queue
>>> dedupe_extents
>>>> btrfs_extent_same
>>>>> BTRFS_IOC_FILE_EXTENT_SAME

So if this is right, one of these things is really slow:

1. the byte-by-byte comparison
2. setting up the reflinks
3. unlocking the new extent

If my understanding of the duperemove source code is correct, this
behavior should also affect the online dedupe feature that Qu Wenruo
is working on.

2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2017-01-03 15:20, Peter Becker wrote:
>>
>> I think i understand. The resulting keyquestion is, how i can improve
>> the performance of extend_same ioctl.
>> I tested it with following results:
>>
>> enviorment:
>> 2 files, called "file", size each 100GB, duperemove nofiemap-options
>> set, 1MB extend size.
>>
>> duperemove output:
>> [0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
>> [0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
>> [0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
>> [0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
>> 1.0M), "/mnt/new/file"
>>
>> iotop output for a 30 sec. sample
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>                22,31    0,00   13,83        33,81    0,00       30,05
>>
>> Device:         rrqm/s   wrqm/s     r/s            w/s    rkB/s
>> wkB/s    avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdd               0,00     1,70          1149,93    0,73  4600,53
>> 139,60    8,24       0,23          0,20    0,19   13,64      0,19
>> 21,84
>> sde               0,00     0,00          1149,33    0,53  4597,33
>> 23,87     8,04       0,20          0,18    0,18    1,75       0,18
>> 20,47
>> sdf                0,00     1,70          1149,60    0,63  4598,40
>> 139,60    8,24       0,21          0,18    0,18    4,63       0,18
>> 20,63
>> sdh               0,00     0,00          1149,33    0,53  4597,33
>> 23,87     8,04       0,21          0,18    0,18    4,25       0,18
>> 20,85
>>
>> resulting in less then 18MB/s read. realy slow.
>>
>> Querstion 1: why, so slow?
>
> For a couple of reasons.  First, you have to understand that duperemove
> itself actually does a pretty large amount of processing outside of the call
> to the ioctl.  It first hashes the blocks for quicker comparison and
> matching, then figures out which blocks match, and finally calls the ioctl
> on the resulting matches.  The reason for this behavior is that the ioctl is
> insanely slow.  It first locks the ranges passed in (so they don't get
> changed by anything else during the deduplication process), then does a
> byte-by-byte comparison to make sure they all actually do match (data
> safety, I've said at least once before that I think there should be a flag
> for the ioctl (or a separate ioctl) to skip this and assume that userspace
> really knows what it's doing), then finally sets up the reflinks, and
> unlocks the new extent.
>
> All of this ties into why I keep telling people that efficient deduplication
> requires a tool that understands how the data being deduplicated is
> structured.  By avoiding the need to hash and compare every block of data,
> you can significantly improve the time that part takes, and quite often this
> will mitigate the impact of getting a few false positives passed into the
> ioctl.
>>
>> Questiont 2a: would be a higher extend-size perform better?
>> Querstion 2b: or did i understand something wrong?
>
> No, a larger extent would probably not help much, and that's actually a
> really good performance sample you've created.
>
> The block size does end up being somewhat of a trade-off.  Ideally, you want
> it matched to the smallest possible chunk of duplicate data greater than or
> equal to the filesystem block size for maximal space efficiency.  Doing this
> however makes the extra processing done by duperemove take exponentially
> longer because it has to calculate hashes for more blocks (this has very low
> impact until you get to very small block sizes), and has to make
> exponentially more comparisons (this has a very big impact as you shrink the
> block size, just halving the block size will roughly quadruple the time it
> takes to make the comparisons).
>
>>
>> 2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>
>>> On 2017-01-03 14:21, Peter Becker wrote:
>>>>
>>>>
>>>> All invocations are justified, but not relevant in (offline) backup
>>>> and archive scenarios.
>>>>
>>>> For example you have multiple version of append-only log-files or
>>>> append-only db-files (each more then 100GB in size), like this:
>>>>
>>>>> Snapshot_01_01_2017
>>>>
>>>>
>>>> -> file1.log .. 201 GB
>>>>
>>>>> Snapshot_02_01_2017
>>>>
>>>>
>>>> -> file1.log .. 205 GB
>>>>
>>>>> Snapshot_03_01_2017
>>>>
>>>>
>>>> -> file1.log .. 221 GB
>>>>
>>>> The first 201 GB would be every time the same.
>>>> Files a copied at night from windows, linux or bsd systems and
>>>> snapshoted after copy.
>>>>
>>>> So a fast way to dedupe this is needed. Using 128KB blocks would
>>>> result in 1646592 extends per Snapshot. 1MB blocksize results in
>>>> 205.824 extends (not bad, but still terrible speed).
>>>> I will test it at night with a patched version of duperemove with
>>>> 100MB blocksize, but I have no hope that the throughput increases
>>>> thereby.
>>>
>>>
>>> Deduplication is not a general purpose thing (usually you have very
>>> specifically structured data), but duperemove is supposed to be a general
>>> purpose tool.  It works fine for two of the most common cases
>>> (deduplicating
>>> large numbers of small files or small numbers of large files), but it is
>>> sub-optimal for those cases, and will be for almost any other case.  This
>>> is
>>> a canonical example though of a case where you can use a custom script or
>>> program to figure out what's duplicated and then have that just call the
>>> ioctl as appropriate itself.  Most cases where you are storing some kind
>>> of
>>> well structured data fall into this category.  In fact, both of the cases
>>> where I use deduplication myself fall into such a category.  One case
>>> involves multiple directories that are partial copies of a larger tree
>>> which
>>> are kept in sync with the larger tree and each other.  In that particular
>>> case, I care about whole file deduplication, so I have a script that just
>>> matches on path relative to the roots of each copy and the master copy,
>>> verifies that the mtime and size are the same, and if so calls the ioctl
>>> for
>>> deduplication (with some fancy processing to fit within the max size
>>> supported by the ioctl and prevent tiny tail extents).  The other case is
>>> a
>>> set of archives with a pretty solid fixed structure to them.  In that
>>> case,
>>> I have a different script that knows enough about the file structure to
>>> know
>>> where to look for duplicate blocks, thus avoiding having to hash the
>>> whole
>>> files.
>>>
>>> The append-only log/database case fits this type of thing perfectly, for
>>> each subsequent file, you know already that (most of) the file up to the
>>> length of the previous file is duplicated, so you can just split that
>>> however you want into chunks and pass those to the dedupe ioctl and free
>>> up
>>> most of the space that would be used by the new file.  You can then run
>>> duperemove with a hash-file to process any new blocks beyond the point
>>> you
>>> deduplicated up to to reclaim any excess space (currently this will
>>> process
>>> the whole file, but it should see that most of it is deduplicated
>>> already).
>>>
>>>>
>>>> For backup and archive scenarios the checksum-feature and the
>>>> dub-data/metadata-feature of btrfs is realy nice. In particular if one
>>>> considers the 7 years legally prescribed storage time.
>>>>
>>>> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>>>
>>>>>
>>>>> On 2016-12-30 15:28, Peter Becker wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hello, i have a 8 TB volume with multiple files with hundreds of GB
>>>>>> each.
>>>>>> I try to dedupe this because the first hundred GB of many files are
>>>>>> identical.
>>>>>> With 128KB blocksize with nofiemap and lookup-extends=no option, will
>>>>>> take more then a week (only dedupe, previously hashed). So i tryed -b
>>>>>> 100M but this returned me an error: "Blocksize is bounded ...".
>>>>>>
>>>>>> The reason is that the blocksize is limit to
>>>>>>
>>>>>> #define MAX_BLOCKSIZE (1024U*1024)
>>>>>>
>>>>>> But i can't found any description why.
>>>>>
>>>>>
>>>>>
>>>>> Beyond what Xin mentioned (namely that 1MB is a much larger block than
>>>>> will
>>>>> be duplicated in most data-sets), there are a couple of other reasons:
>>>>> 1. Smaller blocks will actually get you better deduplication on average
>>>>> because they're more likely to match.  As an example, assume you have 2
>>>>> files with the same 8 4k blocks in different orders:
>>>>>   FileA: 1 2 3 4 5 6 7 8
>>>>>   FileB: 7 8 5 6 3 4 1 2
>>>>> In such a case, deduplicating at any block size above 8k would result
>>>>> in
>>>>> zero deduplication between these files, while 8k or less would
>>>>> completely
>>>>> deduplicate them.  This is of course a highly specific and somewhat
>>>>> contrived example (in most cases it will be scattered duplicate blocks
>>>>> over
>>>>> dozens of files), but it does convey this specific point.
>>>>> 2. The kernel will do a byte-wise comparison of all ranges you pass
>>>>> into
>>>>> the
>>>>> ioctl at the same time.  Larger block sizes here mean that:
>>>>>         a) The extents will be locked longer, which will prevent any
>>>>> I/O
>>>>> to
>>>>> the files being deduplicated for the duration of the comparison, which
>>>>> may
>>>>> in turn cause other issues on the system.
>>>>>         b) The deduplication process will be stuck in uninterruptible
>>>>> sleep
>>>>> longer, which on many systems will trigger hung task detection, which
>>>>> will
>>>>> in turn either spam the system log or panic the system depending on how
>>>>> it's
>>>>> configured.
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-03 19:24   ` Peter Becker
@ 2017-01-03 23:07     ` Hans van Kranenburg
  2017-01-03 23:12       ` Peter Becker
  0 siblings, 1 reply; 19+ messages in thread
From: Hans van Kranenburg @ 2017-01-03 23:07 UTC (permalink / raw)
  To: Peter Becker, Austin S. Hemmelgarn; +Cc: linux-btrfs

On 01/03/2017 08:24 PM, Peter Becker wrote:
> All invocations are justified, but not relevant in (offline) backup
> and archive scenarios.
> 
> For example you have multiple version of append-only log-files or
> append-only db-files (each more then 100GB in size), like this:
> 
>> Snapshot_01_01_2017
> -> file1.log .. 201 GB
> 
>> Snapshot_02_01_2017
> -> file1.log .. 205 GB
> 
>> Snapshot_03_01_2017
> -> file1.log .. 221 GB
> 
> The first 201 GB would be every time the same.
> Files a copied at night from windows, linux or bsd systems and
> snapshoted after copy.

XY problem?

Why not use rsync --inplace in combination with btrfs snapshots? Even if
the remote does not support rsync and you need to pull the full file
first, you could again use rsync locally.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-03 23:07     ` Hans van Kranenburg
@ 2017-01-03 23:12       ` Peter Becker
  2017-01-03 23:43         ` Hans van Kranenburg
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Becker @ 2017-01-03 23:12 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Austin S. Hemmelgarn, linux-btrfs

Good hint, this would be an option and I will try it.

Regardless, curiosity has gotten hold of me and I will try to
figure out where the problem with the low transfer rate is.

2017-01-04 0:07 GMT+01:00 Hans van Kranenburg <hans.van.kranenburg@mendix.com>:
> On 01/03/2017 08:24 PM, Peter Becker wrote:
>> All invocations are justified, but not relevant in (offline) backup
>> and archive scenarios.
>>
>> For example you have multiple version of append-only log-files or
>> append-only db-files (each more then 100GB in size), like this:
>>
>>> Snapshot_01_01_2017
>> -> file1.log .. 201 GB
>>
>>> Snapshot_02_01_2017
>> -> file1.log .. 205 GB
>>
>>> Snapshot_03_01_2017
>> -> file1.log .. 221 GB
>>
>> The first 201 GB would be every time the same.
>> Files a copied at night from windows, linux or bsd systems and
>> snapshoted after copy.
>
> XY problem?
>
> Why not use rsync --inplace in combination with btrfs snapshots? Even if
> the remote does not support rsync and you need to pull the full file
> first, you could again use rsync locally.
>
> --
> Hans van Kranenburg

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-03 23:12       ` Peter Becker
@ 2017-01-03 23:43         ` Hans van Kranenburg
  2017-01-04  0:08           ` Martin Raiber
  0 siblings, 1 reply; 19+ messages in thread
From: Hans van Kranenburg @ 2017-01-03 23:43 UTC (permalink / raw)
  To: Peter Becker; +Cc: linux-btrfs

On 01/04/2017 12:12 AM, Peter Becker wrote:
> Good hint, this would be an option and i will try this.
> 
> Regardless of this the curiosity has packed me and I will try to
> figure out where the problem with the low transfer rate is.
> 
> 2017-01-04 0:07 GMT+01:00 Hans van Kranenburg <hans.van.kranenburg@mendix.com>:
>> On 01/03/2017 08:24 PM, Peter Becker wrote:
>>> All invocations are justified, but not relevant in (offline) backup
>>> and archive scenarios.
>>>
>>> For example you have multiple version of append-only log-files or
>>> append-only db-files (each more then 100GB in size), like this:
>>>
>>>> Snapshot_01_01_2017
>>> -> file1.log .. 201 GB
>>>
>>>> Snapshot_02_01_2017
>>> -> file1.log .. 205 GB
>>>
>>>> Snapshot_03_01_2017
>>> -> file1.log .. 221 GB
>>>
>>> The first 201 GB would be every time the same.
>>> Files a copied at night from windows, linux or bsd systems and
>>> snapshoted after copy.
>>
>> XY problem?
>>
>> Why not use rsync --inplace in combination with btrfs snapshots? Even if
>> the remote does not support rsync and you need to pull the full file
>> first, you could again use rsync locally.

<annoyed>please don't toppost</annoyed>

Also, there is a rather huge difference in the two approaches, given the
way how btrfs works internally.

Say, I have a subvolume with thousands of directories and millions of
files with random data in it, and I want to have a second deduped copy
of it.

Approach 1:

Create a full copy of everything (compare: retrieving remote file again)
(now 200% of data storage is used), and after that do deduplication, so
that again only 100% of data storage is used.

Approach 2:

cp -av --reflink original/ copy/

By doing this, you end up with the same as doing approach 1 if your
deduper is the most ideal in the world (and the files are so random they
don't contain duplicate blocks inside them).

Approach 3:

btrfs sub snap original copy

W00t, that was fast, and the only thing that happened was writing a few
16kB metadata pages again. (1 for the toplevel tree page that got cloned
into a new filesystem tree, and a few for the blocks one level lower to
add backreferences to the new root).

So:

The big difference in the end result between approaches 1 and 2 on the one
hand and approach 3 on the other is that while deduplicating your data,
you're actually duplicating all your metadata at the same time.

In your situation, if possible, doing an rsync --inplace from the remote,
so that only changed/appended data gets stored, and then using native
btrfs snapshotting would seem the most effective.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-03 23:43         ` Hans van Kranenburg
@ 2017-01-04  0:08           ` Martin Raiber
  0 siblings, 0 replies; 19+ messages in thread
From: Martin Raiber @ 2017-01-04  0:08 UTC (permalink / raw)
  To: Hans van Kranenburg, Peter Becker; +Cc: linux-btrfs


On 04.01.2017 00:43 Hans van Kranenburg wrote:
> On 01/04/2017 12:12 AM, Peter Becker wrote:
>> Good hint, this would be an option and i will try this.
>>
>> Regardless of this the curiosity has packed me and I will try to
>> figure out where the problem with the low transfer rate is.
>>
>> 2017-01-04 0:07 GMT+01:00 Hans van Kranenburg <hans.van.kranenburg@mendix.com>:
>>> On 01/03/2017 08:24 PM, Peter Becker wrote:
>>>> All invocations are justified, but not relevant in (offline) backup
>>>> and archive scenarios.
>>>>
>>>> For example you have multiple version of append-only log-files or
>>>> append-only db-files (each more then 100GB in size), like this:
>>>>
>>>>> Snapshot_01_01_2017
>>>> -> file1.log .. 201 GB
>>>>
>>>>> Snapshot_02_01_2017
>>>> -> file1.log .. 205 GB
>>>>
>>>>> Snapshot_03_01_2017
>>>> -> file1.log .. 221 GB
>>>>
>>>> The first 201 GB would be every time the same.
>>>> Files a copied at night from windows, linux or bsd systems and
>>>> snapshoted after copy.
>>> XY problem?
>>>
>>> Why not use rsync --inplace in combination with btrfs snapshots? Even if
>>> the remote does not support rsync and you need to pull the full file
>>> first, you could again use rsync locally.
> <annoyed>please don't toppost</annoyed>
>
> Also, there is a rather huge difference in the two approaches, given the
> way how btrfs works internally.
>
> Say, I have a subvolume with thousands of directories and millions of
> files with random data in it, and I want to have a second deduped copy
> of it.
>
> Approach 1:
>
> Create a full copy of everything (compare: retrieving remote file again)
> (now 200% of data storage is used), and after that do deduplication, so
> that again only 100% of data storage is used.
>
> Approach 2:
>
> cp -av --reflink original/ copy/
>
> By doing this, you end up with the same as doing approach 1 if your
> deduper is the most ideal in the world (and the files are so random they
> don't contain duplicate blocks inside them).
>
> Approach 3:
>
> btrfs sub snap original copy
>
> W00t, that was fast, and the only thing that happened was writing a few
> 16kB metadata pages again. (1 for the toplevel tree page that got cloned
> into a new filesystem tree, and a few for the blocks one level lower to
> add backreferences to the new root).
>
> So:
>
> The big difference in the end result between approach 1,2 and otoh 3 is
> that while deduplicating your data, you're actually duplicating all your
> metadata at the same time.
>
> In your situation, if possible doing an rsync --inplace from the remote,
> so that only changed appended data gets stored, and then useing native
> btrfs snapshotting it would seem the most effective.
>
Or use UrBackup as backup software. It uses the snapshot-then-modify
approach with btrfs, plus you get file-level deduplication between
clients using reflinks.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-03 21:35           ` Peter Becker
@ 2017-01-04 12:58             ` Austin S. Hemmelgarn
  2017-01-04 14:42               ` Peter Becker
  2017-01-09  1:09               ` Zygo Blaxell
  0 siblings, 2 replies; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-04 12:58 UTC (permalink / raw)
  To: Peter Becker, Qu Wenruo; +Cc: linux-btrfs

On 2017-01-03 16:35, Peter Becker wrote:
> As i understand the duperemove source-code right (i work on/ try to
> improve this code since 5 or 6 weeks on multiple parts), duperemove
> does hashing and calculation before they call extend_same.
> Duperemove stores all in a hashfile and read this. after all files
> hashed, and duplicates detected, the progress all in order without
> reading new data form disk / hashfile. so the byte-by-byte comparison
> of extend_same ioctl should consume the full possible bandwidth of the
> disks.
Not necessarily.  You've actually got a significant amount of processing 
between each disk operation.  General ordering inside the ioctl is:
1. Do generic ioctl setup.
2. Lock the extents.
3. Read the ranges into memory.
4. Compare the ranges.
5. If the ranges are identical, write out the changes needed to reflink 
them.
6. Unlock all the extents.
7. Do generic ioctl cleanup.
1 and 7 in particular are pretty heavy.  Ioctls were not intended to be 
called with this kind of frequency, and that fact really shows in the 
setup and teardown (overhead is way higher than a syscall).  The 
operation ended up being an ioctl instead of a syscall (or extension to 
another syscall) because:
1. Manipulating low-level filesystem state is part of what they're 
intended to be used for.
2. Introducing a new FS specific ioctl is a whole lot less controversial 
than introducing a new FS specific syscall.
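
For reference, a minimal userspace sketch of a single call to the ioctl
looks roughly like this (the paths, offsets and the 1MB length are made
up for illustration, and error handling is trimmed to the essentials):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>	/* BTRFS_IOC_FILE_EXTENT_SAME */

int main(void)
{
	/* illustrative paths; any two files on the same btrfs will do */
	int src = open("/mnt/old/file", O_RDONLY);
	int dst = open("/mnt/new/file", O_RDWR);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* header plus one destination range; real callers usually pack
	 * many destination ranges into a single call */
	struct btrfs_ioctl_same_args *args =
		calloc(1, sizeof(*args) +
			  sizeof(struct btrfs_ioctl_same_extent_info));
	if (!args)
		return 1;

	args->logical_offset = 0;		/* source offset */
	args->length = 1024 * 1024;		/* one 1MB block */
	args->dest_count = 1;
	args->info[0].fd = dst;
	args->info[0].logical_offset = 0;	/* destination offset */

	/* steps 1-7 above all happen inside this one call */
	if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0) {
		perror("BTRFS_IOC_FILE_EXTENT_SAME");
		return 1;
	}

	printf("status=%d bytes_deduped=%llu\n", args->info[0].status,
	       (unsigned long long)args->info[0].bytes_deduped);
	free(args);
	return 0;
}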
>
> 1. dbfile_load_hashes
> 2. find_all_dupes
> 3. dedupe_results
> -> call the following in N threads:
>> dedupe_extent_list
>>> list_for_each_entry
>>>> add_extent_to_dedupe #produce a simple list/queue
>>>> dedupe_extents
>>>>> btrfs_extent_same
>>>>>> BTRFS_IOC_FILE_EXTENT_SAME
>
> So if this right, one of this thinks is realy slow:
>
> 1. byte-per-byte comparison
There's no way that this part can't be slow.  You need to load the data
into the registers to do the comparison; you can't just point something
at RAM and get an answer.  On x86, this in turn means that the
comparison amounts to a loop of 2 loads followed by a compare and a
branch, repeated once for each range beyond the first, and that's
assuming that the compiler optimizes it to the greatest degree possible.
  On some other systems the compare and branch are one instruction, on 
others the second load might be eliminated, but overall it's not 
something that can be sped up all that much.
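
As a very rough sketch (not the kernel's actual code, just the shape of
the work described above):

#include <stddef.h>

/*
 * Two loads, a compare and a branch per position, repeated for each
 * destination range beyond the first.
 */
static int ranges_identical(const unsigned char *src,
			    const unsigned char *dst, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (src[i] != dst[i])	/* load, load, compare, branch */
			return 0;
	}
	return 1;
}
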
> 2. sets up the reflinks
This actually is not as efficient as it sounds like it should be: adding
reflinks means updating metadata, which means that there is some
unavoidable overhead here.  I doubt that it's where the issue is, but I 
may be wrong.
> 3. unlocks the new extent
There's one other aspect not listed here, locking the original extents, 
which can actually add quite a lot of overhead if the files are actually 
being used.
>
> If i'm not wrong with my understanding of the duperemove source code,
> this behaivor should also affected the online dedupe feature on with
> Qu Wenruo works.
AFAIK, that uses a different code path from the batch deduplication 
ioctl.  It also doesn't have the context switches and other overhead 
from an ioctl involved, because it's done in kernel code.
>
> 2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> On 2017-01-03 15:20, Peter Becker wrote:
>>>
>>> I think i understand. The resulting keyquestion is, how i can improve
>>> the performance of extend_same ioctl.
>>> I tested it with following results:
>>>
>>> enviorment:
>>> 2 files, called "file", size each 100GB, duperemove nofiemap-options
>>> set, 1MB extend size.
>>>
>>> duperemove output:
>>> [0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
>>> [0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
>>> [0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
>>> [0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
>>> 1.0M), "/mnt/new/file"
>>>
>>> iotop output for a 30 sec. sample
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>                22,31    0,00   13,83        33,81    0,00       30,05
>>>
>>> Device:         rrqm/s   wrqm/s     r/s            w/s    rkB/s
>>> wkB/s    avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sdd               0,00     1,70          1149,93    0,73  4600,53
>>> 139,60    8,24       0,23          0,20    0,19   13,64      0,19
>>> 21,84
>>> sde               0,00     0,00          1149,33    0,53  4597,33
>>> 23,87     8,04       0,20          0,18    0,18    1,75       0,18
>>> 20,47
>>> sdf                0,00     1,70          1149,60    0,63  4598,40
>>> 139,60    8,24       0,21          0,18    0,18    4,63       0,18
>>> 20,63
>>> sdh               0,00     0,00          1149,33    0,53  4597,33
>>> 23,87     8,04       0,21          0,18    0,18    4,25       0,18
>>> 20,85
>>>
>>> resulting in less then 18MB/s read. realy slow.
>>>
>>> Querstion 1: why, so slow?
>>
>> For a couple of reasons.  First, you have to understand that duperemove
>> itself actually does a pretty large amount of processing outside of the call
>> to the ioctl.  It first hashes the blocks for quicker comparison and
>> matching, then figures out which blocks match, and finally calls the ioctl
>> on the resulting matches.  The reason for this behavior is that the ioctl is
>> insanely slow.  It first locks the ranges passed in (so they don't get
>> changed by anything else during the deduplication process), then does a
>> byte-by-byte comparison to make sure they all actually do match (data
>> safety, I've said at least once before that I think there should be a flag
>> for the ioctl (or a separate ioctl) to skip this and assume that userspace
>> really knows what it's doing), then finally sets up the reflinks, and
>> unlocks the new extent.
>>
>> All of this ties into why I keep telling people that efficient deduplication
>> requires a tool that understands how the data being deduplicated is
>> structured.  By avoiding the need to hash and compare every block of data,
>> you can significantly improve the time that part takes, and quite often this
>> will mitigate the impact of getting a few false positives passed into the
>> ioctl.
>>>
>>> Questiont 2a: would be a higher extend-size perform better?
>>> Querstion 2b: or did i understand something wrong?
>>
>> No, a larger extent would probably not help much, and that's actually a
>> really good performance sample you've created.
>>
>> The block size does end up being somewhat of a trade-off.  Ideally, you want
>> it matched to the smallest possible chunk of duplicate data greater than or
>> equal to the filesystem block size for maximal space efficiency.  Doing this
>> however makes the extra processing done by duperemove take exponentially
>> longer because it has to calculate hashes for more blocks (this has very low
>> impact until you get to very small block sizes), and has to make
>> exponentially more comparisons (this has a very big impact as you shrink the
>> block size, just halving the block size will roughly quadruple the time it
>> takes to make the comparisons).
>>
>>>
>>> 2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>>
>>>> On 2017-01-03 14:21, Peter Becker wrote:
>>>>>
>>>>>
>>>>> All invocations are justified, but not relevant in (offline) backup
>>>>> and archive scenarios.
>>>>>
>>>>> For example you have multiple version of append-only log-files or
>>>>> append-only db-files (each more then 100GB in size), like this:
>>>>>
>>>>>> Snapshot_01_01_2017
>>>>>
>>>>>
>>>>> -> file1.log .. 201 GB
>>>>>
>>>>>> Snapshot_02_01_2017
>>>>>
>>>>>
>>>>> -> file1.log .. 205 GB
>>>>>
>>>>>> Snapshot_03_01_2017
>>>>>
>>>>>
>>>>> -> file1.log .. 221 GB
>>>>>
>>>>> The first 201 GB would be every time the same.
>>>>> Files a copied at night from windows, linux or bsd systems and
>>>>> snapshoted after copy.
>>>>>
>>>>> So a fast way to dedupe this is needed. Using 128KB blocks would
>>>>> result in 1646592 extends per Snapshot. 1MB blocksize results in
>>>>> 205.824 extends (not bad, but still terrible speed).
>>>>> I will test it at night with a patched version of duperemove with
>>>>> 100MB blocksize, but I have no hope that the throughput increases
>>>>> thereby.
>>>>
>>>>
>>>> Deduplication is not a general purpose thing (usually you have very
>>>> specifically structured data), but duperemove is supposed to be a general
>>>> purpose tool.  It works fine for two of the most common cases
>>>> (deduplicating
>>>> large numbers of small files or small numbers of large files), but it is
>>>> sub-optimal for those cases, and will be for almost any other case.  This
>>>> is
>>>> a canonical example though of a case where you can use a custom script or
>>>> program to figure out what's duplicated and then have that just call the
>>>> ioctl as appropriate itself.  Most cases where you are storing some kind
>>>> of
>>>> well structured data fall into this category.  In fact, both of the cases
>>>> where I use deduplication myself fall into such a category.  One case
>>>> involves multiple directories that are partial copies of a larger tree
>>>> which
>>>> are kept in sync with the larger tree and each other.  In that particular
>>>> case, I care about whole file deduplication, so I have a script that just
>>>> matches on path relative to the roots of each copy and the master copy,
>>>> verifies that the mtime and size are the same, and if so calls the ioctl
>>>> for
>>>> deduplication (with some fancy processing to fit within the max size
>>>> supported by the ioctl and prevent tiny tail extents).  The other case is
>>>> a
>>>> set of archives with a pretty solid fixed structure to them.  In that
>>>> case,
>>>> I have a different script that knows enough about the file structure to
>>>> know
>>>> where to look for duplicate blocks, thus avoiding having to hash the
>>>> whole
>>>> files.
>>>>
>>>> The append-only log/database case fits this type of thing perfectly, for
>>>> each subsequent file, you know already that (most of) the file up to the
>>>> length of the previous file is duplicated, so you can just split that
>>>> however you want into chunks and pass those to the dedupe ioctl and free
>>>> up
>>>> most of the space that would be used by the new file.  You can then run
>>>> duperemove with a hash-file to process any new blocks beyond the point
>>>> you
>>>> deduplicated up to to reclaim any excess space (currently this will
>>>> process
>>>> the whole file, but it should see that most of it is deduplicated
>>>> already).
>>>>
>>>>>
>>>>> For backup and archive scenarios the checksum-feature and the
>>>>> dub-data/metadata-feature of btrfs is realy nice. In particular if one
>>>>> considers the 7 years legally prescribed storage time.
>>>>>
>>>>> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>>>>
>>>>>>
>>>>>> On 2016-12-30 15:28, Peter Becker wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello, i have a 8 TB volume with multiple files with hundreds of GB
>>>>>>> each.
>>>>>>> I try to dedupe this because the first hundred GB of many files are
>>>>>>> identical.
>>>>>>> With 128KB blocksize with nofiemap and lookup-extends=no option, will
>>>>>>> take more then a week (only dedupe, previously hashed). So i tryed -b
>>>>>>> 100M but this returned me an error: "Blocksize is bounded ...".
>>>>>>>
>>>>>>> The reason is that the blocksize is limit to
>>>>>>>
>>>>>>> #define MAX_BLOCKSIZE (1024U*1024)
>>>>>>>
>>>>>>> But i can't found any description why.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Beyond what Xin mentioned (namely that 1MB is a much larger block than
>>>>>> will
>>>>>> be duplicated in most data-sets), there are a couple of other reasons:
>>>>>> 1. Smaller blocks will actually get you better deduplication on average
>>>>>> because they're more likely to match.  As an example, assume you have 2
>>>>>> files with the same 8 4k blocks in different orders:
>>>>>>   FileA: 1 2 3 4 5 6 7 8
>>>>>>   FileB: 7 8 5 6 3 4 1 2
>>>>>> In such a case, deduplicating at any block size above 8k would result
>>>>>> in
>>>>>> zero deduplication between these files, while 8k or less would
>>>>>> completely
>>>>>> deduplicate them.  This is of course a highly specific and somewhat
>>>>>> contrived example (in most cases it will be scattered duplicate blocks
>>>>>> over
>>>>>> dozens of files), but it does convey this specific point.
>>>>>> 2. The kernel will do a byte-wise comparison of all ranges you pass
>>>>>> into
>>>>>> the
>>>>>> ioctl at the same time.  Larger block sizes here mean that:
>>>>>>         a) The extents will be locked longer, which will prevent any
>>>>>> I/O
>>>>>> to
>>>>>> the files being deduplicated for the duration of the comparison, which
>>>>>> may
>>>>>> in turn cause other issues on the system.
>>>>>>         b) The deduplication process will be stuck in uninterruptible
>>>>>> sleep
>>>>>> longer, which on many systems will trigger hung task detection, which
>>>>>> will
>>>>>> in turn either spam the system log or panic the system depending on how
>>>>>> it's
>>>>>> configured.
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-04 12:58             ` Austin S. Hemmelgarn
@ 2017-01-04 14:42               ` Peter Becker
  2017-01-09  1:09               ` Zygo Blaxell
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Becker @ 2017-01-04 14:42 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Qu Wenruo, linux-btrfs

Thank you for the information and clarifications. This helps me to
understand the operation somewhat better. I will continue to look
into the subject.

Regardless of this, I will change the structure of my data in my
use case and rely on rsync --inplace --no-whole-file.

2017-01-04 13:58 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2017-01-03 16:35, Peter Becker wrote:
>>
>> As i understand the duperemove source-code right (i work on/ try to
>> improve this code since 5 or 6 weeks on multiple parts), duperemove
>> does hashing and calculation before they call extend_same.
>> Duperemove stores all in a hashfile and read this. after all files
>> hashed, and duplicates detected, the progress all in order without
>> reading new data form disk / hashfile. so the byte-by-byte comparison
>> of extend_same ioctl should consume the full possible bandwidth of the
>> disks.
>
> Not necessarily.  You've actually got a significant amount of processing
> between each disk operation.  General ordering inside the ioctl is:
> 1. Do generic ioctl setup.
> 2. Lock the extents.
> 3. Read the ranges into memory.
> 4. Compare the ranges.
> 5. If the ranges are identical, write out the changes needed to reflink
> them.
> 6. Unlock all the extents.
> 7. Do generic ioctl cleanup.
> 1 and 7 in particular are pretty heavy.  Ioctls were not intended to be
> called with this kind of frequency, and that fact really shows in the setup
> and teardown (overhead is way higher than a syscall).  The operation ended
> up being an ioctl instead of a syscall (or extension to another syscall)
> because:
> 1. Manipulating low-level filesystem state is part of what they're intended
> to be used for.
> 2. Introducing a new FS specific ioctl is a whole lot less controversial
> than introducing a new FS specific syscall.
>>
>>
>> 1. dbfile_load_hashes
>> 2. find_all_dupes
>> 3. dedupe_results
>> -> call the following in N threads:
>>>
>>> dedupe_extent_list
>>>>
>>>> list_for_each_entry
>>>>>
>>>>> add_extent_to_dedupe #produce a simple list/queue
>>>>> dedupe_extents
>>>>>>
>>>>>> btrfs_extent_same
>>>>>>>
>>>>>>> BTRFS_IOC_FILE_EXTENT_SAME
>>
>>
>> So if this right, one of this thinks is realy slow:
>>
>> 1. byte-per-byte comparison
>
> There's no way that this part can't be slow.  You need to load the data into
> the registers to do the comparison, you can't just point something at RAM
> and get an answer.  On x86, this in turn means that the comparison amounts
> to a loop of 2 loads followed by a compare and a branch for , repeated once
> for each range beyond the first, and that's assuming that the compiler
> optimizes it to the greatest degree possible.  On some other systems the
> compare and branch are one instruction, on others the second load might be
> eliminated, but overall it's not something that can be sped up all that
> much.
>>
>> 2. sets up the reflinks
>
> This actually is not as efficient as it sounds like it should be, adding
> reflinks means updating metadata, which means that there is some unavoidable
> overhead here.  I doubt that it's where the issue is, but I may be wrong.
>>
>> 3. unlocks the new extent
>
> There's one other aspect not listed here, locking the original extents,
> which can actually add quite a lot of overhead if the files are actually
> being used.
>>
>>
>> If i'm not wrong with my understanding of the duperemove source code,
>> this behaivor should also affected the online dedupe feature on with
>> Qu Wenruo works.
>
> AFAIK, that uses a different code path from the batch deduplication ioctl.
> It also doesn't have the context switches and other overhead from an ioctl
> involved, because it's done in kernel code.
>
>>
>> 2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>
>>> On 2017-01-03 15:20, Peter Becker wrote:
>>>>
>>>>
>>>> I think i understand. The resulting keyquestion is, how i can improve
>>>> the performance of extend_same ioctl.
>>>> I tested it with following results:
>>>>
>>>> enviorment:
>>>> 2 files, called "file", size each 100GB, duperemove nofiemap-options
>>>> set, 1MB extend size.
>>>>
>>>> duperemove output:
>>>> [0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
>>>> [0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
>>>> [0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
>>>> [0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
>>>> 1.0M), "/mnt/new/file"
>>>>
>>>> iotop output for a 30 sec. sample
>>>>
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>                22,31    0,00   13,83        33,81    0,00       30,05
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s            w/s    rkB/s
>>>> wkB/s    avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sdd               0,00     1,70          1149,93    0,73  4600,53
>>>> 139,60    8,24       0,23          0,20    0,19   13,64      0,19
>>>> 21,84
>>>> sde               0,00     0,00          1149,33    0,53  4597,33
>>>> 23,87     8,04       0,20          0,18    0,18    1,75       0,18
>>>> 20,47
>>>> sdf                0,00     1,70          1149,60    0,63  4598,40
>>>> 139,60    8,24       0,21          0,18    0,18    4,63       0,18
>>>> 20,63
>>>> sdh               0,00     0,00          1149,33    0,53  4597,33
>>>> 23,87     8,04       0,21          0,18    0,18    4,25       0,18
>>>> 20,85
>>>>
>>>> resulting in less then 18MB/s read. realy slow.
>>>>
>>>> Querstion 1: why, so slow?
>>>
>>>
>>> For a couple of reasons.  First, you have to understand that duperemove
>>> itself actually does a pretty large amount of processing outside of the
>>> call
>>> to the ioctl.  It first hashes the blocks for quicker comparison and
>>> matching, then figures out which blocks match, and finally calls the
>>> ioctl
>>> on the resulting matches.  The reason for this behavior is that the ioctl
>>> is
>>> insanely slow.  It first locks the ranges passed in (so they don't get
>>> changed by anything else during the deduplication process), then does a
>>> byte-by-byte comparison to make sure they all actually do match (data
>>> safety, I've said at least once before that I think there should be a
>>> flag
>>> for the ioctl (or a separate ioctl) to skip this and assume that
>>> userspace
>>> really knows what it's doing), then finally sets up the reflinks, and
>>> unlocks the new extent.
>>>
>>> All of this ties into why I keep telling people that efficient
>>> deduplication
>>> requires a tool that understands how the data being deduplicated is
>>> structured.  By avoiding the need to hash and compare every block of
>>> data,
>>> you can significantly improve the time that part takes, and quite often
>>> this
>>> will mitigate the impact of getting a few false positives passed into the
>>> ioctl.
>>>>
>>>>
>>>> Questiont 2a: would be a higher extend-size perform better?
>>>> Querstion 2b: or did i understand something wrong?
>>>
>>>
>>> No, a larger extent would probably not help much, and that's actually a
>>> really good performance sample you've created.
>>>
>>> The block size does end up being somewhat of a trade-off.  Ideally, you
>>> want
>>> it matched to the smallest possible chunk of duplicate data greater than
>>> or
>>> equal to the filesystem block size for maximal space efficiency.  Doing
>>> this
>>> however makes the extra processing done by duperemove take exponentially
>>> longer because it has to calculate hashes for more blocks (this has very
>>> low
>>> impact until you get to very small block sizes), and has to make
>>> exponentially more comparisons (this has a very big impact as you shrink
>>> the
>>> block size, just halving the block size will roughly quadruple the time
>>> it
>>> takes to make the comparisons).
>>>
>>>>
>>>> 2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>>>
>>>>>
>>>>> On 2017-01-03 14:21, Peter Becker wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> All invocations are justified, but not relevant in (offline) backup
>>>>>> and archive scenarios.
>>>>>>
>>>>>> For example you have multiple version of append-only log-files or
>>>>>> append-only db-files (each more then 100GB in size), like this:
>>>>>>
>>>>>>> Snapshot_01_01_2017
>>>>>>
>>>>>>
>>>>>>
>>>>>> -> file1.log .. 201 GB
>>>>>>
>>>>>>> Snapshot_02_01_2017
>>>>>>
>>>>>>
>>>>>>
>>>>>> -> file1.log .. 205 GB
>>>>>>
>>>>>>> Snapshot_03_01_2017
>>>>>>
>>>>>>
>>>>>>
>>>>>> -> file1.log .. 221 GB
>>>>>>
>>>>>> The first 201 GB would be every time the same.
>>>>>> Files a copied at night from windows, linux or bsd systems and
>>>>>> snapshoted after copy.
>>>>>>
>>>>>> So a fast way to dedupe this is needed. Using 128KB blocks would
>>>>>> result in 1646592 extends per Snapshot. 1MB blocksize results in
>>>>>> 205.824 extends (not bad, but still terrible speed).
>>>>>> I will test it at night with a patched version of duperemove with
>>>>>> 100MB blocksize, but I have no hope that the throughput increases
>>>>>> thereby.
>>>>>
>>>>>
>>>>>
>>>>> Deduplication is not a general purpose thing (usually you have very
>>>>> specifically structured data), but duperemove is supposed to be a
>>>>> general
>>>>> purpose tool.  It works fine for two of the most common cases
>>>>> (deduplicating
>>>>> large numbers of small files or small numbers of large files), but it
>>>>> is
>>>>> sub-optimal for those cases, and will be for almost any other case.
>>>>> This
>>>>> is
>>>>> a canonical example though of a case where you can use a custom script
>>>>> or
>>>>> program to figure out what's duplicated and then have that just call
>>>>> the
>>>>> ioctl as appropriate itself.  Most cases where you are storing some
>>>>> kind
>>>>> of
>>>>> well structured data fall into this category.  In fact, both of the
>>>>> cases
>>>>> where I use deduplication myself fall into such a category.  One case
>>>>> involves multiple directories that are partial copies of a larger tree
>>>>> which
>>>>> are kept in sync with the larger tree and each other.  In that
>>>>> particular
>>>>> case, I care about whole file deduplication, so I have a script that
>>>>> just
>>>>> matches on path relative to the roots of each copy and the master copy,
>>>>> verifies that the mtime and size are the same, and if so calls the
>>>>> ioctl
>>>>> for
>>>>> deduplication (with some fancy processing to fit within the max size
>>>>> supported by the ioctl and prevent tiny tail extents).  The other case
>>>>> is
>>>>> a
>>>>> set of archives with a pretty solid fixed structure to them.  In that
>>>>> case,
>>>>> I have a different script that knows enough about the file structure to
>>>>> know
>>>>> where to look for duplicate blocks, thus avoiding having to hash the
>>>>> whole
>>>>> files.
>>>>>
>>>>> The append-only log/database case fits this type of thing perfectly,
>>>>> for
>>>>> each subsequent file, you know already that (most of) the file up to
>>>>> the
>>>>> length of the previous file is duplicated, so you can just split that
>>>>> however you want into chunks and pass those to the dedupe ioctl and
>>>>> free
>>>>> up
>>>>> most of the space that would be used by the new file.  You can then run
>>>>> duperemove with a hash-file to process any new blocks beyond the point
>>>>> you
>>>>> deduplicated up to to reclaim any excess space (currently this will
>>>>> process
>>>>> the whole file, but it should see that most of it is deduplicated
>>>>> already).
>>>>>
>>>>>>
>>>>>> For backup and archive scenarios the checksum-feature and the
>>>>>> dub-data/metadata-feature of btrfs is realy nice. In particular if one
>>>>>> considers the 7 years legally prescribed storage time.
>>>>>>
>>>>>> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn
>>>>>> <ahferroin7@gmail.com>:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2016-12-30 15:28, Peter Becker wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hello, i have a 8 TB volume with multiple files with hundreds of GB
>>>>>>>> each.
>>>>>>>> I try to dedupe this because the first hundred GB of many files are
>>>>>>>> identical.
>>>>>>>> With 128KB blocksize with nofiemap and lookup-extends=no option,
>>>>>>>> will
>>>>>>>> take more then a week (only dedupe, previously hashed). So i tryed
>>>>>>>> -b
>>>>>>>> 100M but this returned me an error: "Blocksize is bounded ...".
>>>>>>>>
>>>>>>>> The reason is that the blocksize is limit to
>>>>>>>>
>>>>>>>> #define MAX_BLOCKSIZE (1024U*1024)
>>>>>>>>
>>>>>>>> But i can't found any description why.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Beyond what Xin mentioned (namely that 1MB is a much larger block
>>>>>>> than
>>>>>>> will
>>>>>>> be duplicated in most data-sets), there are a couple of other
>>>>>>> reasons:
>>>>>>> 1. Smaller blocks will actually get you better deduplication on
>>>>>>> average
>>>>>>> because they're more likely to match.  As an example, assume you have
>>>>>>> 2
>>>>>>> files with the same 8 4k blocks in different orders:
>>>>>>>   FileA: 1 2 3 4 5 6 7 8
>>>>>>>   FileB: 7 8 5 6 3 4 1 2
>>>>>>> In such a case, deduplicating at any block size above 8k would result
>>>>>>> in
>>>>>>> zero deduplication between these files, while 8k or less would
>>>>>>> completely
>>>>>>> deduplicate them.  This is of course a highly specific and somewhat
>>>>>>> contrived example (in most cases it will be scattered duplicate
>>>>>>> blocks
>>>>>>> over
>>>>>>> dozens of files), but it does convey this specific point.
>>>>>>> 2. The kernel will do a byte-wise comparison of all ranges you pass
>>>>>>> into
>>>>>>> the
>>>>>>> ioctl at the same time.  Larger block sizes here mean that:
>>>>>>>         a) The extents will be locked longer, which will prevent any
>>>>>>> I/O
>>>>>>> to
>>>>>>> the files being deduplicated for the duration of the comparison,
>>>>>>> which
>>>>>>> may
>>>>>>> in turn cause other issues on the system.
>>>>>>>         b) The deduplication process will be stuck in uninterruptible
>>>>>>> sleep
>>>>>>> longer, which on many systems will trigger hung task detection, which
>>>>>>> will
>>>>>>> in turn either spam the system log or panic the system depending on
>>>>>>> how
>>>>>>> it's
>>>>>>> configured.
>>>>>>>
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-04 12:58             ` Austin S. Hemmelgarn
  2017-01-04 14:42               ` Peter Becker
@ 2017-01-09  1:09               ` Zygo Blaxell
  2017-01-09  9:29                 ` Peter Becker
  1 sibling, 1 reply; 19+ messages in thread
From: Zygo Blaxell @ 2017-01-09  1:09 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Peter Becker, Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 16484 bytes --]

On Wed, Jan 04, 2017 at 07:58:55AM -0500, Austin S. Hemmelgarn wrote:
> On 2017-01-03 16:35, Peter Becker wrote:
> >As i understand the duperemove source-code right (i work on/ try to
> >improve this code since 5 or 6 weeks on multiple parts), duperemove
> >does hashing and calculation before they call extend_same.
> >Duperemove stores all in a hashfile and read this. after all files
> >hashed, and duplicates detected, the progress all in order without
> >reading new data form disk / hashfile. so the byte-by-byte comparison
> >of extend_same ioctl should consume the full possible bandwidth of the
> >disks.
> Not necessarily.  You've actually got a significant amount of processing
> between each disk operation.  General ordering inside the ioctl is:
> 1. Do generic ioctl setup.
> 2. Lock the extents.
> 3. Read the ranges into memory.
> 4. Compare the ranges.
> 5. If the ranges are identical, write out the changes needed to reflink
> them.
> 6. Unlock all the extents.
> 7. Do generic ioctl cleanup.
> 1 and 7 in particular are pretty heavy.  Ioctls were not intended to be
> called with this kind of frequency, and that fact really shows in the setup
> and teardown (overhead is way higher than a syscall).

Steps 1 and 7 are not heavy at all.  ioctl setup costs an order of
magnitude more than other system calls, but is still up to 11 orders
of magnitude faster than the other steps.  The other steps are *slow*,
and step 5
is orders of magnitude slower than all the others combined.

Most of the time in step 5 is spent deleting the dst extent refs
(or waiting for transaction commits, but everything waits for those).
It gets worse when you have big files (1G and larger), more extents,
and more extent references in the same inode.  On a 100G file the overhead
of manipulating shared extent refs is so large that the rest of the
extent-same ioctl is just noise by comparison (microseconds vs minutes).

The commit 1d57ee9 "btrfs: improve delayed refs iterations" (merged in
v4.10-rc1) helps a bit with this, but deleting shared refs is still
one of the most expensive things you can do in btrfs.

> The operation ended
> up being an ioctl instead of a syscall (or extension to another syscall)
> because:
> 1. Manipulating low-level filesystem state is part of what they're intended
> to be used for.
> 2. Introducing a new FS specific ioctl is a whole lot less controversial
> than introducing a new FS specific syscall.
> >
> >1. dbfile_load_hashes
> >2. find_all_dupes
> >3. dedupe_results
> >-> call the following in N threads:
> >>dedupe_extent_list
> >>>list_for_each_entry
> >>>>add_extent_to_dedupe #produce a simple list/queue
> >>>>dedupe_extents
> >>>>>btrfs_extent_same
> >>>>>>BTRFS_IOC_FILE_EXTENT_SAME
> >
> >So if this right, one of this thinks is realy slow:
> >
> >1. byte-per-byte comparison
> There's no way that this part can't be slow.  You need to load the data into
> the registers to do the comparison, you can't just point something at RAM
> and get an answer.  On x86, this in turn means that the comparison amounts
> to a loop of 2 loads followed by a compare and a branch for , repeated once
> for each range beyond the first, and that's assuming that the compiler
> optimizes it to the greatest degree possible.  On some other systems the
> compare and branch are one instruction, on others the second load might be
> eliminated, but overall it's not something that can be sped up all that
> much.

On cheap amd64 machines this can be done at gigabytes per second.  Not much
gain from optimizing this.

> >2. sets up the reflinks
> This actually is not as efficient as it sounds like it should be, adding
> reflinks means updating metadata, which means that there is some unavoidable
> overhead here.  I doubt that it's where the issue is, but I may be wrong.

Most of the time spent here is spent waiting for IO.  extent-same seems to
imply fsync() with all the performance cost thereof.

> >3. unlocks the new extent
> There's one other aspect not listed here, locking the original extents,
> which can actually add quite a lot of overhead if the files are actually
> being used.
> >
> >If i'm not wrong with my understanding of the duperemove source code,
> >this behaivor should also affected the online dedupe feature on with
> >Qu Wenruo works.
> AFAIK, that uses a different code path from the batch deduplication ioctl.
> It also doesn't have the context switches and other overhead from an ioctl
> involved, because it's done in kernel code.

No difference there--the extent-same ioctl is all kernel code too.

> >2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> >>On 2017-01-03 15:20, Peter Becker wrote:
> >>>
> >>>I think i understand. The resulting keyquestion is, how i can improve
> >>>the performance of extend_same ioctl.
> >>>I tested it with following results:
> >>>
> >>>enviorment:
> >>>2 files, called "file", size each 100GB, duperemove nofiemap-options
> >>>set, 1MB extend size.
> >>>
> >>>duperemove output:
> >>>[0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
> >>>[0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
> >>>[0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
> >>>[0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
> >>>1.0M), "/mnt/new/file"
> >>>
> >>>iotop output for a 30 sec. sample
> >>>
> >>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >>>               22,31    0,00   13,83        33,81    0,00       30,05
> >>>
> >>>Device:         rrqm/s   wrqm/s     r/s            w/s    rkB/s
> >>>wkB/s    avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> >>>sdd               0,00     1,70          1149,93    0,73  4600,53
> >>>139,60    8,24       0,23          0,20    0,19   13,64      0,19
> >>>21,84
> >>>sde               0,00     0,00          1149,33    0,53  4597,33
> >>>23,87     8,04       0,20          0,18    0,18    1,75       0,18
> >>>20,47
> >>>sdf                0,00     1,70          1149,60    0,63  4598,40
> >>>139,60    8,24       0,21          0,18    0,18    4,63       0,18
> >>>20,63
> >>>sdh               0,00     0,00          1149,33    0,53  4597,33
> >>>23,87     8,04       0,21          0,18    0,18    4,25       0,18
> >>>20,85
> >>>
> >>>resulting in less then 18MB/s read. realy slow.
> >>>
> >>>Querstion 1: why, so slow?
> >>
> >>For a couple of reasons.  First, you have to understand that duperemove
> >>itself actually does a pretty large amount of processing outside of the call
> >>to the ioctl.  It first hashes the blocks for quicker comparison and
> >>matching, then figures out which blocks match, and finally calls the ioctl
> >>on the resulting matches.  The reason for this behavior is that the ioctl is
> >>insanely slow.  It first locks the ranges passed in (so they don't get
> >>changed by anything else during the deduplication process), then does a
> >>byte-by-byte comparison to make sure they all actually do match (data
> >>safety, I've said at least once before that I think there should be a flag
> >>for the ioctl (or a separate ioctl) to skip this and assume that userspace
> >>really knows what it's doing), then finally sets up the reflinks, and
> >>unlocks the new extent.

I've never found the ioctl's lock/read/compare operation taking any more
than 1.0% of the time.  A dedup agent may spend 20% of its time running
the extent-same ioctl (with the other 80% being metadata traversal
and block reads/hashes).  Within the extent-same ioctl the "set up the
reflinks" step is 99% of the time, often more.

The ioctl can read from the cache, so if userspace reads all
the data immediately before calling the extent-same ioctl then the
read/lock/compare component of the ioctl is negligible.
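
A minimal sketch of that pattern (warm_range() is a hypothetical helper,
not taken from duperemove):

#include <stdlib.h>
#include <unistd.h>

/* Pull a range through the page cache so the ioctl's own
 * lock/read/compare step finds the data already cached. */
static void warm_range(int fd, off_t off, size_t len)
{
	char *buf = malloc(len);

	if (!buf)
		return;
	/* the data is thrown away; only the caching side effect matters */
	(void)pread(fd, buf, len, off);
	free(buf);
}

Calling that on the source and every destination range right before
issuing the extent-same ioctl on the same offsets keeps the in-ioctl
read cost negligible.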

> >>All of this ties into why I keep telling people that efficient deduplication
> >>requires a tool that understands how the data being deduplicated is
> >>structured.  By avoiding the need to hash and compare every block of data,
> >>you can significantly improve the time that part takes, and quite often this
> >>will mitigate the impact of getting a few false positives passed into the
> >>ioctl.
> >>>
> >>>Questiont 2a: would be a higher extend-size perform better?

Most of the overhead of deleting shared extents depends on the number
of extents, so using larger extents is helpful up to the maximum extent
size; however, the extent-same ioctl cannot make extents larger.  If the
source file was fragmented to begin with then using larger sizes in the
extent-same ioctl will not have any effect.

Given a choice of extents to deduplicate, it helps to choose the larger
extents to keep if possible (i.e. delete/replace the smaller extents
with references to the larger ones); however, the tradeoff is that this
only allows dedup along existing extent boundaries (i.e. no files over
a few MB in size will ever be deduped).
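
For example, a hypothetical FIEMAP-based helper (a sketch, not code
taken from duperemove) can report the size of the extent covering a
given offset, so the caller can keep the file with the larger extent
as the source and dedupe the others against it:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, struct fiemap_extent */

/* Return the length of the on-disk extent covering @off in @fd,
 * or 0 if nothing is mapped there. */
static unsigned long long extent_len_at(int fd, unsigned long long off)
{
	struct {
		struct fiemap fm;
		struct fiemap_extent fe;	/* storage for fm_extents[0] */
	} q;

	memset(&q, 0, sizeof(q));
	q.fm.fm_start = off;
	q.fm.fm_length = 1;		/* just the extent covering @off */
	q.fm.fm_flags = FIEMAP_FLAG_SYNC;
	q.fm.fm_extent_count = 1;

	if (ioctl(fd, FS_IOC_FIEMAP, &q.fm) < 0 ||
	    q.fm.fm_mapped_extents == 0)
		return 0;
	return q.fm.fm_extents[0].fe_length;
}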

> >>>Querstion 2b: or did i understand something wrong?
> >>
> >>No, a larger extent would probably not help much, and that's actually a
> >>really good performance sample you've created.
> >>
> >>The block size does end up being somewhat of a trade-off.  Ideally, you want
> >>it matched to the smallest possible chunk of duplicate data greater than or
> >>equal to the filesystem block size for maximal space efficiency.  Doing this
> >>however makes the extra processing done by duperemove take exponentially
> >>longer because it has to calculate hashes for more blocks (this has very low
> >>impact until you get to very small block sizes), and has to make
> >>exponentially more comparisons (this has a very big impact as you shrink the
> >>block size, just halving the block size will roughly quadruple the time it
> >>takes to make the comparisons).
> >>
> >>>
> >>>2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> >>>>
> >>>>On 2017-01-03 14:21, Peter Becker wrote:
> >>>>>
> >>>>>
> >>>>>All invocations are justified, but not relevant in (offline) backup
> >>>>>and archive scenarios.
> >>>>>
> >>>>>For example you have multiple version of append-only log-files or
> >>>>>append-only db-files (each more then 100GB in size), like this:
> >>>>>
> >>>>>>Snapshot_01_01_2017
> >>>>>
> >>>>>
> >>>>>-> file1.log .. 201 GB
> >>>>>
> >>>>>>Snapshot_02_01_2017
> >>>>>
> >>>>>
> >>>>>-> file1.log .. 205 GB
> >>>>>
> >>>>>>Snapshot_03_01_2017
> >>>>>
> >>>>>
> >>>>>-> file1.log .. 221 GB
> >>>>>
> >>>>>The first 201 GB would be every time the same.
> >>>>>Files a copied at night from windows, linux or bsd systems and
> >>>>>snapshoted after copy.
> >>>>>
> >>>>>So a fast way to dedupe this is needed. Using 128KB blocks would
> >>>>>result in 1646592 extends per Snapshot. 1MB blocksize results in
> >>>>>205.824 extends (not bad, but still terrible speed).
> >>>>>I will test it at night with a patched version of duperemove with
> >>>>>100MB blocksize, but I have no hope that the throughput increases
> >>>>>thereby.
> >>>>
> >>>>
> >>>>Deduplication is not a general purpose thing (usually you have very
> >>>>specifically structured data), but duperemove is supposed to be a general
> >>>>purpose tool.  It works fine for two of the most common cases
> >>>>(deduplicating
> >>>>large numbers of small files or small numbers of large files), but it is
> >>>>sub-optimal for those cases, and will be for almost any other case.  This
> >>>>is
> >>>>a canonical example though of a case where you can use a custom script or
> >>>>program to figure out what's duplicated and then have that just call the
> >>>>ioctl as appropriate itself.  Most cases where you are storing some kind
> >>>>of
> >>>>well structured data fall into this category.  In fact, both of the cases
> >>>>where I use deduplication myself fall into such a category.  One case
> >>>>involves multiple directories that are partial copies of a larger tree
> >>>>which
> >>>>are kept in sync with the larger tree and each other.  In that particular
> >>>>case, I care about whole file deduplication, so I have a script that just
> >>>>matches on path relative to the roots of each copy and the master copy,
> >>>>verifies that the mtime and size are the same, and if so calls the ioctl
> >>>>for
> >>>>deduplication (with some fancy processing to fit within the max size
> >>>>supported by the ioctl and prevent tiny tail extents).  The other case is
> >>>>a
> >>>>set of archives with a pretty solid fixed structure to them.  In that
> >>>>case,
> >>>>I have a different script that knows enough about the file structure to
> >>>>know
> >>>>where to look for duplicate blocks, thus avoiding having to hash the
> >>>>whole
> >>>>files.
> >>>>
> >>>>The append-only log/database case fits this type of thing perfectly, for
> >>>>each subsequent file, you know already that (most of) the file up to the
> >>>>length of the previous file is duplicated, so you can just split that
> >>>>however you want into chunks and pass those to the dedupe ioctl and free
> >>>>up
> >>>>most of the space that would be used by the new file.  You can then run
> >>>>duperemove with a hash-file to process any new blocks beyond the point
> >>>>you
> >>>>deduplicated up to to reclaim any excess space (currently this will
> >>>>process
> >>>>the whole file, but it should see that most of it is deduplicated
> >>>>already).
> >>>>
> >>>>>
> >>>>>For backup and archive scenarios the checksum-feature and the
> >>>>>dub-data/metadata-feature of btrfs is realy nice. In particular if one
> >>>>>considers the 7 years legally prescribed storage time.
> >>>>>
> >>>>>2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> >>>>>>
> >>>>>>
> >>>>>>On 2016-12-30 15:28, Peter Becker wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>Hello, i have a 8 TB volume with multiple files with hundreds of GB
> >>>>>>>each.
> >>>>>>>I try to dedupe this because the first hundred GB of many files are
> >>>>>>>identical.
> >>>>>>>With 128KB blocksize with nofiemap and lookup-extends=no option, will
> >>>>>>>take more then a week (only dedupe, previously hashed). So i tryed -b
> >>>>>>>100M but this returned me an error: "Blocksize is bounded ...".
> >>>>>>>
> >>>>>>>The reason is that the blocksize is limit to
> >>>>>>>
> >>>>>>>#define MAX_BLOCKSIZE (1024U*1024)
> >>>>>>>
> >>>>>>>But i can't found any description why.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>Beyond what Xin mentioned (namely that 1MB is a much larger block than
> >>>>>>will
> >>>>>>be duplicated in most data-sets), there are a couple of other reasons:
> >>>>>>1. Smaller blocks will actually get you better deduplication on average
> >>>>>>because they're more likely to match.  As an example, assume you have 2
> >>>>>>files with the same 8 4k blocks in different orders:
> >>>>>>  FileA: 1 2 3 4 5 6 7 8
> >>>>>>  FileB: 7 8 5 6 3 4 1 2
> >>>>>>In such a case, deduplicating at any block size above 8k would result
> >>>>>>in
> >>>>>>zero deduplication between these files, while 8k or less would
> >>>>>>completely
> >>>>>>deduplicate them.  This is of course a highly specific and somewhat
> >>>>>>contrived example (in most cases it will be scattered duplicate blocks
> >>>>>>over
> >>>>>>dozens of files), but it does convey this specific point.
> >>>>>>2. The kernel will do a byte-wise comparison of all ranges you pass
> >>>>>>into
> >>>>>>the
> >>>>>>ioctl at the same time.  Larger block sizes here mean that:
> >>>>>>        a) The extents will be locked longer, which will prevent any
> >>>>>>I/O
> >>>>>>to
> >>>>>>the files being deduplicated for the duration of the comparison, which
> >>>>>>may
> >>>>>>in turn cause other issues on the system.
> >>>>>>        b) The deduplication process will be stuck in uninterruptible
> >>>>>>sleep
> >>>>>>longer, which on many systems will trigger hung task detection, which
> >>>>>>will
> >>>>>>in turn either spam the system log or panic the system depending on how
> >>>>>>it's
> >>>>>>configured.
> >>>>>>
> >>>>
> >>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-09  1:09               ` Zygo Blaxell
@ 2017-01-09  9:29                 ` Peter Becker
  2017-01-10  4:12                   ` Zygo Blaxell
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Becker @ 2017-01-09  9:29 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Austin S. Hemmelgarn, Qu Wenruo, linux-btrfs

2017-01-09 2:09 GMT+01:00 Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> On Wed, Jan 04, 2017 at 07:58:55AM -0500, Austin S. Hemmelgarn wrote:
>> On 2017-01-03 16:35, Peter Becker wrote:
>> >As i understand the duperemove source-code right (i work on/ try to
>> >improve this code since 5 or 6 weeks on multiple parts), duperemove
>> >does hashing and calculation before they call extend_same.
>> >Duperemove stores all in a hashfile and read this. after all files
>> >hashed, and duplicates detected, the progress all in order without
>> >reading new data form disk / hashfile. so the byte-by-byte comparison
>> >of extend_same ioctl should consume the full possible bandwidth of the
>> >disks.
>> Not necessarily.  You've actually got a significant amount of processing
>> between each disk operation.  General ordering inside the ioctl is:
>> 1. Do generic ioctl setup.
>> 2. Lock the extents.
>> 3. Read the ranges into memory.
>> 4. Compare the ranges.
>> 5. If the ranges are identical, write out the changes needed to reflink
>> them.
>> 6. Unlock all the extents.
>> 7. Do generic ioctl cleanup.
>> 1 and 7 in particular are pretty heavy.  Ioctls were not intended to be
>> called with this kind of frequency, and that fact really shows in the setup
>> and teardown (overhead is way higher than a syscall).
>
> Steps 1 and 7 are not heavy at all.  ioctl setup is an order of magnitude
> higher than other system calls, but still up to 11 orders of magnitude
> faster than the other steps.  The other steps are *slow*, and step 5
> is orders of magnitude slower than all the others combined.
>
> Most of the time in step 5 is spent deleting the dst extent refs
> (or waiting for transaction commits, but everything waits for those).
> It gets worse when you have big files (1G and larger), more extents,
> and more extent references in the same inode.  On a 100G file the overhead
> of manipulating shared extent refs is so large that the rest of the
> extent-same ioctl is just noise by comparison (microseconds vs minutes).
>
> The commit 1d57ee9 "btrfs: improve delayed refs iterations" (merged in
> v4.10-rc1) helps a bit with this, but deleting shared refs is still
> one of the most expensive things you can do in btrfs.

Would it not be better to bulk insert all extent refs and bulk delete
the dst extent refs?

I am currently looking intensively into the btrfs implementation and
the extent_same ioctl, but I am still at a very early stage of planning
(understanding the code and planning the change). I'm not sure whether
this task is feasible for a newbie to btrfs source/development.

I plan to introduce a new extent_same ioctl, called
btrfs_extent_same_v2, which accepts an array of the parameters of the
current btrfs_extent_same function (struct inode *src, u64 loff,
u64 olen, struct inode *dst, u64 dst_loff).

This would reduce the number of ioctl calls and allow bulk inserts,
bulk removes, and so on. Additionally, the new version should
introduce an options parameter, for example to bypass the byte-wise
comparison if the caller really knows what they are doing.

All suggestions and hints for this are very welcome.
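
To make the idea a bit more concrete, a purely illustrative sketch of
what the batched arguments could look like (none of this exists yet;
the names and layout are only a first idea):

#include <linux/types.h>

struct btrfs_ioctl_same_range_v2 {
	__s64 src_fd;		/* in - source file */
	__u64 src_offset;	/* in - start of the range in the source */
	__u64 length;		/* in - length of the range */
	__s64 dst_fd;		/* in - destination file */
	__u64 dst_offset;	/* in - start of the range in the destination */
	__u64 bytes_deduped;	/* out */
	__s32 status;		/* out - 0, an error, or "data differs" */
	__u32 reserved;
};

struct btrfs_ioctl_same_args_v2 {
	__u64 flags;		/* in - e.g. a flag to bypass the byte-wise cmp */
	__u32 range_count;	/* in - number of elements in ranges[] */
	__u32 reserved;
	struct btrfs_ioctl_same_range_v2 ranges[0];
};

The kernel side could then sort the ranges, batch the extent ref
insertions and deletions per transaction, and report a per-range
status back to userspace.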

>> The operation ended
>> up being an ioctl instead of a syscall (or extension to another syscall)
>> because:
>> 1. Manipulating low-level filesystem state is part of what they're intended
>> to be used for.
>> 2. Introducing a new FS specific ioctl is a whole lot less controversial
>> than introducing a new FS specific syscall.
>> >
>> >1. dbfile_load_hashes
>> >2. find_all_dupes
>> >3. dedupe_results
>> >-> call the following in N threads:
>> >>dedupe_extent_list
>> >>>list_for_each_entry
>> >>>>add_extent_to_dedupe #produce a simple list/queue
>> >>>>dedupe_extents
>> >>>>>btrfs_extent_same
>> >>>>>>BTRFS_IOC_FILE_EXTENT_SAME
>> >
>> >So if this right, one of this thinks is realy slow:
>> >
>> >1. byte-per-byte comparison
>> There's no way that this part can't be slow.  You need to load the data into
>> the registers to do the comparison, you can't just point something at RAM
>> and get an answer.  On x86, this in turn means that the comparison amounts
>> to a loop of 2 loads followed by a compare and a branch for , repeated once
>> for each range beyond the first, and that's assuming that the compiler
>> optimizes it to the greatest degree possible.  On some other systems the
>> compare and branch are one instruction, on others the second load might be
>> eliminated, but overall it's not something that can be sped up all that
>> much.
>
> On cheap amd64 machines this can be done at gigabytes per second.  Not much
> gain from optimizing this.
>
>> >2. sets up the reflinks
>> This actually is not as efficient as it sounds like it should be, adding
>> reflinks means updating metadata, which means that there is some unavoidable
>> overhead here.  I doubt that it's where the issue is, but I may be wrong.
>
> Most of the time spent here is spent waiting for IO.  extent-same seems to
> imply fsync() with all the performance cost thereof.
>
>> >3. unlocks the new extent
>> There's one other aspect not listed here, locking the original extents,
>> which can actually add quite a lot of overhead if the files are actually
>> being used.
>> >
>> >If i'm not wrong with my understanding of the duperemove source code,
>> >this behaivor should also affected the online dedupe feature on with
>> >Qu Wenruo works.
>> AFAIK, that uses a different code path from the batch deduplication ioctl.
>> It also doesn't have the context switches and other overhead from an ioctl
>> involved, because it's done in kernel code.
>
> No difference there--the extent-same ioctl is all kernel code too.
>
>> >2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> >>On 2017-01-03 15:20, Peter Becker wrote:
>> >>>
>> >>>I think i understand. The resulting keyquestion is, how i can improve
>> >>>the performance of extend_same ioctl.
>> >>>I tested it with following results:
>> >>>
>> >>>enviorment:
>> >>>2 files, called "file", size each 100GB, duperemove nofiemap-options
>> >>>set, 1MB extend size.
>> >>>
>> >>>duperemove output:
>> >>>[0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
>> >>>[0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
>> >>>[0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
>> >>>[0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
>> >>>1.0M), "/mnt/new/file"
>> >>>
>> >>>iotop output for a 30 sec. sample
>> >>>
>> >>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >>>               22,31    0,00   13,83        33,81    0,00       30,05
>> >>>
>> >>>Device:         rrqm/s   wrqm/s     r/s            w/s    rkB/s
>> >>>wkB/s    avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> >>>sdd               0,00     1,70          1149,93    0,73  4600,53
>> >>>139,60    8,24       0,23          0,20    0,19   13,64      0,19
>> >>>21,84
>> >>>sde               0,00     0,00          1149,33    0,53  4597,33
>> >>>23,87     8,04       0,20          0,18    0,18    1,75       0,18
>> >>>20,47
>> >>>sdf                0,00     1,70          1149,60    0,63  4598,40
>> >>>139,60    8,24       0,21          0,18    0,18    4,63       0,18
>> >>>20,63
>> >>>sdh               0,00     0,00          1149,33    0,53  4597,33
>> >>>23,87     8,04       0,21          0,18    0,18    4,25       0,18
>> >>>20,85
>> >>>
>> >>>resulting in less then 18MB/s read. realy slow.
>> >>>
>> >>>Querstion 1: why, so slow?
>> >>
>> >>For a couple of reasons.  First, you have to understand that duperemove
>> >>itself actually does a pretty large amount of processing outside of the call
>> >>to the ioctl.  It first hashes the blocks for quicker comparison and
>> >>matching, then figures out which blocks match, and finally calls the ioctl
>> >>on the resulting matches.  The reason for this behavior is that the ioctl is
>> >>insanely slow.  It first locks the ranges passed in (so they don't get
>> >>changed by anything else during the deduplication process), then does a
>> >>byte-by-byte comparison to make sure they all actually do match (data
>> >>safety, I've said at least once before that I think there should be a flag
>> >>for the ioctl (or a separate ioctl) to skip this and assume that userspace
>> >>really knows what it's doing), then finally sets up the reflinks, and
>> >>unlocks the new extent.
>
> I've never found the ioctl's lock/read/compare operation taking any more
> than 1.0% of the time.  A dedup agent may spend 20% of its time running
> the extent-same ioctl (with the other 80% being metadata traversal
> and block reads/hashes).  Within the extent-same ioctl the "set up the
> reflinks" step is 99% of the time, often more.
>
> The ioctl can read from the cache, so if userspace reads all
> the data immediately before calling the extent-same ioctl then the
> read/lock/compare component of the ioctl is negligible.
>
>> >>All of this ties into why I keep telling people that efficient deduplication
>> >>requires a tool that understands how the data being deduplicated is
>> >>structured.  By avoiding the need to hash and compare every block of data,
>> >>you can significantly improve the time that part takes, and quite often this
>> >>will mitigate the impact of getting a few false positives passed into the
>> >>ioctl.
>> >>>
>> >>>Questiont 2a: would be a higher extend-size perform better?
>
> Most of the overhead of deleting shared extents depends on the number
> of extents, so using larger extents is helpful up to the maximum extent
> size; however, the extent-same ioctl cannot make extents larger.  If the
> source file was fragmented to begin with then using larger sizes in the
> extent-same ioctl will not have any effect.
>
> Given a choice of extents to deduplicate, it helps to choose the larger
> extents to keep if possible (i.e. delete/replace the smaller extents
> with references to the larger ones); however, the tradeoff is that this
> only allows dedup along existing extent boundaries (i.e. no files over
> a few MB in size will ever be deduped).
>
>> >>>Querstion 2b: or did i understand something wrong?
>> >>
>> >>No, a larger extent would probably not help much, and that's actually a
>> >>really good performance sample you've created.
>> >>
>> >>The block size does end up being somewhat of a trade-off.  Ideally, you want
>> >>it matched to the smallest possible chunk of duplicate data greater than or
>> >>equal to the filesystem block size for maximal space efficiency.  Doing this
>> >>however makes the extra processing done by duperemove take exponentially
>> >>longer because it has to calculate hashes for more blocks (this has very low
>> >>impact until you get to very small block sizes), and has to make
>> >>exponentially more comparisons (this has a very big impact as you shrink the
>> >>block size, just halving the block size will roughly quadruple the time it
>> >>takes to make the comparisons).
>> >>
>> >>>
>> >>>2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> >>>>
>> >>>>On 2017-01-03 14:21, Peter Becker wrote:
>> >>>>>
>> >>>>>
>> >>>>>All invocations are justified, but not relevant in (offline) backup
>> >>>>>and archive scenarios.
>> >>>>>
>> >>>>>For example you have multiple version of append-only log-files or
>> >>>>>append-only db-files (each more then 100GB in size), like this:
>> >>>>>
>> >>>>>>Snapshot_01_01_2017
>> >>>>>
>> >>>>>
>> >>>>>-> file1.log .. 201 GB
>> >>>>>
>> >>>>>>Snapshot_02_01_2017
>> >>>>>
>> >>>>>
>> >>>>>-> file1.log .. 205 GB
>> >>>>>
>> >>>>>>Snapshot_03_01_2017
>> >>>>>
>> >>>>>
>> >>>>>-> file1.log .. 221 GB
>> >>>>>
>> >>>>>The first 201 GB would be every time the same.
>> >>>>>Files a copied at night from windows, linux or bsd systems and
>> >>>>>snapshoted after copy.
>> >>>>>
>> >>>>>So a fast way to dedupe this is needed. Using 128KB blocks would
>> >>>>>result in 1646592 extends per Snapshot. 1MB blocksize results in
>> >>>>>205.824 extends (not bad, but still terrible speed).
>> >>>>>I will test it at night with a patched version of duperemove with
>> >>>>>100MB blocksize, but I have no hope that the throughput increases
>> >>>>>thereby.
>> >>>>
>> >>>>
>> >>>>Deduplication is not a general purpose thing (usually you have very
>> >>>>specifically structured data), but duperemove is supposed to be a general
>> >>>>purpose tool.  It works fine for two of the most common cases
>> >>>>(deduplicating
>> >>>>large numbers of small files or small numbers of large files), but it is
>> >>>>sub-optimal for those cases, and will be for almost any other case.  This
>> >>>>is
>> >>>>a canonical example though of a case where you can use a custom script or
>> >>>>program to figure out what's duplicated and then have that just call the
>> >>>>ioctl as appropriate itself.  Most cases where you are storing some kind
>> >>>>of
>> >>>>well structured data fall into this category.  In fact, both of the cases
>> >>>>where I use deduplication myself fall into such a category.  One case
>> >>>>involves multiple directories that are partial copies of a larger tree
>> >>>>which
>> >>>>are kept in sync with the larger tree and each other.  In that particular
>> >>>>case, I care about whole file deduplication, so I have a script that just
>> >>>>matches on path relative to the roots of each copy and the master copy,
>> >>>>verifies that the mtime and size are the same, and if so calls the ioctl
>> >>>>for
>> >>>>deduplication (with some fancy processing to fit within the max size
>> >>>>supported by the ioctl and prevent tiny tail extents).  The other case is
>> >>>>a
>> >>>>set of archives with a pretty solid fixed structure to them.  In that
>> >>>>case,
>> >>>>I have a different script that knows enough about the file structure to
>> >>>>know
>> >>>>where to look for duplicate blocks, thus avoiding having to hash the
>> >>>>whole
>> >>>>files.
>> >>>>
>> >>>>The append-only log/database case fits this type of thing perfectly, for
>> >>>>each subsequent file, you know already that (most of) the file up to the
>> >>>>length of the previous file is duplicated, so you can just split that
>> >>>>however you want into chunks and pass those to the dedupe ioctl and free
>> >>>>up
>> >>>>most of the space that would be used by the new file.  You can then run
>> >>>>duperemove with a hash-file to process any new blocks beyond the point
>> >>>>you
>> >>>>deduplicated up to to reclaim any excess space (currently this will
>> >>>>process
>> >>>>the whole file, but it should see that most of it is deduplicated
>> >>>>already).
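
To make the chunked-prefix approach concrete, here is a minimal userspace sketch (not duperemove's code, and not a tested tool) that walks the known-duplicate prefix of an older and a newer snapshot file in fixed-size pieces and hands each piece to BTRFS_IOC_FILE_EXTENT_SAME. The snapshot paths and the 1MB chunk size are invented for illustration; it assumes the struct definitions shipped in linux/btrfs.h.

/* Sketch: dedupe the common prefix of two append-only files in 1MB chunks.
 * Error handling is minimal; the uneven tail is left for duperemove. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/btrfs.h>

#define CHUNK (1024ULL * 1024)  /* 1MB, the current MAX_BLOCKSIZE */

int main(void)
{
    int old_fd = open("/mnt/Snapshot_01_01_2017/file1.log", O_RDONLY);
    int new_fd = open("/mnt/Snapshot_02_01_2017/file1.log", O_RDWR);
    struct stat st;

    if (old_fd < 0 || new_fd < 0 || fstat(old_fd, &st) < 0) {
        perror("setup");
        return 1;
    }

    /* One source range plus a single destination entry per call. */
    struct btrfs_ioctl_same_args *args =
        calloc(1, sizeof(*args) + sizeof(struct btrfs_ioctl_same_extent_info));
    args->dest_count = 1;
    args->info[0].fd = new_fd;

    /* The older file's length bounds the known-duplicate prefix. */
    for (unsigned long long off = 0;
         off + CHUNK <= (unsigned long long)st.st_size; off += CHUNK) {
        args->logical_offset = off;          /* offset in the older file */
        args->length = CHUNK;
        args->info[0].logical_offset = off;  /* same offset in the newer file */
        if (ioctl(old_fd, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0 ||
            args->info[0].status != 0)
            break;  /* error or first mismatching chunk: the rest is new data */
    }
    free(args);
    return 0;
}
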
>> >>>>
>> >>>>>
>> >>>>>For backup and archive scenarios, the checksum feature and the
>> >>>>>dup-data/metadata feature of btrfs are really nice, in particular if one
>> >>>>>considers the legally prescribed 7-year retention period.
>> >>>>>
>> >>>>>2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> >>>>>>
>> >>>>>>
>> >>>>>>On 2016-12-30 15:28, Peter Becker wrote:
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>Hello, i have a 8 TB volume with multiple files with hundreds of GB
>> >>>>>>>each.
>> >>>>>>>I try to dedupe this because the first hundred GB of many files are
>> >>>>>>>identical.
>> >>>>>>>With 128KB blocksize with nofiemap and lookup-extends=no option, will
>> >>>>>>>take more then a week (only dedupe, previously hashed). So i tryed -b
>> >>>>>>>100M but this returned me an error: "Blocksize is bounded ...".
>> >>>>>>>
>> >>>>>>>The reason is that the blocksize is limit to
>> >>>>>>>
>> >>>>>>>#define MAX_BLOCKSIZE (1024U*1024)
>> >>>>>>>
>> >>>>>>>But i can't found any description why.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>Beyond what Xin mentioned (namely that 1MB is a much larger block than
>> >>>>>>will
>> >>>>>>be duplicated in most data-sets), there are a couple of other reasons:
>> >>>>>>1. Smaller blocks will actually get you better deduplication on average
>> >>>>>>because they're more likely to match.  As an example, assume you have 2
>> >>>>>>files with the same 8 4k blocks in different orders:
>> >>>>>>  FileA: 1 2 3 4 5 6 7 8
>> >>>>>>  FileB: 7 8 5 6 3 4 1 2
>> >>>>>>In such a case, deduplicating at any block size above 8k would result
>> >>>>>>in
>> >>>>>>zero deduplication between these files, while 8k or less would
>> >>>>>>completely
>> >>>>>>deduplicate them.  This is of course a highly specific and somewhat
>> >>>>>>contrived example (in most cases it will be scattered duplicate blocks
>> >>>>>>over
>> >>>>>>dozens of files), but it does convey this specific point.
>> >>>>>>2. The kernel will do a byte-wise comparison of all ranges you pass
>> >>>>>>into
>> >>>>>>the
>> >>>>>>ioctl at the same time.  Larger block sizes here mean that:
>> >>>>>>        a) The extents will be locked longer, which will prevent any
>> >>>>>>I/O
>> >>>>>>to
>> >>>>>>the files being deduplicated for the duration of the comparison, which
>> >>>>>>may
>> >>>>>>in turn cause other issues on the system.
>> >>>>>>        b) The deduplication process will be stuck in uninterruptible
>> >>>>>>sleep
>> >>>>>>longer, which on many systems will trigger hung task detection, which
>> >>>>>>will
>> >>>>>>in turn either spam the system log or panic the system depending on how
>> >>>>>>it's
>> >>>>>>configured.
>> >>>>>>
>> >>>>
>> >>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
  2017-01-09  9:29                 ` Peter Becker
@ 2017-01-10  4:12                   ` Zygo Blaxell
  0 siblings, 0 replies; 19+ messages in thread
From: Zygo Blaxell @ 2017-01-10  4:12 UTC (permalink / raw)
  To: Peter Becker; +Cc: Austin S. Hemmelgarn, Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 20883 bytes --]

On Mon, Jan 09, 2017 at 10:29:03AM +0100, Peter Becker wrote:
> 2017-01-09 2:09 GMT+01:00 Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > On Wed, Jan 04, 2017 at 07:58:55AM -0500, Austin S. Hemmelgarn wrote:
> >> On 2017-01-03 16:35, Peter Becker wrote:
> >> >As I understand the duperemove source code (I have been working on /
> >> >trying to improve this code for 5 or 6 weeks in multiple areas),
> >> >duperemove does the hashing and calculation before it calls extent_same.
> >> >Duperemove stores everything in a hashfile and reads it back; after all
> >> >files are hashed and duplicates are detected, it processes everything in
> >> >order without reading new data from disk or the hashfile, so the
> >> >byte-by-byte comparison of the extent_same ioctl should consume the full
> >> >possible bandwidth of the disks.
> >> Not necessarily.  You've actually got a significant amount of processing
> >> between each disk operation.  General ordering inside the ioctl is:
> >> 1. Do generic ioctl setup.
> >> 2. Lock the extents.
> >> 3. Read the ranges into memory.
> >> 4. Compare the ranges.
> >> 5. If the ranges are identical, write out the changes needed to reflink
> >> them.
> >> 6. Unlock all the extents.
> >> 7. Do generic ioctl cleanup.
> >> 1 and 7 in particular are pretty heavy.  Ioctls were not intended to be
> >> called with this kind of frequency, and that fact really shows in the setup
> >> and teardown (overhead is way higher than a syscall).
> >
> > Steps 1 and 7 are not heavy at all.  ioctl setup is an order of magnitude
> > higher than other system calls, but still up to 11 orders of magnitude
> > faster than the other steps.  The other steps are *slow*, and step 5
> > is orders of magnitude slower than all the others combined.
> >
> > Most of the time in step 5 is spent deleting the dst extent refs
> > (or waiting for transaction commits, but everything waits for those).
> > It gets worse when you have big files (1G and larger), more extents,
> > and more extent references in the same inode.  On a 100G file the overhead
> > of manipulating shared extent refs is so large that the rest of the
> > extent-same ioctl is just noise by comparison (microseconds vs minutes).
> >
> > The commit 1d57ee9 "btrfs: improve delayed refs iterations" (merged in
> > v4.10-rc1) helps a bit with this, but deleting shared refs is still
> > one of the most expensive things you can do in btrfs.
> 
> Would it not be better to bulk insert all extent refs and bulk delete
> the dst extent refs?

Maybe.  See below.

> I am currently looking intensively into the btrfs implementation and
> the extent_same ioctl, but I am still at a very early stage of planning
> (understanding and planning).
> I'm not sure whether this task would be feasible for a newbie to btrfs
> source/development.

Some aspects would probably challenge experienced btrfs veterans...

> I plan to introduce a new extent_same ioctl that accepts an array of
> the parameters of the current btrfs_extent_same function (struct inode
> *src, u64 loff, u64 olen, struct inode *dst, u64 dst_loff), called
> btrfs_extent_same_v2.

That sounds a lot like the existing FILE_EXTENT_SAME ioctl (it takes one
src argument and an array of dst arguments).  Did you mean to introduce
an array of src arguments as well?  If so, can you explain why that
would be useful?
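
For reference, the existing argument block for that ioctl already looks roughly like the sketch below: one source range described in btrfs_ioctl_same_args, followed by an array of btrfs_ioctl_same_extent_info destinations. This is only an illustrative userspace fragment assuming the definitions in linux/btrfs.h; the file names, offsets and length are made up.

/* Sketch: one source range deduplicated into two destination ranges
 * with a single BTRFS_IOC_FILE_EXTENT_SAME call. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(void)
{
    int src  = open("/mnt/old/file", O_RDONLY);
    int dst1 = open("/mnt/new/file", O_RDWR);
    int dst2 = open("/mnt/new/file2", O_RDWR);   /* hypothetical second copy */

    if (src < 0 || dst1 < 0 || dst2 < 0) {
        perror("open");
        return 1;
    }

    int ndest = 2;
    struct btrfs_ioctl_same_args *args =
        calloc(1, sizeof(*args) + ndest * sizeof(struct btrfs_ioctl_same_extent_info));

    args->logical_offset = 0;            /* source offset */
    args->length = 1024 * 1024;          /* 1MB, the limit discussed in this thread */
    args->dest_count = ndest;
    args->info[0].fd = dst1;
    args->info[0].logical_offset = 0;
    args->info[1].fd = dst2;
    args->info[1].logical_offset = 0;

    if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0) {
        perror("BTRFS_IOC_FILE_EXTENT_SAME");
        return 1;
    }

    /* status 0 means deduped; BTRFS_SAME_DATA_DIFFERS means the bytes differ. */
    for (int i = 0; i < ndest; i++)
        printf("dest %d: status=%d bytes_deduped=%llu\n", i,
               args->info[i].status,
               (unsigned long long)args->info[i].bytes_deduped);
    free(args);
    return 0;
}
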

There is a task potentially accessible to newbies here:  remove the length
limit on FILE_EXTENT_SAME (i.e. make it split large dedup requests into
individual extents transparently).

Another possible enhancement is to replace blocks starting from
loff/dst_loff until either a non-matching block or the end of the src
extent is found, and return the length of the matched data.  This would
reduce the "entire file is duplicate" case to a single ioctl, and avoid
having to preread all the data from userspace to find out where the
mismatches are.

> This would reduce the number of ioctl calls and allow bulk insert, bulk remove, ...

If we're making a wishlist about extent-same, I'd like two new
capabilities for FILE_EXTENT_SAME:

1.  If the src and dst contents are equal, replace *all* references to
dst in one pass.  i.e. not just the specific dst reference (or array
of references) that was given in the argument.  Also handle references
that only partially overlap with dst (easier to do in the kernel than
from userspace).

2.  Combine adjacent dst data references into a single extent where
possible.  e.g. if extent-same is given 16x1MB physically adjacent
blocks at logically adjacent offsets, these should be combined into a
1x16MB extent automatically.  Currently the blocks will be split into
16 distinct extent refs, even if they all ultimately refer to different
subsets of a single original extent.  I hear XFS already does this,
so dedup agents on XFS can just hurl block pairs at the kernel and not
have to worry about the underlying file structure.

It's possible to work around these missing features from userspace,
but it's painfully slow, and unreliable because the btrfs kernel code
doesn't provide quite the right information through the LOGICAL_INO and
SEARCH_V2 ioctls.

> Additionally, the new version should also introduce a new options
> parameter, for example to bypass the compare if the caller really knows
> what they are doing.

That would make extent-same equivalent to the existing CLONE_RANGE ioctl
and/or system call.  That wouldn't be useful by itself, unless it was
combined with some other new functionality.
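
For comparison, here is a hedged sketch of what the existing clone ioctl looks like from userspace: it already reflinks a source range into a destination without any byte-wise equality check, which is why extent-same minus the compare would add nothing new. Paths, offsets and length are placeholders; the struct comes from linux/btrfs.h.

/* Sketch: reflink 16MB from src into dst with BTRFS_IOC_CLONE_RANGE.
 * No comparison is performed; whatever was in dst is simply replaced. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(void)
{
    int src = open("/mnt/old/file", O_RDONLY);
    int dst = open("/mnt/new/file", O_RDWR);

    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    struct btrfs_ioctl_clone_range_args args = {
        .src_fd      = src,
        .src_offset  = 0,
        .src_length  = 16ULL * 1024 * 1024,  /* must be block-aligned */
        .dest_offset = 0,
    };

    /* Called on the destination fd; both files must be on the same btrfs. */
    if (ioctl(dst, BTRFS_IOC_CLONE_RANGE, &args) < 0) {
        perror("BTRFS_IOC_CLONE_RANGE");
        return 1;
    }
    return 0;
}
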

> All suggestions and hints for this are very welcome.
> 
> >> The operation ended
> >> up being an ioctl instead of a syscall (or extension to another syscall)
> >> because:
> >> 1. Manipulating low-level filesystem state is part of what they're intended
> >> to be used for.
> >> 2. Introducing a new FS specific ioctl is a whole lot less controversial
> >> than introducing a new FS specific syscall.
> >> >
> >> >1. dbfile_load_hashes
> >> >2. find_all_dupes
> >> >3. dedupe_results
> >> >-> call the following in N threads:
> >> >>dedupe_extent_list
> >> >>>list_for_each_entry
> >> >>>>add_extent_to_dedupe #produce a simple list/queue
> >> >>>>dedupe_extents
> >> >>>>>btrfs_extent_same
> >> >>>>>>BTRFS_IOC_FILE_EXTENT_SAME
> >> >
> >> >So if this is right, one of these things is really slow:
> >> >
> >> >1. byte-per-byte comparison
> >> There's no way that this part can't be slow.  You need to load the data into
> >> the registers to do the comparison, you can't just point something at RAM
> >> and get an answer.  On x86, this in turn means that the comparison amounts
> >> >>to a loop of 2 loads followed by a compare and a branch, repeated once
> >> for each range beyond the first, and that's assuming that the compiler
> >> optimizes it to the greatest degree possible.  On some other systems the
> >> compare and branch are one instruction, on others the second load might be
> >> eliminated, but overall it's not something that can be sped up all that
> >> much.
> >
> > On cheap amd64 machines this can be done at gigabytes per second.  Not much
> > gain from optimizing this.
> >
> >> >2. sets up the reflinks
> >> This actually is not as efficient as it sounds like it should be, adding
> >> reflinks means updating metadata, which means that there is some unavoidable
> >> overhead here.  I doubt that it's where the issue is, but I may be wrong.
> >
> > Most of the time spent here is spent waiting for IO.  extent-same seems to
> > imply fsync() with all the performance cost thereof.
> >
> >> >3. unlocks the new extent
> >> There's one other aspect not listed here, locking the original extents,
> >> which can actually add quite a lot of overhead if the files are actually
> >> being used.
> >> >
> >> >If I'm not wrong in my understanding of the duperemove source code,
> >> >this behavior should also affect the online dedupe feature that
> >> >Qu Wenruo is working on.
> >> AFAIK, that uses a different code path from the batch deduplication ioctl.
> >> It also doesn't have the context switches and other overhead from an ioctl
> >> involved, because it's done in kernel code.
> >
> > No difference there--the extent-same ioctl is all kernel code too.
> >
> >> >2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> >> >>On 2017-01-03 15:20, Peter Becker wrote:
> >> >>>
> >> >>>I think I understand. The resulting key question is how I can improve
> >> >>>the performance of the extent_same ioctl.
> >> >>>I tested it with the following results:
> >> >>>
> >> >>>Environment:
> >> >>>2 files, each called "file", each 100GB in size, duperemove nofiemap
> >> >>>option set, 1MB extent size.
> >> >>>
> >> >>>duperemove output:
> >> >>>[0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
> >> >>>[0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
> >> >>>[0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
> >> >>>[0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
> >> >>>1.0M), "/mnt/new/file"
> >> >>>
> >> >>>iotop output for a 30 sec. sample
> >> >>>
> >> >>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >> >>>          22,31    0,00   13,83   33,81    0,00   30,05
> >> >>>
> >> >>>Device:  rrqm/s  wrqm/s     r/s   w/s   rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> >> >>>sdd        0,00    1,70 1149,93  0,73 4600,53 139,60     8,24     0,23  0,20    0,19   13,64  0,19 21,84
> >> >>>sde        0,00    0,00 1149,33  0,53 4597,33  23,87     8,04     0,20  0,18    0,18    1,75  0,18 20,47
> >> >>>sdf        0,00    1,70 1149,60  0,63 4598,40 139,60     8,24     0,21  0,18    0,18    4,63  0,18 20,63
> >> >>>sdh        0,00    0,00 1149,33  0,53 4597,33  23,87     8,04     0,21  0,18    0,18    4,25  0,18 20,85
> >> >>>
> >> >>>resulting in less than 18MB/s read. Really slow.
> >> >>>
> >> >>>Question 1: why so slow?
> >> >>
> >> >>For a couple of reasons.  First, you have to understand that duperemove
> >> >>itself actually does a pretty large amount of processing outside of the call
> >> >>to the ioctl.  It first hashes the blocks for quicker comparison and
> >> >>matching, then figures out which blocks match, and finally calls the ioctl
> >> >>on the resulting matches.  The reason for this behavior is that the ioctl is
> >> >>insanely slow.  It first locks the ranges passed in (so they don't get
> >> >>changed by anything else during the deduplication process), then does a
> >> >>byte-by-byte comparison to make sure they all actually do match (data
> >> >>safety, I've said at least once before that I think there should be a flag
> >> >>for the ioctl (or a separate ioctl) to skip this and assume that userspace
> >> >>really knows what it's doing), then finally sets up the reflinks, and
> >> >>unlocks the new extent.
> >
> > I've never found the ioctl's lock/read/compare operation taking any more
> > than 1.0% of the time.  A dedup agent may spend 20% of its time running
> > the extent-same ioctl (with the other 80% being metadata traversal
> > and block reads/hashes).  Within the extent-same ioctl the "set up the
> > reflinks" step is 99% of the time, often more.
> >
> > The ioctl can read from the cache, so if userspace reads all
> > the data immediately before calling the extent-same ioctl then the
> > read/lock/compare component of the ioctl is negligible.
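
As an illustration of that point, a small helper (a sketch, not part of duperemove or any existing tool) can simply pread() a range into a scratch buffer right before the ioctl, so the kernel's internal read and compare are served from the page cache; the 64KB buffer size is arbitrary.

#include <stdlib.h>
#include <unistd.h>

/* Read [offset, offset+len) into a throwaway buffer so the pages are cached
 * when the extent-same ioctl re-reads them for its byte-wise comparison. */
static void warm_cache(int fd, off_t offset, size_t len)
{
    char *buf = malloc(64 * 1024);
    size_t done = 0;

    if (!buf)
        return;
    while (done < len) {
        ssize_t n = pread(fd, buf, 64 * 1024, offset + done);
        if (n <= 0)
            break;
        done += n;
    }
    free(buf);
}

Calling warm_cache() on the source range and on each destination range immediately before BTRFS_IOC_FILE_EXTENT_SAME approximates what a dedup agent gets for free when it has just read and hashed those blocks.
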
> >
> >> >>All of this ties into why I keep telling people that efficient deduplication
> >> >>requires a tool that understands how the data being deduplicated is
> >> >>structured.  By avoiding the need to hash and compare every block of data,
> >> >>you can significantly improve the time that part takes, and quite often this
> >> >>will mitigate the impact of getting a few false positives passed into the
> >> >>ioctl.
> >> >>>
> >> >>>Question 2a: would a higher extent size perform better?
> >
> > Most of the overhead of deleting shared extents depends on the number
> > of extents, so using larger extents is helpful up to the maximum extent
> > size; however, the extent-same ioctl cannot make extents larger.  If the
> > source file was fragmented to begin with then using larger sizes in the
> > extent-same ioctl will not have any effect.
> >
> > Given a choice of extents to deduplicate, it helps to choose the larger
> > extents to keep if possible (i.e. delete/replace the smaller extents
> > with references to the larger ones); however, the tradeoff is that this
> > only allows dedup along existing extent boundaries (i.e. no files over
> > a few MB in size will ever be deduped).
> >
> >> >>>Question 2b: or did I misunderstand something?
> >> >>
> >> >>No, a larger extent would probably not help much, and that's actually a
> >> >>really good performance sample you've created.
> >> >>
> >> >>The block size does end up being somewhat of a trade-off.  Ideally, you want
> >> >>it matched to the smallest possible chunk of duplicate data greater than or
> >> >>equal to the filesystem block size for maximal space efficiency.  Doing this
> >> >>however makes the extra processing done by duperemove take exponentially
> >> >>longer because it has to calculate hashes for more blocks (this has very low
> >> >>impact until you get to very small block sizes), and has to make
> >> >>exponentially more comparisons (this has a very big impact as you shrink the
> >> >>block size, just halving the block size will roughly quadruple the time it
> >> >>takes to make the comparisons).
> >> >>
> >> >>>
> >> >>>2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> >> >>>>
> >> >>>>On 2017-01-03 14:21, Peter Becker wrote:
> >> >>>>>
> >> >>>>>
> >> >>>>>All the objections are justified, but not relevant in (offline) backup
> >> >>>>>and archive scenarios.
> >> >>>>>
> >> >>>>>For example, you have multiple versions of append-only log files or
> >> >>>>>append-only DB files (each more than 100 GB in size), like this:
> >> >>>>>
> >> >>>>>>Snapshot_01_01_2017
> >> >>>>>
> >> >>>>>
> >> >>>>>-> file1.log .. 201 GB
> >> >>>>>
> >> >>>>>>Snapshot_02_01_2017
> >> >>>>>
> >> >>>>>
> >> >>>>>-> file1.log .. 205 GB
> >> >>>>>
> >> >>>>>>Snapshot_03_01_2017
> >> >>>>>
> >> >>>>>
> >> >>>>>-> file1.log .. 221 GB
> >> >>>>>
> >> >>>>>The first 201 GB would be the same every time.
> >> >>>>>Files are copied at night from Windows, Linux or BSD systems and
> >> >>>>>snapshotted after the copy.
> >> >>>>>
> >> >>>>>So a fast way to dedupe this is needed. Using 128KB blocks would
> >> >>>>>result in 1,646,592 extents per snapshot; a 1MB blocksize results in
> >> >>>>>205,824 extents (not bad, but still terrible speed).
> >> >>>>>I will test it at night with a patched version of duperemove and a
> >> >>>>>100MB blocksize, but I have little hope that the throughput will
> >> >>>>>increase.
> >> >>>>
> >> >>>>
> >> >>>>Deduplication is not a general purpose thing (usually you have very
> >> >>>>specifically structured data), but duperemove is supposed to be a general
> >> >>>>purpose tool.  It works fine for two of the most common cases
> >> >>>>(deduplicating
> >> >>>>large numbers of small files or small numbers of large files), but it is
> >> >>>>sub-optimal for those cases, and will be for almost any other case.  This
> >> >>>>is
> >> >>>>a canonical example though of a case where you can use a custom script or
> >> >>>>program to figure out what's duplicated and then have that just call the
> >> >>>>ioctl as appropriate itself.  Most cases where you are storing some kind
> >> >>>>of
> >> >>>>well structured data fall into this category.  In fact, both of the cases
> >> >>>>where I use deduplication myself fall into such a category.  One case
> >> >>>>involves multiple directories that are partial copies of a larger tree
> >> >>>>which
> >> >>>>are kept in sync with the larger tree and each other.  In that particular
> >> >>>>case, I care about whole file deduplication, so I have a script that just
> >> >>>>matches on path relative to the roots of each copy and the master copy,
> >> >>>>verifies that the mtime and size are the same, and if so calls the ioctl
> >> >>>>for
> >> >>>>deduplication (with some fancy processing to fit within the max size
> >> >>>>supported by the ioctl and prevent tiny tail extents).  The other case is
> >> >>>>a
> >> >>>>set of archives with a pretty solid fixed structure to them.  In that
> >> >>>>case,
> >> >>>>I have a different script that knows enough about the file structure to
> >> >>>>know
> >> >>>>where to look for duplicate blocks, thus avoiding having to hash the
> >> >>>>whole
> >> >>>>files.
> >> >>>>
> >> >>>>The append-only log/database case fits this type of thing perfectly, for
> >> >>>>each subsequent file, you know already that (most of) the file up to the
> >> >>>>length of the previous file is duplicated, so you can just split that
> >> >>>>however you want into chunks and pass those to the dedupe ioctl and free
> >> >>>>up
> >> >>>>most of the space that would be used by the new file.  You can then run
> >> >>>>duperemove with a hash-file to process any new blocks beyond the point
> >> >>>>you
> >> >>>>deduplicated up to to reclaim any excess space (currently this will
> >> >>>>process
> >> >>>>the whole file, but it should see that most of it is deduplicated
> >> >>>>already).
> >> >>>>
> >> >>>>>
> >> >>>>>For backup and archive scenarios, the checksum feature and the
> >> >>>>>dup-data/metadata feature of btrfs are really nice, in particular if one
> >> >>>>>considers the legally prescribed 7-year retention period.
> >> >>>>>
> >> >>>>>2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>On 2016-12-30 15:28, Peter Becker wrote:
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>Hello, i have a 8 TB volume with multiple files with hundreds of GB
> >> >>>>>>>each.
> >> >>>>>>>I try to dedupe this because the first hundred GB of many files are
> >> >>>>>>>identical.
> >> >>>>>>>With 128KB blocksize with nofiemap and lookup-extends=no option, will
> >> >>>>>>>take more then a week (only dedupe, previously hashed). So i tryed -b
> >> >>>>>>>100M but this returned me an error: "Blocksize is bounded ...".
> >> >>>>>>>
> >> >>>>>>>The reason is that the blocksize is limit to
> >> >>>>>>>
> >> >>>>>>>#define MAX_BLOCKSIZE (1024U*1024)
> >> >>>>>>>
> >> >>>>>>>But i can't found any description why.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>Beyond what Xin mentioned (namely that 1MB is a much larger block than
> >> >>>>>>will
> >> >>>>>>be duplicated in most data-sets), there are a couple of other reasons:
> >> >>>>>>1. Smaller blocks will actually get you better deduplication on average
> >> >>>>>>because they're more likely to match.  As an example, assume you have 2
> >> >>>>>>files with the same 8 4k blocks in different orders:
> >> >>>>>>  FileA: 1 2 3 4 5 6 7 8
> >> >>>>>>  FileB: 7 8 5 6 3 4 1 2
> >> >>>>>>In such a case, deduplicating at any block size above 8k would result
> >> >>>>>>in
> >> >>>>>>zero deduplication between these files, while 8k or less would
> >> >>>>>>completely
> >> >>>>>>deduplicate them.  This is of course a highly specific and somewhat
> >> >>>>>>contrived example (in most cases it will be scattered duplicate blocks
> >> >>>>>>over
> >> >>>>>>dozens of files), but it does convey this specific point.
> >> >>>>>>2. The kernel will do a byte-wise comparison of all ranges you pass
> >> >>>>>>into
> >> >>>>>>the
> >> >>>>>>ioctl at the same time.  Larger block sizes here mean that:
> >> >>>>>>        a) The extents will be locked longer, which will prevent any
> >> >>>>>>I/O
> >> >>>>>>to
> >> >>>>>>the files being deduplicated for the duration of the comparison, which
> >> >>>>>>may
> >> >>>>>>in turn cause other issues on the system.
> >> >>>>>>        b) The deduplication process will be stuck in uninterruptible
> >> >>>>>>sleep
> >> >>>>>>longer, which on many systems will trigger hung task detection, which
> >> >>>>>>will
> >> >>>>>>in turn either spam the system log or panic the system depending on how
> >> >>>>>>it's
> >> >>>>>>configured.
> >> >>>>>>
> >> >>>>
> >> >>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-01-10  4:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-30 20:28 [markfasheh/duperemove] Why blocksize is limit to 1MB? Peter Becker
2017-01-01  4:38 ` Xin Zhou
2017-01-02 12:32   ` Peter Becker
2017-01-02 19:36     ` Xin Zhou
2017-01-03 12:40 ` Austin S. Hemmelgarn
2017-01-03 19:24   ` Peter Becker
2017-01-03 23:07     ` Hans van Kranenburg
2017-01-03 23:12       ` Peter Becker
2017-01-03 23:43         ` Hans van Kranenburg
2017-01-04  0:08           ` Martin Raiber
     [not found]   ` <CAEtw4r3mUA_4vcS-dbxagQn3NPRh8Cxcz0iF0L7jHwv5c9Ui+g@mail.gmail.com>
     [not found]     ` <7b0c897f-844c-e7f4-0ce7-c9f888b95983@gmail.com>
2017-01-03 20:20       ` Peter Becker
2017-01-03 20:40         ` Austin S. Hemmelgarn
2017-01-03 21:35           ` Peter Becker
2017-01-04 12:58             ` Austin S. Hemmelgarn
2017-01-04 14:42               ` Peter Becker
2017-01-09  1:09               ` Zygo Blaxell
2017-01-09  9:29                 ` Peter Becker
2017-01-10  4:12                   ` Zygo Blaxell
2017-01-03 20:21       ` Fwd: " Peter Becker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.