* About in-band dedupe for v4.7
@ 2016-05-10  7:19 Qu Wenruo
  2016-05-10 22:11 ` Mark Fasheh
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Qu Wenruo @ 2016-05-10  7:19 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: btrfs

Hi, Chris, Josef and David,

As the merge window for v4.7 is approaching, it would be good to hear your 
thoughts on the in-band dedupe patches.

We are addressing the ENOSPC problem that Josef pointed out, and we 
believe the final fix will come out at the beginning of the merge window 
(next week).


If that is acceptable, would you please consider merging the in-memory 
backend patchset for v4.7 as an experimental feature?


Most of the patches won't change from the v10 patchset; only the ENOSPC 
fix will be updated, and the ioctl patchset will introduce a new Kconfig 
option, "btrfs experimental features", for in-band dedupe (with an 
explanation that the ioctl and on-disk format of experimental features 
are unstable).


If you are all OK with merging the in-band dedupe in-memory backend, I'll 
prepare the new v11 patchset for this merge.

Thanks,
Qu




* Re: About in-band dedupe for v4.7
  2016-05-10  7:19 About in-band dedupe for v4.7 Qu Wenruo
@ 2016-05-10 22:11 ` Mark Fasheh
  2016-05-11  1:03   ` Qu Wenruo
                     ` (2 more replies)
  2016-05-11  0:37 ` Chris Mason
  2016-05-11 16:39 ` David Sterba
  2 siblings, 3 replies; 23+ messages in thread
From: Mark Fasheh @ 2016-05-10 22:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Mason, Josef Bacik, David Sterba, btrfs

On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
> Hi, Chris, Josef and David,
> 
> As merge window for v4.7 is coming, it would be good to hear your
> ideas about the inband dedupe.
> 
> We are addressing the ENOSPC problem which Josef pointed out, and we
> believe the final fix patch would come out at the beginning of the
> merge window.(Next week)

How about the fiemap performance problem you referenced before? My guess is
that it happens because you don't coalesce writes into anything larger than
a page so you're stuck deduping at some silly size like 4k. This in turn
fragments the files so much that fiemap has a hard time walking backrefs.

I have to check the patches to be sure but perhaps you can tell me whether
my hunch is correct or not.


In fact, I actually asked privately for time to review your dedupe patches,
but I've been literally so busy cleaning up after the mess you left in your
last qgroups rewrite I haven't had time.

You literally broke qgroups in almost every spot that matters. In some cases
(drop_snapshot) you tore out working code and left in a /* TODO */ comment
for someone else to complete.  Snapshot create was so trivially and
completely broken by your changes that weeks later, I'm still hunting a
solution which doesn't involve adding an extra _commit_ to our commit.  This
is a MASSIVE regression from where we were before.

IMHO, you should not be trusted with large features or rewrites until you  
can demonstrate:

 - A willingness to *completely* solve the problem you are trying to 'fix',
   not do half the job which someone else will have to complete for you.

 - Actual testing. The snapshot bug I reference above exists purely because
   nobody created a snapshot inside of one and checked the qgroup numbers!  

Sorry to be so harsh.
   --Mark

--
Mark Fasheh


* Re: About in-band dedupe for v4.7
  2016-05-10  7:19 About in-band dedupe for v4.7 Qu Wenruo
  2016-05-10 22:11 ` Mark Fasheh
@ 2016-05-11  0:37 ` Chris Mason
  2016-05-11  1:40   ` Qu Wenruo
  2016-05-11 16:39 ` David Sterba
  2 siblings, 1 reply; 23+ messages in thread
From: Chris Mason @ 2016-05-11  0:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Josef Bacik, David Sterba, btrfs

On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
> Hi, Chris, Josef and David,
> 
> As merge window for v4.7 is coming, it would be good to hear your ideas
> about the inband dedupe.
> 
> We are addressing the ENOSPC problem which Josef pointed out, and we believe
> the final fix patch would come out at the beginning of the merge
> window.(Next week)
> 
> 
> If it's fine, would you please consider to merge the in-memory backend
> patchset for v4.7 as an experimental feature?
> 
> 
> Most of the patch won't be changed from v10 patchset, only ENOSPC fix will
> be updated, and ioctl patchset will introduce a new Kconfig option of "btrfs
> experimental features" for inband dedupe.
> (With explain about unstable ioctl/on-disk format for experimental features)
> 
> 
> If you are all OK to merge inband dedupe in-memory backend, I'll prepare the
> new v11 patchset for this merge.

We have to balance how much we want these features to come in against 
wanting to lower the load on you of continuing to port them.  But I 
really do agree that we need strong test suites included with every 
major feature like this.

-chris


* Re: About in-band dedupe for v4.7
  2016-05-10 22:11 ` Mark Fasheh
@ 2016-05-11  1:03   ` Qu Wenruo
  2016-05-11  2:52     ` Mark Fasheh
  2016-05-11 16:56   ` David Sterba
  2016-05-13  3:13   ` Wang Shilong
  2 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2016-05-11  1:03 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Chris Mason, Josef Bacik, David Sterba, btrfs



Mark Fasheh wrote on 2016/05/10 15:11 -0700:
> On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
>> Hi, Chris, Josef and David,
>>
>> As merge window for v4.7 is coming, it would be good to hear your
>> ideas about the inband dedupe.
>>
>> We are addressing the ENOSPC problem which Josef pointed out, and we
>> believe the final fix patch would come out at the beginning of the
>> merge window.(Next week)
>
> How about the fiemap performance problem you referenced before? My guess is
> that it happens because you don't coalesce writes into anything larger than
> a page so you're stuck deduping at some silly size like 4k. This in turn
> fragments the files so much that fiemap has a hard time walking backrefs.
Nope. The default dedupe size is 128K, and the dedupe size is limited to 
a minimum of 64K and a maximum of 8M.
Yes, it's going to cause fragmentation, but just check the test case I 
submitted: it doesn't need dedupe at all to trigger the bug.
Clone range will also trigger it.

The problem is quite obvious: btrfs_check_shared() is making the wrong 
judgment.

It uses find_parent_nodes() to do the shared check, while it's entirely 
possible for a file extent to be shared with other inodes inside the 
same root.

So our fix changes btrfs_check_shared(): if a file extent is shared even 
within the same inode at different offsets, it is a shared extent and 
btrfs_check_shared() exits early.

The fix is showing quite positive results, and we are just doing the 
final checks. It should be out in the coming days.
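
To illustrate the idea, a minimal, standalone userspace sketch 
(hypothetical types and sample values, not the actual kernel patch): an 
extent is reported as shared as soon as any two of its references 
differ in root, inode, or file offset, even when both belong to the 
same inode.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified view of one backref to a data extent. */
struct extent_ref {
    unsigned long long root;   /* subvolume tree holding the ref */
    unsigned long long inode;  /* inode referencing the extent */
    unsigned long long offset; /* file offset of the reference */
};

/*
 * Sketch of the fixed btrfs_check_shared() judgment: the extent is
 * shared if any two references differ in root, inode, or offset --
 * including two references from the same inode at different offsets
 * (e.g. after an in-file clone or dedupe).  Exit as soon as sharing
 * is proven.
 */
static bool extent_is_shared(const struct extent_ref *refs, int nr)
{
    for (int i = 1; i < nr; i++) {
        if (refs[i].root != refs[0].root ||
            refs[i].inode != refs[0].inode ||
            refs[i].offset != refs[0].offset)
            return true;
    }
    return false;
}

int main(void)
{
    /* Same root, same inode, two different file offsets: shared. */
    struct extent_ref refs[] = {
        { .root = 5, .inode = 257, .offset = 0 },
        { .root = 5, .inode = 257, .offset = 128 * 1024 },
    };

    printf("shared: %s\n", extent_is_shared(refs, 2) ? "yes" : "no");
    return 0;
}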


Yes, there is still a root cause hidden behind the whole thing: the slow 
(non-cached) backref walk, which not only affects the old check_shared() 
function but has an even bigger impact on qgroups.
(Walking all the backrefs of a fully deduped 1G file takes about 5 full 
seconds; at the default 128K dedupe size that is 1G / 128K = 8192 
backrefs.)

However, a cache for the backref walk is almost impossible to implement, 
as any snapshot can easily trash all the cache built up before.

For that case, we are going to add a TTL (time to live) to the in-memory 
cache after v4.7, to ensure a dedupe hash is trashed after a specified 
number of cache hits (64 or 128 seems quite reasonable to me; of course 
it's tunable).
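
As a toy illustration of that hit-count TTL (all names and numbers here 
are hypothetical, not from the patchset): each in-memory hash entry 
carries a remaining-hit budget, and the entry is trashed once the budget 
is spent, so no hash can stay in the cache forever.

#include <stdio.h>
#include <string.h>

#define HASH_LEN   32   /* SHA-256 digest size */
#define TTL_HITS   64   /* hypothetical default: evict after 64 hits */
#define TABLE_SIZE 1024

/* Hypothetical in-memory dedupe hash entry with a hit-count TTL. */
struct dedupe_entry {
    unsigned char hash[HASH_LEN];
    unsigned long long bytenr;  /* on-disk extent this hash maps to */
    int hits_left;              /* entry is evicted when this hits 0 */
    int in_use;
};

static struct dedupe_entry table[TABLE_SIZE];

/*
 * Look up a hash.  On a hit, spend one unit of TTL; once the budget
 * is exhausted the entry is dropped, so it must be re-inserted by a
 * fresh write before it can be matched again.
 */
static struct dedupe_entry *dedupe_lookup(const unsigned char *hash)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        struct dedupe_entry *e = &table[i];

        if (!e->in_use || memcmp(e->hash, hash, HASH_LEN))
            continue;
        if (--e->hits_left == 0)
            e->in_use = 0; /* TTL spent: trash the entry */
        return e;
    }
    return NULL;
}

/* Insert a new hash; if the table is full the hash is simply dropped. */
static void dedupe_insert(const unsigned char *hash,
                          unsigned long long bytenr)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].in_use) {
            memcpy(table[i].hash, hash, HASH_LEN);
            table[i].bytenr = bytenr;
            table[i].hits_left = TTL_HITS;
            table[i].in_use = 1;
            return;
        }
    }
}

int main(void)
{
    unsigned char h[HASH_LEN] = { 0xab };

    dedupe_insert(h, 12345);
    printf("hit: %s\n", dedupe_lookup(h) ? "yes" : "no");
    return 0;
}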

Oh, so now you're going to think I'm not fixing the thing?
It's an *EXISTING* defect, and it was dedupe that made *you* aware of 
the bug.

>
> I have to check the patches to be sure but perhaps you can tell me whether
> my hunch is correct or not.
>
>
> In fact, I actually asked privately for time to review your dedupe patches,
> but I've been literally so busy cleaning up after the mess you left in your
> last qgroups rewrite I haven't had time.
>
> You literally broke qgroups in almost every spot that matters. In some cases
> (drop_snapshot) you tore out working code and left in a /* TODO */ comment
> for someone else to complete.  Snapshot create was so trivially and
> completely broken by your changes that weeks later, I'm still hunting a
> solution which doesn't involve adding an extra _commit_ to our commit.  This
> is a MASSIVE regression from where we were before.

If you think my rework is a mess, then before it, qgroups were just 
rubbish: they couldn't even handle reflink between subvolumes 
(btrfs/091), not to mention the hell of leaked reserved space 
(btrfs/099).


You were just overlooking the old qgroup problems, assuming the old code 
worked, and using the corner cases to express your disappointment.

The rework is mainly focused on fixing the main and real part of 
qgroups: the normal accounting routine.

If you think the already-broken qgroups were better just because they 
could handle snapshot creation and subvolume deletion while everything 
else was broken, then feel free to revert all my qgroup rework and 
provide a better solution.

>
> IMHO, you should not be trusted with large features or rewrites until you
> can demonstrate:
>
>  - A willingness to *completely* solve the problem you are trying to 'fix',
>    not do half the job which someone else will have to complete for you.

OK, so just let qgroups stay broken forever?
Had you ever noticed the problem before the rework? What were you doing 
before that?
Did you ever want to fix it before the rework?

Yes, good corner cases plus good overall accounting would be perfect.

But it's super weird for you to think that bad overall accounting with 
good corner cases is better than good overall accounting with bad corner 
cases.

Anyone sane will fix the overall accounting first, and then the corner 
cases.
>
>  - Actual testing. The snapshot bug I reference above exists purely because
>    nobody created a snapshot inside of one and checked the qgroup numbers!

Just check the qgroup test cases.

The same words can be said back to you.

Why didn't you spot the leaked reserved space, the reflink problem, or 
even the long-failing test cases in the qgroup test group?

It's always easier to complain about others' work than to provide a good 
one.

Thanks,
Qu

>
> Sorry to be so harsh.
>    --Mark
>
> --
> Mark Fasheh
>
>




* Re: About in-band dedupe for v4.7
  2016-05-11  0:37 ` Chris Mason
@ 2016-05-11  1:40   ` Qu Wenruo
  2016-05-11  2:26     ` Satoru Takeuchi
  2016-05-11  4:22     ` Mark Fasheh
  0 siblings, 2 replies; 23+ messages in thread
From: Qu Wenruo @ 2016-05-11  1:40 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba, btrfs



Chris Mason wrote on 2016/05/10 20:37 -0400:
> On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
>> Hi, Chris, Josef and David,
>>
>> As merge window for v4.7 is coming, it would be good to hear your ideas
>> about the inband dedupe.
>>
>> We are addressing the ENOSPC problem which Josef pointed out, and we believe
>> the final fix patch would come out at the beginning of the merge
>> window.(Next week)
>>
>>
>> If it's fine, would you please consider to merge the in-memory backend
>> patchset for v4.7 as an experimental feature?
>>
>>
>> Most of the patch won't be changed from v10 patchset, only ENOSPC fix will
>> be updated, and ioctl patchset will introduce a new Kconfig option of "btrfs
>> experimental features" for inband dedupe.
>> (With explain about unstable ioctl/on-disk format for experimental features)
>>
>>
>> If you are all OK to merge inband dedupe in-memory backend, I'll prepare the
>> new v11 patchset for this merge.
>
> We have to balance the part where we really want the features to come
> in, and we want to lower the load on you to continue porting them.  But,
> I really do agree that we need strong test suites included with every
> major feature like this.
>
> -chris
>
>
That's fine.

We're running all the generic and btrfs test cases with dedupe enabled, 
by modifying xfstests to call "btrfs dedupe enable" just after mount, to 
ensure dedupe won't corrupt any existing test case.

And we are also enriching our dedupe test cases to reduce the stability 
concerns about the new feature.


BTW, does this mean in-band dedupe won't make the v4.7 merge window, and 
there is no need to submit a v11 patchset for this merge window?

Thanks,
Qu




* Re: About in-band dedupe for v4.7
  2016-05-11  1:40   ` Qu Wenruo
@ 2016-05-11  2:26     ` Satoru Takeuchi
  2016-05-11  4:22     ` Mark Fasheh
  1 sibling, 0 replies; 23+ messages in thread
From: Satoru Takeuchi @ 2016-05-11  2:26 UTC (permalink / raw)
  To: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba, btrfs

On 2016/05/11 10:40, Qu Wenruo wrote:
>
>
> Chris Mason wrote on 2016/05/10 20:37 -0400:
>> On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
>>> Hi, Chris, Josef and David,
>>>
>>> As merge window for v4.7 is coming, it would be good to hear your ideas
>>> about the inband dedupe.
>>>
>>> We are addressing the ENOSPC problem which Josef pointed out, and we believe
>>> the final fix patch would come out at the beginning of the merge
>>> window.(Next week)
>>>
>>>
>>> If it's fine, would you please consider to merge the in-memory backend
>>> patchset for v4.7 as an experimental feature?
>>>
>>>
>>> Most of the patch won't be changed from v10 patchset, only ENOSPC fix will
>>> be updated, and ioctl patchset will introduce a new Kconfig option of "btrfs
>>> experimental features" for inband dedupe.
>>> (With explain about unstable ioctl/on-disk format for experimental features)
>>>
>>>
>>> If you are all OK to merge inband dedupe in-memory backend, I'll prepare the
>>> new v11 patchset for this merge.
>>
>> We have to balance the part where we really want the features to come
>> in, and we want to lower the load on you to continue porting them.  But,
>> I really do agree that we need strong test suites included with every
>> major feature like this.
>>
>> -chris
>>
>>
> That's fine.
>
> We're running all generic and btrfs test case with dedupe enabled,
> by modifying xfstest to call "btrfs dedeup enable"
> just after mount, to ensure dedupe won't corrupt any existing test case.

How about writing an xfstests patch so that we can run
the above tests without modifying xfstests by hand?
That would make it easy for everyone to test dedupe.

Thanks,
Satoru

>
> And we are also enriching our dedupe test cases to reduce the stability concern on the new feature.
>
>
> BTW, does it mean inband dedup won't catch v4.7 merge window and no need to submit a v11 patchset for this merge window?
>
> Thanks,
> Qu
>
>


* Re: About in-band dedupe for v4.7
  2016-05-11  1:03   ` Qu Wenruo
@ 2016-05-11  2:52     ` Mark Fasheh
  2016-05-11  9:14       ` Qu Wenruo
  2016-05-11 17:36       ` David Sterba
  0 siblings, 2 replies; 23+ messages in thread
From: Mark Fasheh @ 2016-05-11  2:52 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Mason, Josef Bacik, David Sterba, btrfs

On Wed, May 11, 2016 at 09:03:24AM +0800, Qu Wenruo wrote:
> 
> 
> Mark Fasheh wrote on 2016/05/10 15:11 -0700:
> >On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
> >>Hi, Chris, Josef and David,
> >>
> >>As merge window for v4.7 is coming, it would be good to hear your
> >>ideas about the inband dedupe.
> >>
> >>We are addressing the ENOSPC problem which Josef pointed out, and we
> >>believe the final fix patch would come out at the beginning of the
> >>merge window.(Next week)
> >
> >How about the fiemap performance problem you referenced before? My guess is
> >that it happens because you don't coalesce writes into anything larger than
> >a page so you're stuck deduping at some silly size like 4k. This in turn
> >fragments the files so much that fiemap has a hard time walking backrefs.
> Nope. Default dedupe size is 128K, and minimal dedupe size is
> limited to 64K maximum to 8M.
> Yes, it's going to cause fragements, but just check the test case I
> submitted, it doesn't ever need dedupe to trigger the bug.
> Clone range will also trigger it.

Yes, as you may have seen, I've been looking through the patches now.


> >I have to check the patches to be sure but perhaps you can tell me whether
> >my hunch is correct or not.
> >
> >
> >In fact, I actually asked privately for time to review your dedupe patches,
> >but I've been literally so busy cleaning up after the mess you left in your
> >last qgroups rewrite I haven't had time.
> >
> >You literally broke qgroups in almost every spot that matters. In some cases
> >(drop_snapshot) you tore out working code and left in a /* TODO */ comment
> >for someone else to complete.  Snapshot create was so trivially and
> >completely broken by your changes that weeks later, I'm still hunting a
> >solution which doesn't involve adding an extra _commit_ to our commit.  This
> >is a MASSIVE regression from where we were before.
> 
> If you think my rework is a mess, then before that, qgroup is just
> rubbish, it can't even handle reflink between subvolume(btrfs/091),
> not to mention the hell of leaked reserved space(btrfs/099).

That's fine, I agree that the qgroups code was rubbish, but you replaced it
with something that was known to be broken. Just look at the /* TODO */ in
your original patch series. We were making incremental improvements before
and you threw that all out. Can you fault me for wondering what unsolved
problems await us in your dedupe patches?


> You were just overlooking the old qgroup things, thinking old one
> working and use the corner spot to express your disappointment.

No, I'm not happy about you leaving qgroups more broken than when you
started. There's fixing a bug, and there's trading one bug for several
others. You have done the latter.


> >IMHO, you should not be trusted with large features or rewrites until you
> >can demonstrate:
> >
> > - A willingness to *completely* solve the problem you are trying to 'fix',
> >   not do half the job which someone else will have to complete for you.
> 
> OK, just let the qgroup break forever?

You don't get credit for a partial solution!

Stepping back here for a second, I'm happy that you tried to fix a bug. I'm
extremely disappointed in how you went about it - with a laser focus on
fixing that one bug at the expense of others. Fixing reflink between
subvolumes does us no good if you broke subvolume create and subvolume
deletion - the accounting is still wrong! The same goes for reservations.
Who cares if they don't work when the basic accounting is reporting
subvolumes with -121231247283946 bytes in them.


> Have you ever noticed the problem before the rework? What are you
> doing before that?
> Did you ever want to fix it before the rework?
> 
> Yes good corner spot and good whole accounting is perfect.
> 
> But it's super wired for you to think bad whole accounting with good
> corner case is even better than good whole accounting with bad
> corner case.
> 
> Any sane will fix the whole part thing first, and then the corner case.

Where are your corner case fixes then? This is what I'm talking about with
respect to half complete solutions!


> > - Actual testing. The snapshot bug I reference above exists purely because
> >   nobody created a snapshot inside of one and checked the qgroup numbers!
> 
> Just check the qgroup test cases.
> 
> Same word can also be said to yourself.
> 
> Why didn't you spot leaked reserve space and reflink problem and
> even the long failed test cases of qgroup test group?
> 
> It's always easier to complain others work than provide a good one.

Just so we're clear I wrote none of the original qgroup code, so those would
be questions for the author. I was aware of some of the bugs, and also was
aware enough of your patches to warn you and the devs that they were leaving
problems unfixed. Why it got merged anyway you have to ask Chris or Josef.

As far as testing I can easily show that you didn't test the most basic
cases of snapshot create or snapshot delete with your qgroup rewrite. This
understandably gives me pause when you're asking for a large patch set to be
included in the next kernel.


Taking your history with qgroups out of this btw, my opinion does not
change.

With respect to in-memory only dedupe, it is my honest opinion that such a
limited feature is not worth the extra maintenance work. In particular
there's about 800 lines of code in the userspace patches which I'm sure
you'd want merged, because how could we test this then?

A couple of example sore points from my review so far:

- Internally you're using a mutex (instead of a spinlock) to lock out queries
 to the in-memory hash, which I can see becoming a performance problem in the
 write path.

- Also, we're doing SHA256 in the write path which I expect will
 slow it down even more dramatically. Given that all the work done gets
 thrown out every time we fill the hash (or remount), I just don't see much
 benefit to the user with this.

Users can get better dedupe via the ioctl today than with what you 
propose to merge as an experimental feature, so I don't see many people 
caring to test it. IMHO you would have to provide a more compelling 
reason to include this code.

Solve the hard problems first (dedupe with a disk backend, make in-memory
perform), then come asking for inclusion of your feature. Again, this I
would say about the patches regardless of whose name is on them.
	--Mark

--
Mark Fasheh


* Re: About in-band dedupe for v4.7
  2016-05-11  1:40   ` Qu Wenruo
  2016-05-11  2:26     ` Satoru Takeuchi
@ 2016-05-11  4:22     ` Mark Fasheh
  1 sibling, 0 replies; 23+ messages in thread
From: Mark Fasheh @ 2016-05-11  4:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Mason, Josef Bacik, David Sterba, btrfs

On Wed, May 11, 2016 at 09:40:51AM +0800, Qu Wenruo wrote:
> 
> 
> Chris Mason wrote on 2016/05/10 20:37 -0400:
> >On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
> >>Hi, Chris, Josef and David,
> >>
> >>As merge window for v4.7 is coming, it would be good to hear your ideas
> >>about the inband dedupe.
> >>
> >>We are addressing the ENOSPC problem which Josef pointed out, and we believe
> >>the final fix patch would come out at the beginning of the merge
> >>window.(Next week)
> >>
> >>
> >>If it's fine, would you please consider to merge the in-memory backend
> >>patchset for v4.7 as an experimental feature?
> >>
> >>
> >>Most of the patch won't be changed from v10 patchset, only ENOSPC fix will
> >>be updated, and ioctl patchset will introduce a new Kconfig option of "btrfs
> >>experimental features" for inband dedupe.
> >>(With explain about unstable ioctl/on-disk format for experimental features)
> >>
> >>
> >>If you are all OK to merge inband dedupe in-memory backend, I'll prepare the
> >>new v11 patchset for this merge.
> >
> >We have to balance the part where we really want the features to come
> >in, and we want to lower the load on you to continue porting them.  But,
> >I really do agree that we need strong test suites included with every
> >major feature like this.
> >
> >-chris
> >
> >
> That's fine.
> 
> We're running all generic and btrfs test case with dedupe enabled,
> by modifying xfstest to call "btrfs dedeup enable" just after mount,
> to ensure dedupe won't corrupt any existing test case.

As Satoru mentioned, this is something that everybody needs to be able to
run. I would also like to see some basic analysis done on write-heavy
workloads. I think it's fair to understand what sort of impact this will
have on the write path.
	--Mark

--
Mark Fasheh


* Re: About in-band dedupe for v4.7
  2016-05-11  2:52     ` Mark Fasheh
@ 2016-05-11  9:14       ` Qu Wenruo
  2016-05-11 17:36       ` David Sterba
  1 sibling, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2016-05-11  9:14 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Chris Mason, Josef Bacik, David Sterba, btrfs




> Solve the hard problems first (dedupe with a disk backend, make in-memory
> perform), then come asking for inclusion of your feature. Again, this I
> would say about the patches regardless of whose name is on them.
> 	--Mark
>

David has already agreed with the idea of merging the in-memory backend 
first. And the original idea of BTRFS_DEBUG (though I prefer an 
independent Kconfig option) also comes from him.

http://thread.gmane.org/gmane.comp.file-systems.btrfs/54704/focus=54784

We also expressed the idea of merging the in-memory dedupe backend first 
in the v9 patchset cover letter.

Thanks,
Qu






* Re: About in-band dedupe for v4.7
  2016-05-10  7:19 About in-band dedupe for v4.7 Qu Wenruo
  2016-05-10 22:11 ` Mark Fasheh
  2016-05-11  0:37 ` Chris Mason
@ 2016-05-11 16:39 ` David Sterba
  2 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2016-05-11 16:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Mason, Josef Bacik, btrfs

On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
> As merge window for v4.7 is coming, it would be good to hear your ideas 
> about the inband dedupe.

For me it's still in the process of review.

> We are addressing the ENOSPC problem which Josef pointed out, and we 
> believe the final fix patch would come out at the beginning of the merge 
> window.(Next week)
> 
> If it's fine, would you please consider to merge the in-memory backend 
> patchset for v4.7 as an experimental feature?
>
> Most of the patch won't be changed from v10 patchset, only ENOSPC fix 
> will be updated, and ioctl patchset will introduce a new Kconfig option 
> of "btrfs experimental features" for inband dedupe.
> (With explain about unstable ioctl/on-disk format for experimental features)
>
> If you are all OK to merge inband dedupe in-memory backend, I'll prepare 
> the new v11 patchset for this merge.

Given the unfixed bug(s) and the lack of review/acks from others, 4.7 
cannot be considered as a merging target.  I had promised to put the 
patchset branch into my for-next (regarding v10) but forgot to do it; 
too many patches to juggle, sorry.  Ideally, a feature like this should 
spend a full development cycle in -next so it gets enough exposure and 
testing.


* Re: About in-band dedupe for v4.7
  2016-05-10 22:11 ` Mark Fasheh
  2016-05-11  1:03   ` Qu Wenruo
@ 2016-05-11 16:56   ` David Sterba
  2016-05-13  3:13   ` Wang Shilong
  2 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2016-05-11 16:56 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Qu Wenruo, Chris Mason, Josef Bacik, btrfs

On Tue, May 10, 2016 at 03:11:19PM -0700, Mark Fasheh wrote:
> On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
> > Hi, Chris, Josef and David,
> > 
> > As merge window for v4.7 is coming, it would be good to hear your
> > ideas about the inband dedupe.
> > 
> > We are addressing the ENOSPC problem which Josef pointed out, and we
> > believe the final fix patch would come out at the beginning of the
> > merge window.(Next week)
> 
> How about the fiemap performance problem you referenced before? My guess is
> that it happens because you don't coalesce writes into anything larger than
> a page so you're stuck deduping at some silly size like 4k. This in turn
> fragments the files so much that fiemap has a hard time walking backrefs.
> 
> I have to check the patches to be sure but perhaps you can tell me whether
> my hunch is correct or not.
> 
> 
> In fact, I actually asked privately for time to review your dedupe patches,

You can also ask for time to review publicly, with me in CC.

> but I've been literally so busy cleaning up after the mess you left in your
> last qgroups rewrite I haven't had time.
> 
> You literally broke qgroups in almost every spot that matters. In some cases
> (drop_snapshot) you tore out working code and left in a /* TODO */ comment
> for someone else to complete.  Snapshot create was so trivially and
> completely broken by your changes that weeks later, I'm still hunting a
> solution which doesn't involve adding an extra _commit_ to our commit.  This
> is a MASSIVE regression from where we were before.

That's unfortunately true, and I take part of the blame for allowing
that.


* Re: About in-band dedupe for v4.7
  2016-05-11  2:52     ` Mark Fasheh
  2016-05-11  9:14       ` Qu Wenruo
@ 2016-05-11 17:36       ` David Sterba
  2016-05-12 20:54         ` Mark Fasheh
  2016-05-13  6:01         ` Zygo Blaxell
  1 sibling, 2 replies; 23+ messages in thread
From: David Sterba @ 2016-05-11 17:36 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Qu Wenruo, Chris Mason, Josef Bacik, btrfs

On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote:
> Taking your history with qgroups out of this btw, my opinion does not
> change.
> 
> With respect to in-memory only dedupe, it is my honest opinion that such a
> limited feature is not worth the extra maintenance work. In particular
> there's about 800 lines of code in the userspace patches which I'm sure
> you'd want merged, because how could we test this then?

I like the in-memory dedup backend. It's lightweight, only a heuristic, 
and does not need any IO or persistent storage. OTOH I consider it a 
subpart of the full in-band deduplication that does all the persistence 
etc., so I look at the ioctl interface from that broader perspective.

A use case I find interesting is to keep the in-memory dedup cache and 
then flush it to disk on demand, compared to automatically synced dedup 
(e.g. at commit time).

> A couple examples sore points in my review so far:
> 
> - Internally you're using a mutex (instead of a spinlock) to lock out queries
>  to the in-memory hash, which I can see becoming a performance problem in the
>  write path.
> 
> - Also, we're doing SHA256 in the write path which I expect will
>  slow it down even more dramatically. Given that all the work done gets
>  thrown out every time we fill the hash (or remount), I just don't see much
>  benefit to the user with this.

I had some ideas to use faster hashes and compute sha256 only when a 
hash is going to be stored on disk, but there were some concerns. The 
objection about the speed and performance hit at write time is valid, 
but we'll need to verify that with real performance tests, which haven't 
happened yet to my knowledge.
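
For the record, one way that two-tier hashing could look (a sketch under 
assumptions, not the proposed implementation; it assumes OpenSSL is 
available for SHA-256 and uses FNV-1a as a stand-in fast hash): the 
cheap 64-bit hash is the in-memory lookup key, and the expensive SHA-256 
is computed only when an entry is about to be persisted.  Because the 
fast hash can collide, a real implementation would still need a byte 
compare or the strong hash before actually sharing extents.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>  /* assumed available; link with -lcrypto */

/* Cheap non-cryptographic 64-bit FNV-1a hash for the write path. */
static uint64_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;

    while (len--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

struct block_hashes {
    uint64_t fast;                            /* in-memory lookup key */
    unsigned char sha[SHA256_DIGEST_LENGTH];  /* filled only on flush */
};

/* Write path: only the cheap hash is computed per written block. */
static void hash_for_memory(struct block_hashes *bh,
                            const void *block, size_t len)
{
    bh->fast = fnv1a(block, len);
}

/* Flush path: pay for SHA-256 only when storing the entry on disk. */
static void hash_for_disk(struct block_hashes *bh,
                          const void *block, size_t len)
{
    SHA256(block, len, bh->sha);
}

int main(void)
{
    char block[4096] = "example block contents";
    struct block_hashes bh;

    hash_for_memory(&bh, block, sizeof(block));
    hash_for_disk(&bh, block, sizeof(block));
    printf("fast key: %016llx\n", (unsigned long long)bh.fast);
    return 0;
}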

> Users can get better dedupe via the ioctl today than with what
> you propose go in as an experimental feature so I don't see many people
> caring to test it. IMHO you would have to provide a more compelling reason
> to include this code.

I see it as a complementary feature within btrfs's deduplication 
capabilities, covering more use cases.


* Re: About in-band dedupe for v4.7
  2016-05-11 17:36       ` David Sterba
@ 2016-05-12 20:54         ` Mark Fasheh
  2016-05-13  7:14           ` Duncan
                             ` (2 more replies)
  2016-05-13  6:01         ` Zygo Blaxell
  1 sibling, 3 replies; 23+ messages in thread
From: Mark Fasheh @ 2016-05-12 20:54 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Chris Mason, Josef Bacik, btrfs

On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:
> On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote:
> > Taking your history with qgroups out of this btw, my opinion does not
> > change.
> > 
> > With respect to in-memory only dedupe, it is my honest opinion that such a
> > limited feature is not worth the extra maintenance work. In particular
> > there's about 800 lines of code in the userspace patches which I'm sure
> > you'd want merged, because how could we test this then?
> 
> I like the in-memory dedup backend. It's lightweight, only a heuristic,
> does not need any IO or persistent storage. OTOH I consider it a subpart
> of the in-band deduplication that does all the persistency etc. So I
> treat the ioctl interface from a broader aspect.

Those are all nice qualities, but what do they all get us?

For example, my 'large' duperemove test involves about 750 gigabytes of
general purpose data - quite literally /home off my workstation.

After the run I'm usually seeing between 65-75 gigabytes saved, for a total
of only about 10% duplicated data (roughly 70 GB out of 750 GB). I would
expect this to be fairly 'average' - /home on my machine has the usual
stuff - documents, source code, media, etc.

So if you were writing your whole fs out you could expect about the same
from inline dedupe - 10%-ish. Let's be generous and go with that number
though as a general 'this is how much dedupe we get'.

What the memory backend is doing then is providing a cache of sha256/block
calculations. This cache is very expensive to fill, and every written block
must go through it. On top of that, the cache does not persist between
mounts, and has items regularly removed from it when we run low on memory.
All of this will drive down the amount of duplicated data we can find.

So our best case savings is probably way below 10% - let's be _really_ nice
and say 5%.

Now ask yourself the question - would you accept a write cache which is
expensive to fill and would only have a hit rate of less than 5%?

Oh and there's 800 lines of userspace we'd merge to manage this cache too,
kernel ioctls which would have to be finalized, etc.


> A usecase I find interesting is to keep the in-memory dedup cache and
> then flush it to disk on demand, compared to automatically synced dedup
> (eg. at commit time).

What's the benefit here? We're still going to be hashing blocks on the way
in, and if we're not deduping them at write time then we'll just have to
remove the extents and dedupe them later.


> > A couple examples sore points in my review so far:
> > 
> > - Internally you're using a mutex (instead of a spinlock) to lock out queries
> >  to the in-memory hash, which I can see becoming a performance problem in the
> >  write path.
> > 
> > - Also, we're doing SHA256 in the write path which I expect will
> >  slow it down even more dramatically. Given that all the work done gets
> >  thrown out every time we fill the hash (or remount), I just don't see much
> >  benefit to the user with this.
> 
> I had some ideas to use faster hashes and do sha256 when it's going to
> be stored on disk, but there were some concerns. The objection against
> speed and performance hit at write time is valid. But we'll need to
> verify that in real performance tests, which haven't happend yet up to
> my knowledge.

This is the type of thing that IMHO absolutely must be provided with each
code drop of the feature. Dedupe is nice but _nobody_ will use it if it's
slow. I know this from experience. I personally feel that btrfs has had
enough of 'cute' and 'almost working' features. If we want inline dedupe we
should do it correctly and with the right metrics from the beginning.


This is slightly unrelated to our discussion but my other unsolicited
opinion: As a kernel developer and maintainer of a file system for well over
a decade, I will say that balancing the number of out-of-tree patches is
necessary but we should never be accepting of large features just because
'they've been out for a long time'. Again I mention this because other parts
of the discussion felt like they were going in that direction.

Thanks,
	--Mark

--
Mark Fasheh


* Re: About in-band dedupe for v4.7
  2016-05-10 22:11 ` Mark Fasheh
  2016-05-11  1:03   ` Qu Wenruo
  2016-05-11 16:56   ` David Sterba
@ 2016-05-13  3:13   ` Wang Shilong
  2016-05-13  3:44     ` Qu Wenruo
  2016-05-16 16:40     ` David Sterba
  2 siblings, 2 replies; 23+ messages in thread
From: Wang Shilong @ 2016-05-13  3:13 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba, btrfs

Hello Guys,

        I am not commenting because of Qu's patch in particular; of course, Qu
and Mark Fasheh both do really good work contributing to Btrfs.

        Just my two cents!

1) I think currently we should really focus on Btrfs stability; there are
still many bugs hidden inside Btrfs - see Filipe's fixes flying in here
and there. Unfortunately, I have not seen the number of Btrfs bugs
decrease over these two years, even though we have frozen new features.
We adopted Btrfs internally some time ago, but it is really unstable,
unfortunately.

So why not sit down and make it stable first?

2) I am not against new features, but for a new feature I think we should
be really careful now. Especially if a new feature affects the normal
write/read path, I think the following things can be done to make it
better:

    -> Give us your design and documents (maybe put them on the wiki or
somewhere else) so that other people can really review the design instead
of looking at a bunch of code first. We really need to understand the
pros and cons of the design, and if there are TODOs, please do clarify
how we can solve those problems (or whether it is possible at all).

   -> We really need a lot of testing and benchmarking if a feature
affects the normal write/read path.

   -> I think Chris is nice to merge patches, but I would really argue
that next time we want to merge new features, please make sure at least
two other people have reviewed the patches.


Thank you!
Shilong

On Wed, May 11, 2016 at 6:11 AM, Mark Fasheh <mfasheh@suse.de> wrote:
> On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
>> Hi, Chris, Josef and David,
>>
>> As merge window for v4.7 is coming, it would be good to hear your
>> ideas about the inband dedupe.
>>
>> We are addressing the ENOSPC problem which Josef pointed out, and we
>> believe the final fix patch would come out at the beginning of the
>> merge window.(Next week)
>
> How about the fiemap performance problem you referenced before? My guess is
> that it happens because you don't coalesce writes into anything larger than
> a page so you're stuck deduping at some silly size like 4k. This in turn
> fragments the files so much that fiemap has a hard time walking backrefs.
>
> I have to check the patches to be sure but perhaps you can tell me whether
> my hunch is correct or not.
>
>
> In fact, I actually asked privately for time to review your dedupe patches,
> but I've been literally so busy cleaning up after the mess you left in your
> last qgroups rewrite I haven't had time.
>
> You literally broke qgroups in almost every spot that matters. In some cases
> (drop_snapshot) you tore out working code and left in a /* TODO */ comment
> for someone else to complete.  Snapshot create was so trivially and
> completely broken by your changes that weeks later, I'm still hunting a
> solution which doesn't involve adding an extra _commit_ to our commit.  This
> is a MASSIVE regression from where we were before.
>
> IMHO, you should not be trusted with large features or rewrites until you
> can demonstrate:
>
>  - A willingness to *completely* solve the problem you are trying to 'fix',
>    not do half the job which someone else will have to complete for you.
>
>  - Actual testing. The snapshot bug I reference above exists purely because
>    nobody created a snapshot inside of one and checked the qgroup numbers!
>
> Sorry to be so harsh.
>    --Mark
>
> --
> Mark Fasheh


* Re: About in-band dedupe for v4.7
  2016-05-13  3:13   ` Wang Shilong
@ 2016-05-13  3:44     ` Qu Wenruo
  2016-05-13  6:21       ` Zygo Blaxell
  2016-05-16 16:40     ` David Sterba
  1 sibling, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2016-05-13  3:44 UTC (permalink / raw)
  To: Wang Shilong; +Cc: Chris Mason, Josef Bacik, David Sterba, btrfs



Wang Shilong wrote on 2016/05/13 11:13 +0800:
> Hello Guys,
>
>         I am commenting not because of Qu's patch, of course, Qu and Mark Fasheh
> Do a really good thing for Btrfs contributions.
>
>         Just my two cents!
>
> 1) I think Currently, we should really focus on Btrfs stability, there
> are still many
> bugs hidden inside Btrfs, please See Filipe flighting patches here and
> there. Unfortunately,
> I have not seen Btrfs's Bug is reducing for these two years even we
> have frozen new features here.
> we are adopting Btrfs internally sometime before, but it is really
> unstable unfortunately.
>
> So Why not sit down and make it stable firstly?

That makes sense.

Although it may be contrary to your expectations, in-band dedupe did 
expose some existing bugs we had never found before.
And they are all reproducible without in-band dedupe.

Some examples:
1) fiemap bugs
    Not just one, but at least two.
    See the recently submitted generic/352 and 353.

    The fix is already under testing and may come out soon.

2) inode->outstanding_extents problems.
    Currently we use a hard-coded SZ_128M maximum extent length for
    non-compressed file extents.
    But if we change the limit to a smaller one, for example 128K,
    we get an outstanding-extents leak and tons of warnings.

    Although it won't affect the current code, it's still better to fix,
    and we're already testing that fix as well.

3) Slow backref walk.
    Noted in a comment in backref.c since ancient days, but we didn't
    pay much attention to it until in-band dedupe/heavy reflink
    workloads hit it.

Even if it's not that obvious, we are doing our best to stabilize btrfs 
during the push of in-band dedupe.
(Though a nitpicking jerk will never see this.)

But you are still quite right in this case: we may be in a rush to push it.

>
> 2)I am not against new feature, but for a new feature, I think we
> should be really
> careful now,  Especially if a new feature affect normal write/read
> path, I think following things can
> be done to make things better:

Although the impact on the normal routines is kept to a minimum, you're 
still right: we lack the overall documentation to explain the design, 
which tries to reduce the impact on the existing write routines.

>
>     ->Give your design and documents(maybe put it in wiki or somewhere
> else) So that
> other guys can really review your design instead of looking a bunch of
> codes firstly. And we really
> need understand pros and cons of design, also if there is TODO, please
> do clarity out how we
> can solve this problem(or it is possible?).

Right, this was already planned, but we were always busy with other fixes.

That will change dramatically in the next two months: as the 
modifications to the patchset are already minimal, we now have time to 
create and improve the documentation.

Thanks for the reminder.

>
>    ->We need reallly a lot of testing and benchmarking if it affects
> normal write/read path.
>
>    ->I think Chris is nice to merge patches, but I really argue next
> time if we want to merge new feautres
> Please make sure at least two other guys review for patches.
>
>
> Thank you!
> Shilong

Thanks for your suggestions, they really help a lot!

Thanks,
Qu




* Re: About in-band dedupe for v4.7
  2016-05-11 17:36       ` David Sterba
  2016-05-12 20:54         ` Mark Fasheh
@ 2016-05-13  6:01         ` Zygo Blaxell
  1 sibling, 0 replies; 23+ messages in thread
From: Zygo Blaxell @ 2016-05-13  6:01 UTC (permalink / raw)
  To: dsterba, Mark Fasheh, Qu Wenruo, Chris Mason, Josef Bacik, btrfs


On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:
> I like the in-memory dedup backend. It's lightweight, only a heuristic,
> does not need any IO or persistent storage. OTOH I consider it a subpart
> of the in-band deduplication that does all the persistency etc. So I
> treat the ioctl interface from a broader aspect.
> 
> A usecase I find interesting is to keep the in-memory dedup cache and
> then flush it to disk on demand, compared to automatically synced dedup
> (eg. at commit time).

The tradeoff depends on a lot of parameters, like your expected dup rate,
memory size, and seek latency.  If the dup rate is high (say 40%) and
your seek latency is high (low-cost spinning rust) and you don't have
enough RAM to load the whole hash table into memory, then an on-disk
dedup cache _itself_ creates an unusable I/O load.  Hash table lookups
generate random I/O, and at 40% dup rate every other block you write
requires a performance-crippling disk seek to read the half of the cache
that isn't in RAM.
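
To put rough numbers on that (every figure below is an illustrative 
assumption, not a measurement), a back-of-envelope calculation of the 
hash table size and the seek-bound write rate when only half of the 
table fits in RAM:

#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not measurements. */
    double fs_bytes     = 8e12;  /* 8 TB filesystem             */
    double block        = 4096;  /* 4 KiB dedup block size      */
    double entry_bytes  = 48;    /* 32-byte SHA-256 + location  */
    double ram_fraction = 0.5;   /* half the table fits in RAM  */
    double seeks_per_s  = 100;   /* commodity spinning disk     */

    double entries      = fs_bytes / block;
    double table_gb     = entries * entry_bytes / 1e9;
    /* Every written block looks up the table; lookups that land in
     * the uncached half each cost one random read (a seek). */
    double miss_rate    = 1.0 - ram_fraction;
    double writes_per_s = seeks_per_s / miss_rate;
    double mb_per_s     = writes_per_s * block / 1e6;

    printf("hash table: ~%.0f GB for %.0f entries\n",
           table_gb, entries);
    printf("seek-bound write rate: ~%.0f blocks/s (~%.1f MB/s)\n",
           writes_per_s, mb_per_s);
    return 0;
}

With those assumed numbers the table alone is ~94 GB, and the lookups 
that miss the cached half throttle writes to well under 1 MB/s, which 
is the "unusable I/O load" above.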

I looked at my parameters and concluded that an in-memory cache (with
persistence by saving the data at regular intervals) was the *only*
kind of cache I'd ever be able to use with any dedup implementation.

If the dup rate is lower or you're using SSD then you might trade some
IOs for more free RAM and consider an on-disk hash table with some sort of
paging scheme.  If you have huge amounts of RAM you don't need an on-disk
scheme at all--you can add persistence e.g. by trickle-writing the data
over the space of an hour to avoid adding a lot of latency or memory
pressure at once.

> > Users can get better dedupe via the ioctl today than with what
> > you propose go in as an experimental feature so I don't see many people
> > caring to test it. IMHO you would have to provide a more compelling reason
> > to include this code.
> 
> I see it as a complementary feature in the deduplication capabilities,
> covering more usecases.

If you have unlimited amounts of RAM, fast CPU, and slow disks, then it
certainly makes sense, even with the SHA256 hash.  That seems to be the
use case ZFS was designed for.




* Re: About in-band dedupe for v4.7
  2016-05-13  3:44     ` Qu Wenruo
@ 2016-05-13  6:21       ` Zygo Blaxell
  0 siblings, 0 replies; 23+ messages in thread
From: Zygo Blaxell @ 2016-05-13  6:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Wang Shilong, Chris Mason, Josef Bacik, David Sterba, btrfs


On Fri, May 13, 2016 at 11:44:40AM +0800, Qu Wenruo wrote:
> Although maybe out of your expectation, inband de-dedupe did exposed some
> existing bugs we didn't ever found before.
> And they are all reproducible without inband dedupe.
> 
> Some examples:
> [...]
> 3) Slow backref walk.
>    Already in the comment of backref.c from ancient days, but we didn't
>    put much concern until inband dedupe/heavily reflink work load.

Out-of-band dedup definitely hits that.  Most blocks are duplicated only
a few times, but a few common blocks are _really_ common.  I have some
filesystems where some extents have a few tens of thousands of references
by dedup, then a dozen snapshots are created to increase the ref count by
another order of magnitude.  'btrfs ins logical' takes hours to run on
these blocks--assuming the host doesn't run out of RAM and crash before
giving an answer.

Try doing a balance on a filesystem where some extents have a million+
references on a machine with 2G of RAM.  I _dare_ you.

It would be nice to get the bug fixes in sooner, whether or not the
upper layers of the in-band dedup feature are ready.



* Re: About in-band dedupe for v4.7
  2016-05-12 20:54         ` Mark Fasheh
@ 2016-05-13  7:14           ` Duncan
  2016-05-13 12:14           ` Austin S. Hemmelgarn
  2016-05-16 15:26           ` David Sterba
  2 siblings, 0 replies; 23+ messages in thread
From: Duncan @ 2016-05-13  7:14 UTC (permalink / raw)
  To: linux-btrfs

Mark Fasheh posted on Thu, 12 May 2016 13:54:26 -0700 as excerpted:

> For example, my 'large' duperemove test involves about 750 gigabytes of
> general purpose data - quite literally /home off my workstation.
> 
> After the run I'm usually seeing between 65-75 gigabytes saved for a
> total of only 10% duplicated data. I would expect this to be fairly
> 'average' - /home on my machine has the usual stuff - documents, source
> code, media, etc.
> 
> So if you were writing your whole fs out you could expect about the same
> from inline dedupe - 10%-ish. Let's be generous and go with that number
> though as a general 'this is how much dedupe we get'.
> 
> What the memory backend is doing then is providing a cache of
> sha256/block calculations. This cache is very expensive to fill, and
> every written block must go through it. On top of that, the cache does
> not persist between mounts, and has items regularly removed from it when
> we run low on memory. All of this will drive down the amount of
> duplicated data we can find.
> 
> So our best case savings is probably way below 10% - let's be _really_
> nice and say 5%.

My understanding is that this "general purpose data" use-case isn't being 
targeted by the in-memory dedup at all, because indeed it's a very poor 
fit for exactly the reason you explain.

Instead, think data centers where perhaps 50% of all files are duplicated 
thousands of times... and it's exactly those files that are most 
frequently used.  Totally different use-case, where that 5% on general 
purpose data could easily skyrocket to 50%+.

Refining that a bit, as I understand it, the idea with the in-memory-
inline dedup is pretty much opportunity-based dedup.  Where there's an 
easy opportunity seen, grab it, but don't go out of your way to do 
anything fancy.  Then somewhat later, a much more thorough offline dedup 
process will come along and dedup-pack everything else.

In that scenario a quick-opportunity 20% hit rate may be acceptable, 
while actual hit rates may approach 50% due to skew toward the common.  
Then the dedup-pack comes along and finishes things, possibly resulting 
in total savings of say 70% or so.  Even if the in-memory doesn't get 
that common-skew boost and ends up nearer 20%, that's still a significant 
savings for the initial inline result, with the dedup-packer coming along 
later to clean things up properly.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: About in-band dedupe for v4.7
  2016-05-12 20:54         ` Mark Fasheh
  2016-05-13  7:14           ` Duncan
@ 2016-05-13 12:14           ` Austin S. Hemmelgarn
  2016-05-13 14:25             ` Qu Wenruo
  2016-05-13 16:37             ` Zygo Blaxell
  2016-05-16 15:26           ` David Sterba
  2 siblings, 2 replies; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-13 12:14 UTC (permalink / raw)
  To: Mark Fasheh, dsterba, Qu Wenruo, Chris Mason, Josef Bacik, btrfs

On 2016-05-12 16:54, Mark Fasheh wrote:
> On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:
>> On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote:
>>> Taking your history with qgroups out of this btw, my opinion does not
>>> change.
>>>
>>> With respect to in-memory only dedupe, it is my honest opinion that such a
>>> limited feature is not worth the extra maintenance work. In particular
>>> there's about 800 lines of code in the userspace patches which I'm sure
>>> you'd want merged, because how could we test this then?
>>
>> I like the in-memory dedup backend. It's lightweight, only a heuristic,
>> does not need any IO or persistent storage. OTOH I consider it a subpart
>> of the in-band deduplication that does all the persistency etc. So I
>> treat the ioctl interface from a broader aspect.
>
> Those are all nice qualities, but what do they all get us?
>
> For example, my 'large' duperemove test involves about 750 gigabytes of
> general purpose data - quite literally /home off my workstation.
>
> After the run I'm usually seeing between 65-75 gigabytes saved for a total
> of only 10% duplicated data. I would expect this to be fairly 'average' -
> /home on my machine has the usual stuff - documents, source code, media,
> etc.
>
> So if you were writing your whole fs out you could expect about the same
> from inline dedupe - 10%-ish. Let's be generous and go with that number
> though as a general 'this is how much dedupe we get'.
>
> What the memory backend is doing then is providing a cache of sha256/block
> calculations. This cache is very expensive to fill, and every written block
> must go through it. On top of that, the cache does not persist between
> mounts, and has items regularly removed from it when we run low on memory.
> All of this will drive down the amount of duplicated data we can find.
>
> So our best case savings is probably way below 10% - let's be _really_ nice
> and say 5%.
>
> Now ask yourself the question - would you accept a write cache which is
> expensive to fill and would only have a hit rate of less than 5%?
In-band deduplication is a feature that is not used by typical desktop 
users or even many developers because it's computationally expensive, 
but it's used _all the time_ by big data-centers and similar places 
where processor time is cheap and storage efficiency is paramount. 
Deduplication is more useful in general the more data you have.  5% of 1 
TB is 50 GB, which is not much.  5% of 1 PB is 50 TB - several whole 
disks' worth, which can then be used for storing more data, or for 
providing better resiliency against failures.

To look at it another way, deduplicating an individual's home directory 
will almost never get you decent space savings, the majority of shared 
data is usually file headers and nothing more, which can't be 
deduplicated efficiently because of block size requirements. 
Deduplicating all the home directories on a terminal server with 500 
users usually will get you decent space savings, as there very likely 
are a number of files that multiple people have exact copies of, but 
most of them are probably not big files.  Deduplicating the entirety of 
a multi-petabyte file server used for storing VM disk images will 
probably save you a very significant amount of space, because the 
probability of having data that can be deduplicated goes up as you store 
more data, and there is likely to be a lot of data shared between the 
disk images.

This is exactly why I don't use deduplication on any of my personal 
systems.  On my laptop, the space saved is just not worth the time spent 
doing it, as I fall pretty solidly into the first case (most of the data 
duplication on my systems is in file headers).  On my home server, I'm 
not storing enough data with sufficient internal duplication that it 
would save more than 10-20 GB, which doesn't matter for me given that 
I'm using roughly half of the 2.2 TB of effective storage space I have. 
However, once we (eventually) get all the file servers where I work 
moved over to Linux systems running BTRFS, we will absolutely be using 
deduplication there, as we have enough duplication in our data that it 
will probably cut our storage requirements by around 20% on average.


* Re: About in-band dedupe for v4.7
  2016-05-13 12:14           ` Austin S. Hemmelgarn
@ 2016-05-13 14:25             ` Qu Wenruo
  2016-05-13 16:37             ` Zygo Blaxell
  1 sibling, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2016-05-13 14:25 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, dsterba, Qu Wenruo, Chris Mason,
	Josef Bacik, btrfs



On 05/13/2016 08:14 PM, Austin S. Hemmelgarn wrote:
> On 2016-05-12 16:54, Mark Fasheh wrote:
>> On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:
>>> On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote:
>>>> Taking your history with qgroups out of this btw, my opinion does not
>>>> change.
>>>>
>>>> With respect to in-memory only dedupe, it is my honest opinion that
>>>> such a
>>>> limited feature is not worth the extra maintenance work. In particular
>>>> there's about 800 lines of code in the userspace patches which I'm sure
>>>> you'd want merged, because how could we test this then?
>>>
>>> I like the in-memory dedup backend. It's lightweight, only a heuristic,
>>> does not need any IO or persistent storage. OTOH I consider it a subpart
>>> of the in-band deduplication that does all the persistency etc. So I
>>> treat the ioctl interface from a broader aspect.
>>
>> Those are all nice qualities, but what do they all get us?
>>
>> For example, my 'large' duperemove test involves about 750 gigabytes of
>> general purpose data - quite literally /home off my workstation.
>>
>> After the run I'm usually seeing between 65-75 gigabytes saved for a
>> total
>> of only 10% duplicated data. I would expect this to be fairly 'average' -
>> /home on my machine has the usual stuff - documents, source code, media,
>> etc.
>>
>> So if you were writing your whole fs out you could expect about the same
>> from inline dedupe - 10%-ish. Let's be generous and go with that number
>> though as a general 'this is how much dedupe we get'.
>>
>> What the memory backend is doing then is providing a cache of
>> sha256/block
>> calculations. This cache is very expensive to fill, and every written
>> block
>> must go through it. On top of that, the cache does not persist between
>> mounts, and has items regularly removed from it when we run low on
>> memory.
>> All of this will drive down the amount of duplicated data we can find.
>>
>> So our best case savings is probably way below 10% - let's be _really_
>> nice
>> and say 5%.
>>
>> Now ask yourself the question - would you accept a write cache which is
>> expensive to fill and would only have a hit rate of less than 5%?
> In-band deduplication is a feature that is not used by typical desktop
> users or even many developers because it's computationally expensive,
> but it's used _all the time_ by big data-centers and similar places
> where processor time is cheap and storage efficiency is paramount.
> Deduplication is more useful in general the more data you have.  5% of
> 1 TB is 50 GB, which is not much.  5% of 1 PB is 50 TB, which is at
> least a half-dozen disks, which can then be used for storing more data,
> or providing better resiliency against failures.
>
> To look at it another way, deduplicating an individual's home directory
> will almost never get you decent space savings, the majority of shared
> data is usually file headers and nothing more, which can't be
> deduplicated efficiently because of block size requirements.
> Deduplicating all the home directories on a terminal server with 500
> users usually will get you decent space savings, as there very likely
> are a number of files that multiple people have exact copies of, but
> most of them are probably not big files.  Deduplicating the entirety of
> a multi-petabyte file server used for storing VM disk images will
> probably save you a very significant amount of space, because the
> probability of having data that can be deduplicated goes up as you store
> more data, and there is likely to be a lot of data shared between the
> disk images.
>
> This is exactly why I don't use deduplication on any of my personal
> systems.  On my laptop, the space saved is just not worth the time spent
> doing it, as I fall pretty solidly into the first case (most of the data
> duplication on my systems is in file headers).  On my home server, I'm
> not storing enough data with sufficient internal duplication that it
> would save more than 10-20 GB, which doesn't matter for me given that
> I'm using roughly half of the 2.2 TB of effective storage space I have.
> However, once we (eventually) get all the file servers where I work
> moved over to Linux systems running BTRFS, we will absolutely be using
> deduplication there, as we have enough duplication in our data that it
> will probably cut our storage requirements by around 20% on average.
> --

Nice to hear that.

Although the feature is not going to be merged soon, there is some 
info which may help anyone interested in inband dedupe to plan their 
usage or tests.

1) CPU core number
    Currently in-band dedupe spreads all SHA256 calculation across the
    btrfs async-thread pool, whose default size is
    min(nr_cpus + 2, 8).
    So with the default thread_pool size, if the system doesn't have
    more than 8 cores, all CPUs can easily be eaten up by dedupe.

    So either use dedupe with the default thread_pool setting when you
    have more than 8 cores, or tune thread_pool down to nr_cpus - 1 or
    nr_cpus - 2.

    Otherwise your system will be less responsive than when running
    delalloc (fsync/sync).
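
    To make the sizing concrete, here is a rough, hypothetical
    userspace sketch (the min() above mirrors the kernel default;
    thread_pool= is the existing btrfs mount option):

/*
 * Illustration only: compute the default btrfs thread_pool size
 * (min(nr_cpus + 2, 8)) and a conservative value that leaves a
 * couple of cores free for dedupe-heavy workloads.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
	long def = (nr_cpus + 2 < 8) ? nr_cpus + 2 : 8;
	long tuned = (nr_cpus > 2) ? nr_cpus - 2 : 1;

	printf("default thread_pool: %ld\n", def);
	printf("suggested for dedupe: mount -o thread_pool=%ld\n", tuned);
	return 0;
}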

2) Dedupe speed.
    In our internal single-core test, it's about the same speed as the
    compress=lzo mount option, or a little faster when the data doesn't
    compress well (80~100% compression ratio).

    We will do more complete performance tests based on file contents
    (compress ratio and dedupe ratio), core numbers, and more different
    settings like dedupe bs, when writing all these things into the
    wiki.

    But personally I recommend thinking twice before combining compress
    with dedupe.
    It really depends on the file contents.

3) Per-file/dir control.
    As with compression, don't forget this useful flag (although it
    will be delayed even further). It can save a lot of CPU time; see
    the sketch below.

    If tuned well (the admin knows the dedupe hotspots), the space
    saving can go up to 40% for specific dirs, while not spending any
    extra CPU on other data.
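
    The per-file model here is the same one compression already uses
    (the FS_COMPR_FL bit set by 'chattr +c'); a dedupe flag would
    presumably follow the same pattern, but it is not merged, so this
    minimal sketch only shows the existing compression analog:

/*
 * Minimal sketch: enable per-file compression the way 'chattr +c'
 * does. A hypothetical per-file dedupe flag would follow the same
 * FS_IOC_GETFLAGS/FS_IOC_SETFLAGS pattern.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	int flags;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		perror(argv[1]);
		return 1;
	}
	flags |= FS_COMPR_FL;	/* the bit 'chattr +c' sets */
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
		perror("FS_IOC_SETFLAGS");
		return 1;
	}
	return 0;
}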

To conclude, this feature is not intended to be used without any 
thinking/reading/checking/tuning, and it is not for everyone.

That said, anyone can try dedupe very easily on an existing btrfs with 
a patched kernel, and then mount it back with a stable kernel after 
the trial.

Just as Austin said, it's a useful feature for large storage servers 
where CPU time is relatively cheap compared to storage, or where the 
data is specially patterned for dedupe.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: About in-band dedupe for v4.7
  2016-05-13 12:14           ` Austin S. Hemmelgarn
  2016-05-13 14:25             ` Qu Wenruo
@ 2016-05-13 16:37             ` Zygo Blaxell
  1 sibling, 0 replies; 23+ messages in thread
From: Zygo Blaxell @ 2016-05-13 16:37 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Mark Fasheh, dsterba, Qu Wenruo, Chris Mason, Josef Bacik, btrfs

[-- Attachment #1: Type: text/plain, Size: 2761 bytes --]

On Fri, May 13, 2016 at 08:14:10AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-12 16:54, Mark Fasheh wrote:
> >Now ask yourself the question - would you accept a write cache which is
> >expensive to fill and would only have a hit rate of less than 5%?
> In-band deduplication is a feature that is not used by typical desktop users
> or even many developers because it's computationally expensive, but it's
> used _all the time_ by big data-centers and similar places where processor
> time is cheap and storage efficiency is paramount. Deduplication is more
> useful in general the more data you have.  5% of 1 TB is 50 GB, which is not
> much.  5% of 1 PB is 50 TB, which is at least a half-dozen disks, which can
> then be used for storing more data, or providing better resiliency against
> failures.

There's also a big cost difference between a 5-drive and 6-drive array
when all your servers are built with 5-drive cages.  Delay that expansion
for six months, and you can buy 5 larger drives for the same price.

I have laptops with filesystems that are _just_ over 512GB before dedup.
SSDs seem to come in 512GB and 1TB sizes with nothing in between, so
saving even a few dozen MB can translate into hundreds of dollars per
user (not to mention the bureaucratic hardware provisioning side-effects
which easily triple that cost).

Working on big software integration projects can generate 60% duplication
rates.  Think checkout of entire OS + disk images + embedded media,
build + install trees with copied data, long build times so there are
multiple working directories, multiple active branches of the same
project, and oh yeah it's all in SVN, which stores a complete second
copy of everything on disk just because wasting space is cool.

> To look at it another way, deduplicating an individual's home directory will
> almost never get you decent space savings, the majority of shared data is
> usually file headers and nothing more, which can't be deduplicated
> efficiently because of block size requirements. Deduplicating all the home
> directories on a terminal server with 500 users usually will get you decent
> space savings, as there very likely are a number of files that multiple
> people have exact copies of, but most of them are probably not big files.
> Deduplicating the entirety of a multi-petabyte file server used for storing
> VM disk images will probably save you a very significant amount of space,
> because the probability of having data that can be deduplicated goes up as
> you store more data, and there is likely to be a lot of data shared between
> the disk images.

Mail stores benefit from this too.  For some reason, Microsoft Office
users seem to enjoy emailing multi-megabyte runs of MIME-encoded zeros
to each other.  ;)


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: About in-band dedupe for v4.7
  2016-05-12 20:54         ` Mark Fasheh
  2016-05-13  7:14           ` Duncan
  2016-05-13 12:14           ` Austin S. Hemmelgarn
@ 2016-05-16 15:26           ` David Sterba
  2 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2016-05-16 15:26 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Qu Wenruo, Chris Mason, Josef Bacik, btrfs

On Thu, May 12, 2016 at 01:54:26PM -0700, Mark Fasheh wrote:
> On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:
> > On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote:
> > > Taking your history with qgroups out of this btw, my opinion does not
> > > change.
> > > 
> > > With respect to in-memory only dedupe, it is my honest opinion that such a
> > > limited feature is not worth the extra maintenance work. In particular
> > > there's about 800 lines of code in the userspace patches which I'm sure
> > > you'd want merged, because how could we test this then?
> > 
> > I like the in-memory dedup backend. It's lightweight, only a heuristic,
> > does not need any IO or persistent storage. OTOH I consider it a subpart
> > of the in-band deduplication that does all the persistency etc. So I
> > treat the ioctl interface from a broader aspect.
> 
> Those are all nice qualities, but what do they all get us?
> 
> For example, my 'large' duperemove test involves about 750 gigabytes of
> general purpose data - quite literally /home off my workstation.

Home directory is IMO a typical anti-usecase for in-band deduplication.

> After the run I'm usually seeing between 65-75 gigabytes saved for a total
> of only 10% duplicated data. I would expect this to be fairly 'average' -
> /home on my machine has the usual stuff - documents, source code, media,
> etc.
> 
> So if you were writing your whole fs out you could expect about the same
> from inline dedupe - 10%-ish. Let's be generous and go with that number
> though as a general 'this is how much dedupe we get'.
> 
> What the memory backend is doing then is providing a cache of sha256/block
> calculations. This cache is very expensive to fill, and every written block
> must go through it. On top of that, the cache does not persist between
> mounts, and has items regularly removed from it when we run low on memory.
> All of this will drive down the amount of duplicated data we can find.
> 
> So our best case savings is probably way below 10% - let's be _really_ nice
> and say 5%.
> 
> Now ask yourself the question - would you accept a write cache which is
> expensive to fill and would only have a hit rate of less than 5%?

I'd ask myself a different question: would I turn on a write cache if
the utilization is that low? No.

The in-memory cache is not going to be on by default, it's going to help
in specific scenarios, as described in other replies.
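
To make that concrete: the backend boils down to a bounded in-memory
hash -> extent map that is cheap to drop and loses entries under memory
pressure. A deliberately tiny userspace sketch of the idea (names and
the eviction policy are made up for illustration; this is not the
patchset's code):

#include <stdint.h>
#include <string.h>

#define DEDUP_SLOTS 4096	/* bounded: memory pressure == lost entries */

struct dedup_slot {
	uint8_t hash[32];	/* SHA-256 of one dedupe-bs sized block */
	uint64_t bytenr;	/* where a copy already lives on disk */
	int used;
};

static struct dedup_slot cache[DEDUP_SLOTS];

/*
 * Return the existing block address on a hit, 0 on a miss (remembering
 * the new block and evicting whatever occupied the slot before).
 */
static uint64_t dedup_lookup_or_insert(const uint8_t hash[32],
				       uint64_t bytenr)
{
	uint64_t idx;
	struct dedup_slot *s;

	memcpy(&idx, hash, sizeof(idx));	/* digest bits as index */
	s = &cache[idx % DEDUP_SLOTS];

	if (s->used && !memcmp(s->hash, hash, 32))
		return s->bytenr;	/* duplicate: point the write at it */

	memcpy(s->hash, hash, 32);	/* miss: remember this block */
	s->bytenr = bytenr;
	s->used = 1;
	return 0;
}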

> Oh and there's 800 lines of userspace we'd merge to manage this cache too,
> kernel ioctls which would have to be finalized, etc.

Most of the userspace code is going to be shared with the generic dedup.
I haven't reviewed the userspace because the ioctl interfaces are not
settled so the code can change, but for now it should be enough to test
the feature.

> > A usecase I find interesting is to keep the in-memory dedup cache and
> > then flush it to disk on demand, compared to automatically synced dedup
> > (eg. at commit time).
> 
> What's the benefit here? We're still going to be hashing blocks on the way
> in, and if we're not deduping them at write time then we'll just have to
> remove the extents and dedupe them later.
> 
> 
> > > A couple of example sore points in my review so far:
> > > 
> > > - Internally you're using a mutex (instead of a spinlock) to lock out queries
> > >  to the in-memory hash, which I can see becoming a performance problem in the
> > >  write path.
> > > 
> > > - Also, we're doing SHA256 in the write path which I expect will
> > >  slow it down even more dramatically. Given that all the work done gets
> > >  thrown out every time we fill the hash (or remount), I just don't see much
> > >  benefit to the user with this.
> > 
> > I had some ideas to use faster hashes and do sha256 when it's going to
> > be stored on disk, but there were some concerns. The objection about
> > the speed and performance hit at write time is valid. But we'll need to
> > verify that in real performance tests, which haven't happened yet to my
> > knowledge.
> 
> This is the type of thing that IMHO absolutely must be provided with each
> code drop of the feature. Dedupe is nice but _nobody_ will use it if it's
> slow. I know this from experience. I personally feel that btrfs has had
> enough of 'cute' and 'almost working' features. If we want inline dedupe we
> should do it correctly and with the right metrics from the beginning.

That's why this feature is on the slow-merge track and I'm glad that so
many people are interested. And I hope it's not just because Qu asked to
merge it. I'm pushing back on the interface side; so far I think the
feature will address real usecases.
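
On the faster-hash idea quoted above, the shape would be a cheap filter
on the write path with SHA-256 computed only for likely matches. A
self-contained sketch (FNV-1a as a stand-in fast hash; the byte compare
at the end stands in for the SHA-256 digest comparison a real backend
would do):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Cheap first-pass hash (FNV-1a), far cheaper than SHA-256 on the
 * write path. */
static uint64_t fast_hash(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}

/* Stand-in for the expensive confirmation step: a real backend would
 * compare stored SHA-256 digests instead of the raw bytes. */
static bool strong_confirm(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len) == 0;
}

static bool blocks_duplicate(const void *a, const void *b, size_t len)
{
	if (fast_hash(a, len) != fast_hash(b, len))
		return false;			/* most writes stop here */
	return strong_confirm(a, b, len);	/* rare, expensive path */
}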

To evaluate the feature we have to put it into some more visible git
tree, not only to see how it plays together with other changes, but
because the number of people who would apply the patches from the
mailing list is low. Leaving the patchset out of a for-next is easy if
we encounter problems.

> This is slightly unrelated to our discussion but my other unsolicited
> opinion: As a kernel developer and maintainer of a file system for well over
> a decade I will say that balancing the number of out of tree patches is
> necessary but we should never be accepting of large features just because
> 'they've been out for a long time'. Again I mention this because other parts
> of the discussion felt like they were going in that direction.

The amount of attention is higher if a patchset goes from out-of-tree to
some staging level, which would be the for-next when it's going via my
trees. But this does not say anything about the timeframe of merge. The
people involved usually know where it stands but this might not be
obvious to everybody as the discussion could happen on IRC or in private
mails. Changelogs of patchset revisions usually track the progress.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: About in-band dedupe for v4.7
  2016-05-13  3:13   ` Wang Shilong
  2016-05-13  3:44     ` Qu Wenruo
@ 2016-05-16 16:40     ` David Sterba
  1 sibling, 0 replies; 23+ messages in thread
From: David Sterba @ 2016-05-16 16:40 UTC (permalink / raw)
  To: Wang Shilong
  Cc: Mark Fasheh, Qu Wenruo, Chris Mason, Josef Bacik, David Sterba, btrfs

On Fri, May 13, 2016 at 11:13:18AM +0800, Wang Shilong wrote:
>         I am commenting not because of Qu's patch, of course, Qu and Mark Fasheh
> Do a really good thing for Btrfs contributions.
> 
>         Just my two cents!
> 
> 1) I think currently we should really focus on Btrfs stability; there
> are still many bugs hidden inside Btrfs, please see Filipe's fix
> patches flying here and there. Unfortunately, I have not seen Btrfs's
> bug count reducing over these two years, even though we have frozen new
> features here.

So this means that the number of bugs is increasing despite all the
bugfixes that get merged? I wonder if you have numbers to back that or
if it's just handwaving. Complex code has bugs and it's not easy to
discover them, that's not surprising.

> We were adopting Btrfs internally some time ago, but it was really
> unstable, unfortunately.
> 
> So why not sit down and make it stable first?

And do you think this does not happen? Most of the patches that are merged
are fixes or updates related to fixes. The number of new big features is
at most one per release.

> 2) I am not against new features, but for a new feature I think we
> should be really careful now, especially if it affects the normal
> write/read path. I think the following things can be done to make
> things better:
> 
>     -> Give your design and documents (maybe put them in the wiki or
> somewhere else) so that other people can really review your design
> instead of looking at a bunch of code first. And we really need to
> understand the pros and cons of the design; also, if there is a TODO,
> please clarify how we can solve the problem (or whether it is possible
> at all).

The design document should come as the 0-th patch in a series. This
patchset had the overall idea described in
1438072250-2871-1-git-send-email-quwenruo@cn.fujitsu.com; further
comments about details are scattered under the patchset revisions, so
this could be improved so that everybody knows what the consensus is.

>    -> We really need a lot of testing and benchmarking if it affects
> the normal write/read path.

Sure, new tests and test results are welcome but they do not come in
hundreds.

>    -> I think it is nice of Chris to merge patches, but I really urge
> that next time, if we want to merge new features, please make sure at
> least two other people review the patches.

Thanks for the advice, but that's nothing new. People willing to do
reviews are scarce and overloaded. More reviews are not a problem;
everybody does them in a different style, and the more we have, the
better coverage we'll get in the end.

Doing reviews cannot be forced, otherwise the quality would drop;
announcing a 'reviewer of the month' would not help either. Everybody
who considers themselves a frequent contributor and knows some area
should also be self-motivated to review patches that touch it.
Reviewing has a positive educational effect, as one has to make sure he
understands what the old code does and how it's going to be changed by
the patch. There's usually overlap with other areas, so this slowly
extends one's expertise, which has a good impact on the developer
community etc.

Maintainers do reviews, but not necessarily as deep as another
developer could do, because sometimes it's difficult or too lengthy to
get familiar (again) with some particular code or code path. This leads
to a slow merging pace or, worse, to merging buggy patches if the
review misses some dark corners.

I think I'm stating the obvious but it should not hurt to repeat this
from time to time.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-05-16 16:41 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-10  7:19 About in-band dedupe for v4.7 Qu Wenruo
2016-05-10 22:11 ` Mark Fasheh
2016-05-11  1:03   ` Qu Wenruo
2016-05-11  2:52     ` Mark Fasheh
2016-05-11  9:14       ` Qu Wenruo
2016-05-11 17:36       ` David Sterba
2016-05-12 20:54         ` Mark Fasheh
2016-05-13  7:14           ` Duncan
2016-05-13 12:14           ` Austin S. Hemmelgarn
2016-05-13 14:25             ` Qu Wenruo
2016-05-13 16:37             ` Zygo Blaxell
2016-05-16 15:26           ` David Sterba
2016-05-13  6:01         ` Zygo Blaxell
2016-05-11 16:56   ` David Sterba
2016-05-13  3:13   ` Wang Shilong
2016-05-13  3:44     ` Qu Wenruo
2016-05-13  6:21       ` Zygo Blaxell
2016-05-16 16:40     ` David Sterba
2016-05-11  0:37 ` Chris Mason
2016-05-11  1:40   ` Qu Wenruo
2016-05-11  2:26     ` Satoru Takeuchi
2016-05-11  4:22     ` Mark Fasheh
2016-05-11 16:39 ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.