* Any chance to get snapshot-aware defragmentation?
@ 2018-05-11 15:22 Niccolò Belli
  2018-05-18 16:20 ` David Sterba
  0 siblings, 1 reply; 18+ messages in thread
From: Niccolò Belli @ 2018-05-11 15:22 UTC (permalink / raw)
  To: linux-btrfs

Hi,
I've been waiting for this feature for years, and initially it seemed
like something that would be worked on sooner or later.
A long time has passed without any progress, so I would like to know
whether there is a technical limitation preventing it or whether it is
something that could land in the near future.

Thanks,
Niccolò


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-11 15:22 Any chance to get snapshot-aware defragmentation? Niccolò Belli
@ 2018-05-18 16:20 ` David Sterba
  2018-05-18 16:36   ` Niccolò Belli
  0 siblings, 1 reply; 18+ messages in thread
From: David Sterba @ 2018-05-18 16:20 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: linux-btrfs

On Fri, May 11, 2018 at 05:22:26PM +0200, Niccolò Belli wrote:
> I'm waiting for this feature since years and initially it seemed like 
> something which would have been worked on, sooner or later.
> A long time had passed without any progress on this, so I would like to 
> know if there is any technical limitation preventing this or if it's 
> something which could possibly land in the near future.

Josef started working on that in 2014 but did not finish it. The
patches can still be found in his tree. The problem is excessive memory
consumption when there are many snapshots that need to be tracked
during the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers); there may still be some
problems, but fundamentally it should work.

I'd like to get the snapshot-aware defrag working again too; we'd need
to find a volunteer to resume the work on the patchset.


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 16:20 ` David Sterba
@ 2018-05-18 16:36   ` Niccolò Belli
  2018-05-18 17:10     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 18+ messages in thread
From: Niccolò Belli @ 2018-05-18 16:36 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs

On Friday, 18 May 2018 at 18:20:51 CEST, David Sterba wrote:
> Josef started working on that in 2014 and did not finish it. The patches
> can be still found in his tree. The problem is in excessive memory
> consumption when there are many snapshots that need to be tracked during
> the defragmentation, so there are measures to avoid OOM. There's
> infrastructure ready for use (shrinkers), there are maybe some problems
> but fundamentally is should work.
>
> I'd like to get the snapshot-aware working again too, we'd need to find
> a volunteer to resume the work on the patchset.

Yeah, I know of Josef's work, but 4 years have passed since then
without any news on this front.

What I would really like to know is why nobody resumed his work: is it
because it's impossible to implement snapshot-aware defrag without
excessive RAM usage, or is it simply because nobody is interested?

Niccolò


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 16:36   ` Niccolò Belli
@ 2018-05-18 17:10     ` Austin S. Hemmelgarn
  2018-05-18 17:18       ` Niccolò Belli
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-18 17:10 UTC (permalink / raw)
  To: Niccolò Belli, David Sterba; +Cc: linux-btrfs

On 2018-05-18 12:36, Niccolò Belli wrote:
> On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote:
>> Josef started working on that in 2014 and did not finish it. The patches
>> can be still found in his tree. The problem is in excessive memory
>> consumption when there are many snapshots that need to be tracked during
>> the defragmentation, so there are measures to avoid OOM. There's
>> infrastructure ready for use (shrinkers), there are maybe some problems
>> but fundamentally is should work.
>>
>> I'd like to get the snapshot-aware working again too, we'd need to find
>> a volunteer to resume the work on the patchset.
> 
> Yeah I know of Josef's work, but 4 years had passed since then without 
> any news on this front.
> 
> What I would really like to know is why nobody resumed his work: is it 
> because it's impossible to implement snapshot-aware degram without 
> excessive ram usage or is it simply because nobody is interested?
I think it's because nobody who is interested has both the time and the 
coding skills to tackle it.

Personally though, I think the biggest issue with what was done was
not the memory consumption, but the fact that there was no switch to
turn it on or off.  Making defrag unconditionally snapshot-aware
removes one of the easiest ways to forcibly unshare data without
otherwise altering the files (which, as stupid as it sounds, is
actually really useful for some storage setups), and it also forces
people who have ridiculous numbers of snapshots to either deal with
the memory usage or never defrag.
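
For illustration, a rough sketch of that "forcible unshare" use case as
it works today with the current, non-snapshot-aware defrag (the paths
here are made up, and `btrfs filesystem du` needs a reasonably recent
btrfs-progs):

  # Rewriting the file's extents in the writable subvolume breaks its
  # reflinks to the snapshot copies, leaving the live copy with its own
  # physical extents.
  btrfs filesystem defragment -v /mnt/@/var/lib/mysql/ibdata1
  # Check the result: "Exclusive" should now be close to "Total".
  btrfs filesystem du -s /mnt/@/var/lib/mysql/ibdata1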



* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 17:10     ` Austin S. Hemmelgarn
@ 2018-05-18 17:18       ` Niccolò Belli
  2018-05-18 18:33         ` Austin S. Hemmelgarn
  2018-05-18 23:55       ` Tomasz Pala
  2018-05-21 17:43       ` David Sterba
  2 siblings, 1 reply; 18+ messages in thread
From: Niccolò Belli @ 2018-05-18 17:18 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: David Sterba, linux-btrfs

On Friday, 18 May 2018 at 19:10:02 CEST, Austin S. Hemmelgarn wrote:
> and also forces the people who have ridiculous numbers of 
> snapshots to deal with the memory usage or never defrag

Whoever has at least one snapshot is never going to defrag anyway, unless 
he is willing to double the used space.

Niccolò


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 17:18       ` Niccolò Belli
@ 2018-05-18 18:33         ` Austin S. Hemmelgarn
  2018-05-18 22:26           ` Chris Murphy
  2018-05-19  8:54           ` Niccolò Belli
  0 siblings, 2 replies; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-18 18:33 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: David Sterba, linux-btrfs

On 2018-05-18 13:18, Niccolò Belli wrote:
> On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
>> and also forces the people who have ridiculous numbers of snapshots to 
>> deal with the memory usage or never defrag
> 
> Whoever has at least one snapshot is never going to defrag anyway, 
> unless he is willing to double the used space.
> 
With a bit of work, it's possible to handle things sanely.  You can 
deduplicate data from snapshots, even if they are read-only (you need to 
pass the `-A` option to duperemove and run it as root), so it's 
perfectly reasonable to only defrag the main subvolume, and then 
deduplicate the snapshots against that (so that they end up all being 
reflinks to the main subvolume).  Of course, this won't work if you're 
short on space, but if you're dealing with snapshots, you should have 
enough space that this will work (because even without defrag, it's 
fully possible for something to cause the snapshots to suddenly take up 
a lot more space).
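
As a rough sketch of that workflow (the subvolume and snapshot paths
are made up, and the duperemove options may need adjusting for your
version):

  # 1) Defragment only the writable subvolume.
  btrfs filesystem defragment -r -v /mnt/@
  # 2) As root, deduplicate the read-only snapshots against it, so the
  #    snapshots end up as reflinks into the freshly defragmented
  #    extents.
  duperemove -rdh -A --hashfile=/var/tmp/dedupe.hash /mnt/@ /mnt/snapshots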


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 18:33         ` Austin S. Hemmelgarn
@ 2018-05-18 22:26           ` Chris Murphy
  2018-05-18 22:46             ` Omar Sandoval
  2018-05-19  8:54           ` Niccolò Belli
  1 sibling, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2018-05-18 22:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Niccolò Belli, David Sterba, Btrfs BTRFS

On Fri, May 18, 2018 at 12:33 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2018-05-18 13:18, Niccolò Belli wrote:
>>
>> On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
>>>
>>> and also forces the people who have ridiculous numbers of snapshots to
>>> deal with the memory usage or never defrag
>>
>>
>> Whoever has at least one snapshot is never going to defrag anyway, unless
>> he is willing to double the used space.
>>
> With a bit of work, it's possible to handle things sanely.  You can
> deduplicate data from snapshots, even if they are read-only (you need to
> pass the `-A` option to duperemove and run it as root), so it's perfectly
> reasonable to only defrag the main subvolume, and then deduplicate the
> snapshots against that (so that they end up all being reflinks to the main
> subvolume).  Of course, this won't work if you're short on space, but if
> you're dealing with snapshots, you should have enough space that this will
> work (because even without defrag, it's fully possible for something to
> cause the snapshots to suddenly take up a lot more space).


Curiously, snapshot-aware defragmentation is going to increase free
space fragmentation. For busy in-use systems, it might be necessary to
use the v2 space cache (the free space tree) to avoid performance
problems.

I forget the exact reason why the free space tree is not the default;
I think it has to do with missing repair support?


-- 
Chris Murphy


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 22:26           ` Chris Murphy
@ 2018-05-18 22:46             ` Omar Sandoval
  0 siblings, 0 replies; 18+ messages in thread
From: Omar Sandoval @ 2018-05-18 22:46 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Austin S. Hemmelgarn, Niccolò Belli, David Sterba, Btrfs BTRFS

On Fri, May 18, 2018 at 04:26:16PM -0600, Chris Murphy wrote:
> On Fri, May 18, 2018 at 12:33 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> > On 2018-05-18 13:18, Niccolò Belli wrote:
> >>
> >> On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
> >>>
> >>> and also forces the people who have ridiculous numbers of snapshots to
> >>> deal with the memory usage or never defrag
> >>
> >>
> >> Whoever has at least one snapshot is never going to defrag anyway, unless
> >> he is willing to double the used space.
> >>
> > With a bit of work, it's possible to handle things sanely.  You can
> > deduplicate data from snapshots, even if they are read-only (you need to
> > pass the `-A` option to duperemove and run it as root), so it's perfectly
> > reasonable to only defrag the main subvolume, and then deduplicate the
> > snapshots against that (so that they end up all being reflinks to the main
> > subvolume).  Of course, this won't work if you're short on space, but if
> > you're dealing with snapshots, you should have enough space that this will
> > work (because even without defrag, it's fully possible for something to
> > cause the snapshots to suddenly take up a lot more space).
> 
> 
> Curiously, snapshot aware defragmentation is going to increase free
> space fragmentation. For busy in-use systems, it might be necessary to
> use space cache v2 to avoid performance problems.
> 
> I forget the exact reason why the free space tree is not the default,
> I think it has to do with missing repair support?

Yeah, Nikolay is working on that.


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 17:10     ` Austin S. Hemmelgarn
  2018-05-18 17:18       ` Niccolò Belli
@ 2018-05-18 23:55       ` Tomasz Pala
  2018-05-19  8:56         ` Niccolò Belli
  2018-05-21 17:43       ` David Sterba
  2 siblings, 1 reply; 18+ messages in thread
From: Tomasz Pala @ 2018-05-18 23:55 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Niccolò Belli, David Sterba, linux-btrfs

On Fri, May 18, 2018 at 13:10:02 -0400, Austin S. Hemmelgarn wrote:

> Personally though, I think the biggest issue with what was done was not 
> the memory consumption, but the fact that there was no switch to turn it 
> on or off.  Making defrag unconditionally snapshot aware removes one of 
> the easiest ways to forcibly unshare data without otherwise altering the 

The "defrag only not-snapshotted data" mode would be enough for many
use cases and wouldn't require more RAM. One could run this before
taking a snapshot and merge _at least_ the new data.

And even with the current approach it should be possible to interlace
defragmentation with some kind of naive deduplication; "naive" in the
sense of comparing blocks only within the same in-subvolume paths.

-- 
Tomasz Pala <gotar@pld-linux.org>


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 18:33         ` Austin S. Hemmelgarn
  2018-05-18 22:26           ` Chris Murphy
@ 2018-05-19  8:54           ` Niccolò Belli
  2018-05-21 13:15             ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 18+ messages in thread
From: Niccolò Belli @ 2018-05-19  8:54 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: David Sterba, linux-btrfs

On Friday, 18 May 2018 at 20:33:53 CEST, Austin S. Hemmelgarn wrote:
> With a bit of work, it's possible to handle things sanely.  You 
> can deduplicate data from snapshots, even if they are read-only 
> (you need to pass the `-A` option to duperemove and run it as 
> root), so it's perfectly reasonable to only defrag the main 
> subvolume, and then deduplicate the snapshots against that (so 
> that they end up all being reflinks to the main subvolume).  Of 
> course, this won't work if you're short on space, but if you're 
> dealing with snapshots, you should have enough space that this 
> will work (because even without defrag, it's fully possible for 
> something to cause the snapshots to suddenly take up a lot more 
> space).

Been there, tried that. Unfortunately, even if I skip the defrag, a simple

duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs

is going to eat more space than was previously available (probably due
to autodefrag?).

Niccolò


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 23:55       ` Tomasz Pala
@ 2018-05-19  8:56         ` Niccolò Belli
       [not found]           ` <20180520105928.GA17117@polanet.pl>
  0 siblings, 1 reply; 18+ messages in thread
From: Niccolò Belli @ 2018-05-19  8:56 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Austin S. Hemmelgarn, David Sterba, linux-btrfs

On Saturday, 19 May 2018 at 01:55:30 CEST, Tomasz Pala wrote:
> The "defrag only not-snapshotted data" mode would be enough for many
> use cases and wouldn't require more RAM. One could run this before
> taking a snapshot and merge _at least_ the new data.

snapper users with hourly snapshots will not have any use for it.


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-19  8:54           ` Niccolò Belli
@ 2018-05-21 13:15             ` Austin S. Hemmelgarn
  2018-05-21 13:42               ` Timofey Titovets
  0 siblings, 1 reply; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-21 13:15 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: David Sterba, linux-btrfs

On 2018-05-19 04:54, Niccolò Belli wrote:
> On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
>> With a bit of work, it's possible to handle things sanely.  You can 
>> deduplicate data from snapshots, even if they are read-only (you need 
>> to pass the `-A` option to duperemove and run it as root), so it's 
>> perfectly reasonable to only defrag the main subvolume, and then 
>> deduplicate the snapshots against that (so that they end up all being 
>> reflinks to the main subvolume).  Of course, this won't work if you're 
>> short on space, but if you're dealing with snapshots, you should have 
>> enough space that this will work (because even without defrag, it's 
>> fully possible for something to cause the snapshots to suddenly take 
>> up a lot more space).
> 
> Been there, tried that. Unfortunately even if I skip the defreg a simple
> 
> duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs
> 
> is going to eat more space than it was previously available (probably 
> due to autodefrag?).
It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
ioctl).  There are two things involved here:

* BTRFS has somewhat odd and inefficient handling of partial extents. 
When part of an extent becomes unused (because of a CLONE ioctl, or an 
EXTENT_SAME ioctl, or something similar), that part stays allocated 
until the whole extent would be unused.
* You're using the default deduplication block size (128k), which is 
larger than your filesystem block size (which is at most 64k, most 
likely 16k, but might be 4k if it's an old filesystem), so deduplicating 
can split extents.

Because of this, if a duplicate region happens to overlap the front of
an already shared extent, and the end of said shared extent isn't
aligned with the deduplication block size, the EXTENT_SAME call will
deduplicate the first part, creating a new shared extent, but not the
tail end of the existing shared region. All of that original shared
region will stick around, taking up extra space that it wasn't using
before.

Additionally, if only part of an extent is duplicated, then that area of 
the extent will stay allocated, because the rest of the extent is still 
referenced (so you won't necessarily see any actual space savings).

You can mitigate this by telling duperemove to use the same block size 
as your filesystem using the `-b` option.   Note that using a smaller 
block size will also slow down the deduplication process and greatly 
increase the size of the hash file.
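
For instance, something along these lines (illustrative only; -b 4096
assumes the usual 4k block size, and the hash file will grow
accordingly):

  # Compare allocated vs. shared space before and after.
  btrfs filesystem du -s /mnt/@
  duperemove -rdh -A -b 4096 --hashfile=rootfs.hash /mnt/@ /mnt/snapshots
  btrfs filesystem du -s /mnt/@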


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-21 13:15             ` Austin S. Hemmelgarn
@ 2018-05-21 13:42               ` Timofey Titovets
  2018-05-21 15:38                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 18+ messages in thread
From: Timofey Titovets @ 2018-05-21 13:42 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: darkbasic, David Sterba, linux-btrfs

On Mon, 21 May 2018 at 16:16, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:

> On 2018-05-19 04:54, Niccolò Belli wrote:
> > On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
> >> With a bit of work, it's possible to handle things sanely.  You can
> >> deduplicate data from snapshots, even if they are read-only (you need
> >> to pass the `-A` option to duperemove and run it as root), so it's
> >> perfectly reasonable to only defrag the main subvolume, and then
> >> deduplicate the snapshots against that (so that they end up all being
> >> reflinks to the main subvolume).  Of course, this won't work if you're
> >> short on space, but if you're dealing with snapshots, you should have
> >> enough space that this will work (because even without defrag, it's
> >> fully possible for something to cause the snapshots to suddenly take
> >> up a lot more space).
> >
> > Been there, tried that. Unfortunately even if I skip the defreg a simple
> >
> > duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs
> >
> > is going to eat more space than it was previously available (probably
> > due to autodefrag?).
> It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
> ioctl).  There's two things involved here:

> * BTRFS has somewhat odd and inefficient handling of partial extents.
> When part of an extent becomes unused (because of a CLONE ioctl, or an
> EXTENT_SAME ioctl, or something similar), that part stays allocated
> until the whole extent would be unused.
> * You're using the default deduplication block size (128k), which is
> larger than your filesystem block size (which is at most 64k, most
> likely 16k, but might be 4k if it's an old filesystem), so deduplicating
> can split extents.

That 16k/64k figure is the metadata node (leaf) size, not the fs block
size; the btrfs fs block size currently equals the machine page size.

> Because of this, if a duplicate region happens to overlap the front of
> an already shared extent, and the end of said shared extent isn't
> aligned with the deduplication block size, the EXTENT_SAME call will
> deduplicate the first part, creating a new shared extent, but not the
> tail end of the existing shared region, and all of that original shared
> region will stick around, taking up extra space that it wasn't before.

> Additionally, if only part of an extent is duplicated, then that area of
> the extent will stay allocated, because the rest of the extent is still
> referenced (so you won't necessarily see any actual space savings).

> You can mitigate this by telling duperemove to use the same block size
> as your filesystem using the `-b` option.   Note that using a smaller
> block size will also slow down the deduplication process and greatly
> increase the size of the hash file.

duperemove's -b only controls how the data is hashed, nothing more or
less, and it only supports block sizes from 4KiB to 1MiB.

The dedup block size does change the efficiency of deduplication: the
number of hash/block pairs changes, which changes the hash file size
and the time complexity.

Let's assume that 'A' is 1KiB of data, so 'AAAA' is 4KiB of a repeated
pattern.

So, for example, you have two files of 2x4KiB each:
1: 'AAAABBBB'
2: 'BBBBAAAA'

With -b 8KiB the hash of the first file is not the same as the
second's, but with -b 4KiB duperemove will see both 'AAAA' and 'BBBB',
and those blocks will be deduped.
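
An untested toy sketch of that (assuming a scratch btrfs mount at
/mnt/test and 4KiB blocks):

  # Build the two 8KiB test files from 4KiB runs of 'A' and 'B'.
  yes A | tr -d '\n' | head -c 4096 > /tmp/blkA
  yes B | tr -d '\n' | head -c 4096 > /tmp/blkB
  cat /tmp/blkA /tmp/blkB > /mnt/test/f1    # 'AAAABBBB'
  cat /tmp/blkB /tmp/blkA > /mnt/test/f2    # 'BBBBAAAA'
  sync
  # With 8KiB hash blocks the file hashes differ, nothing is deduped.
  duperemove -dh -b 8192 /mnt/test/f1 /mnt/test/f2
  # With 4KiB hash blocks both 4KiB halves match and get deduped.
  duperemove -dh -b 4096 /mnt/test/f1 /mnt/test/f2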

Also, duperemove has two modes of deduping:
1. By extents
2. By blocks

Thanks.

--
Have a nice day,
Timofey.


* Re: Any chance to get snapshot-aware defragmentation?
       [not found]           ` <20180520105928.GA17117@polanet.pl>
@ 2018-05-21 13:49             ` Niccolò Belli
  0 siblings, 0 replies; 18+ messages in thread
From: Niccolò Belli @ 2018-05-21 13:49 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-btrfs

On Sunday, 20 May 2018 at 12:59:28 CEST, Tomasz Pala wrote:
> On Sat, May 19, 2018 at 10:56:32 +0200, Niccolò Belli wrote:
>> snapper users with hourly snapshots will not have any use for it.
> Anyone with hourly snapshots is doomed anyway.

I do not agree: having hourly snapshots doesn't mean you cannot limit
snapshots to a reasonable number. In fact, you can simply keep a dozen
of them and discard the older ones.
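
For example, with snapper something along these lines should do it
(syntax may differ between snapper versions; the numbers are just an
illustration):

  # Keep roughly a dozen hourly snapshots and let the timeline cleanup
  # discard anything older.
  snapper -c root set-config TIMELINE_LIMIT_HOURLY=12 \
      TIMELINE_LIMIT_DAILY=0 TIMELINE_LIMIT_WEEKLY=0 \
      TIMELINE_LIMIT_MONTHLY=0 TIMELINE_LIMIT_YEARLY=0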


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-21 13:42               ` Timofey Titovets
@ 2018-05-21 15:38                 ` Austin S. Hemmelgarn
  2018-06-01  3:19                   ` Zygo Blaxell
  0 siblings, 1 reply; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-21 15:38 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: darkbasic, David Sterba, linux-btrfs

On 2018-05-21 09:42, Timofey Titovets wrote:
> пн, 21 мая 2018 г. в 16:16, Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> 
>> On 2018-05-19 04:54, Niccolò Belli wrote:
>>> On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
>>>> With a bit of work, it's possible to handle things sanely.  You can
>>>> deduplicate data from snapshots, even if they are read-only (you need
>>>> to pass the `-A` option to duperemove and run it as root), so it's
>>>> perfectly reasonable to only defrag the main subvolume, and then
>>>> deduplicate the snapshots against that (so that they end up all being
>>>> reflinks to the main subvolume).  Of course, this won't work if you're
>>>> short on space, but if you're dealing with snapshots, you should have
>>>> enough space that this will work (because even without defrag, it's
>>>> fully possible for something to cause the snapshots to suddenly take
>>>> up a lot more space).
>>>
>>> Been there, tried that. Unfortunately even if I skip the defreg a simple
>>>
>>> duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs
>>>
>>> is going to eat more space than it was previously available (probably
>>> due to autodefrag?).
>> It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
>> ioctl).  There's two things involved here:
> 
>> * BTRFS has somewhat odd and inefficient handling of partial extents.
>> When part of an extent becomes unused (because of a CLONE ioctl, or an
>> EXTENT_SAME ioctl, or something similar), that part stays allocated
>> until the whole extent would be unused.
>> * You're using the default deduplication block size (128k), which is
>> larger than your filesystem block size (which is at most 64k, most
>> likely 16k, but might be 4k if it's an old filesystem), so deduplicating
>> can split extents.
> 
> That's a metadata node leaf != fs block size.
> btrfs fs block size == machine page size currently.
You're right, I keep forgetting about that (probably because BTRFS is 
pretty much the only modern filesystem that doesn't let you change the 
block size).
> 
>> Because of this, if a duplicate region happens to overlap the front of
>> an already shared extent, and the end of said shared extent isn't
>> aligned with the deduplication block size, the EXTENT_SAME call will
>> deduplicate the first part, creating a new shared extent, but not the
>> tail end of the existing shared region, and all of that original shared
>> region will stick around, taking up extra space that it wasn't before.
> 
>> Additionally, if only part of an extent is duplicated, then that area of
>> the extent will stay allocated, because the rest of the extent is still
>> referenced (so you won't necessarily see any actual space savings).
> 
>> You can mitigate this by telling duperemove to use the same block size
>> as your filesystem using the `-b` option.   Note that using a smaller
>> block size will also slow down the deduplication process and greatly
>> increase the size of the hash file.
> 
> duperemove -b control "how hash data", not more or less and only support
> 4KiB..1MiB
And you can only deduplicate the data at the granularity you hashed it 
at.  In particular:

* The total size of a region being deduplicated has to be an exact 
multiple of the hash block size (what you pass to `-b`).  So for the 
default 128k size, you can only deduplicate regions that are multiples 
of 128k long (128k, 256k, 384k, 512k, etc).   This is a simple limit 
derived from how blocks are matched for deduplication.
* Because duperemove uses fixed hash blocks (as opposed to using a 
rolling hash window like many file synchronization tools do), the 
regions being deduplicated also have to be exactly aligned to the hash 
block size.  So, with the default 128k size, you can only deduplicate 
regions starting at 0k, 128k, 256k, 384k, 512k, etc, but not ones 
starting at, for example, 64k into the file.
> 
> And size of block for dedup will change efficiency of deduplication,
> when count of hash-block pairs, will change hash file size and time
> complexity.
> 
> Let's assume that: 'A' - 1KiB of data 'AAAA' - 4KiB with repeated pattern.
> 
> So, example, you have 2 of 2x4KiB blocks:
> 1: 'AAAABBBB'
> 2: 'BBBBAAAA'
> 
> With -b 8KiB hash of first block not same as second.
> But with -b 4KiB duperemove will see both 'AAAA' and 'BBBB'
> And then that blocks will be deduped.
This supports what I'm saying though.  Your deduplication granularity is 
bounded by your hash granularity.  If in addition to the above you have 
a file that looks like:

AABBBBAA

It would not get deduplicated against the first two at either `-b 4k`
or `-b 8k`, despite the middle 4k of the file being an exact duplicate
of the final 4k of the first file and the first 4k of the second one.

If instead you have:

AABBBBBB

And the final 6k is a single on-disk extent, that extent will get split 
when you go to deduplicate against the first two files with a 4k block 
size because only the final 4k can be deduplicated, and the entire 6k 
original extent will stay completely allocated.
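
One way to watch this on a scratch filesystem (a hypothetical sketch;
f1/f2 stand for the AAAABBBB/BBBBAAAA files and f3 for the AABBBBBB
one, and the exact output varies by kernel and tool version):

  filefrag -v f3    # note the single 6k extent at the end of the file
  duperemove -dh -b 4096 f1 f2 f3
  filefrag -v f3    # the ref is split: the last 4k now points into a
                    # shared extent, the 2k before it into the old one
  df /mnt/test      # yet no space is freed -- the old 6k extent is
                    # still partially referenced, so it stays allocated
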
> 
> Even, duperemove have 2 modes of deduping:
> 1. By extents
> 2. By blocks
Yes, you can force it to not collapse runs of duplicate blocks into
single extents, but that doesn't matter for this at all: you are still
limited by your hash granularity.


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-18 17:10     ` Austin S. Hemmelgarn
  2018-05-18 17:18       ` Niccolò Belli
  2018-05-18 23:55       ` Tomasz Pala
@ 2018-05-21 17:43       ` David Sterba
  2018-05-21 19:22         ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 18+ messages in thread
From: David Sterba @ 2018-05-21 17:43 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Niccolò Belli, linux-btrfs, gotar

On Fri, May 18, 2018 at 01:10:02PM -0400, Austin S. Hemmelgarn wrote:
> On 2018-05-18 12:36, Niccolò Belli wrote:
> > On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote:
> >> Josef started working on that in 2014 and did not finish it. The patches
> >> can be still found in his tree. The problem is in excessive memory
> >> consumption when there are many snapshots that need to be tracked during
> >> the defragmentation, so there are measures to avoid OOM. There's
> >> infrastructure ready for use (shrinkers), there are maybe some problems
> >> but fundamentally is should work.
> >>
> >> I'd like to get the snapshot-aware working again too, we'd need to find
> >> a volunteer to resume the work on the patchset.
> > 
> > Yeah I know of Josef's work, but 4 years had passed since then without 
> > any news on this front.
> > 
> > What I would really like to know is why nobody resumed his work: is it 
> > because it's impossible to implement snapshot-aware degram without 
> > excessive ram usage or is it simply because nobody is interested?
> I think it's because nobody who is interested has both the time and the 
> coding skills to tackle it.
> 
> Personally though, I think the biggest issue with what was done was not 
> the memory consumption, but the fact that there was no switch to turn it 
> on or off.  Making defrag unconditionally snapshot aware removes one of 
> the easiest ways to forcibly unshare data without otherwise altering the 
> files (which, as stupid as it sounds, is actually really useful for some 
> storage setups), and also forces the people who have ridiculous numbers 
> of snapshots to deal with the memory usage or never defrag.

Good points. The logic of the sharing awareness is a technical detail;
what's being discussed is the use case, and I think this would be good
to clarify.

1) always -- the old (and now disabled) way, unconditionally (ie. no
   option for the user), problems with memory consumption

2) more fine grained:

2.1) defragment only the non-shared extents, ie. no sharing awareness
     needed, shared extents will be silently skipped

2.2) defragment only within the given subvolume -- like 1) but by user's choice

The naive dedup that Tomasz (CCed) mentions in another mail would
probably be beyond the purpose of defrag and would make things more
complicated.

I'd vote for keeping the complexity of the ioctl interface and the
defrag implementation low, so if it's simply a matter of saying "do
forcible defrag" or "skip shared", then it sounds OK.

If there's e.g. "keep sharing only on this <list> of subvolumes", then
it would need to read the snapshot ids from the ioctl structure, then
enumerate all extent owners and do some magic to unshare/defrag/share.
That's a quick idea; lots of details would need to be clarified.


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-21 17:43       ` David Sterba
@ 2018-05-21 19:22         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-21 19:22 UTC (permalink / raw)
  To: dsterba, Niccolò Belli, linux-btrfs, gotar

On 2018-05-21 13:43, David Sterba wrote:
> On Fri, May 18, 2018 at 01:10:02PM -0400, Austin S. Hemmelgarn wrote:
>> On 2018-05-18 12:36, Niccolò Belli wrote:
>>> On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote:
>>>> Josef started working on that in 2014 and did not finish it. The patches
>>>> can be still found in his tree. The problem is in excessive memory
>>>> consumption when there are many snapshots that need to be tracked during
>>>> the defragmentation, so there are measures to avoid OOM. There's
>>>> infrastructure ready for use (shrinkers), there are maybe some problems
>>>> but fundamentally is should work.
>>>>
>>>> I'd like to get the snapshot-aware working again too, we'd need to find
>>>> a volunteer to resume the work on the patchset.
>>>
>>> Yeah I know of Josef's work, but 4 years had passed since then without
>>> any news on this front.
>>>
>>> What I would really like to know is why nobody resumed his work: is it
>>> because it's impossible to implement snapshot-aware degram without
>>> excessive ram usage or is it simply because nobody is interested?
>> I think it's because nobody who is interested has both the time and the
>> coding skills to tackle it.
>>
>> Personally though, I think the biggest issue with what was done was not
>> the memory consumption, but the fact that there was no switch to turn it
>> on or off.  Making defrag unconditionally snapshot aware removes one of
>> the easiest ways to forcibly unshare data without otherwise altering the
>> files (which, as stupid as it sounds, is actually really useful for some
>> storage setups), and also forces the people who have ridiculous numbers
>> of snapshots to deal with the memory usage or never defrag.
> 
> Good points. The logic of the sharing-aware is a technical detail,
> what's being discussed is the usecase and I think this would be good to
> clarify.
> 
> 1) always -- the old (and now disabled) way, unconditionally (ie. no
>     option for the user), problems with memory consumption
> 
> 2) more fine grained:
> 
> 2.1) defragment only the non-shared extents, ie. no sharing awareness
>       needed, shared extents will be silently skipped
> 
> 2.2) defragment only within the given subvolume -- like 1) but by user's choice
> 
> The naive dedup, that Tomasz (CCed) mentions in another mail, would be
> probably beyond the defrag purpose and would make things more
> complicated.
> 
> I'd vote for keeping complexity of the ioctl interface and defrag
> implementation low, so if it's simply saying "do forcible defrag" or
> "skip shared", then it sounds ok.
> 
> If there's eg. "keep sharing only on this <list> subvolunes", then it
> would need to read the snapshot ids from ioctl structure, then enumerate
> all extent owners and do some magic to unshare/defrag/share. That's a
> quick idea, lots of details would need to be clarified.
> 
From my perspective, I see two things to consider that are somewhat
orthogonal to each other:

1. Whether to recurse into subvolumes or not (IIRC, we currently do not 
do so, because we see them like a mount point).
2. Whether to use the simple (not reflink-aware) defrag, the reflink 
aware one, or to base it on the extent/file type (use old simpler one 
for unshared extents, and new reflink aware one for shared extents).

This second set of options is what I'd like to see the most (possibly 
without the option to base it on file or extent sharing automatically), 
though the first one would be nice to have.

Better yet, having that second set of options and making the new 
reflink-aware defrag opt-in would allow people who really want it to use 
it, and those of us who don't need it for our storage setups to not need 
to worry about it.


* Re: Any chance to get snapshot-aware defragmentation?
  2018-05-21 15:38                 ` Austin S. Hemmelgarn
@ 2018-06-01  3:19                   ` Zygo Blaxell
  0 siblings, 0 replies; 18+ messages in thread
From: Zygo Blaxell @ 2018-06-01  3:19 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Timofey Titovets, darkbasic, David Sterba, linux-btrfs


On Mon, May 21, 2018 at 11:38:28AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-05-21 09:42, Timofey Titovets wrote:
> > пн, 21 мая 2018 г. в 16:16, Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> > > On 2018-05-19 04:54, Niccolò Belli wrote:
> > > > On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
> > > > > With a bit of work, it's possible to handle things sanely.  You can
> > > > > deduplicate data from snapshots, even if they are read-only (you need
> > > > > to pass the `-A` option to duperemove and run it as root), so it's
> > > > > perfectly reasonable to only defrag the main subvolume, and then
> > > > > deduplicate the snapshots against that (so that they end up all being
> > > > > reflinks to the main subvolume).  Of course, this won't work if you're
> > > > > short on space, but if you're dealing with snapshots, you should have
> > > > > enough space that this will work (because even without defrag, it's
> > > > > fully possible for something to cause the snapshots to suddenly take
> > > > > up a lot more space).
> > > > 
> > > > Been there, tried that. Unfortunately even if I skip the defreg a simple
> > > > 
> > > > duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs
> > > > 
> > > > is going to eat more space than it was previously available (probably
> > > > due to autodefrag?).
> > > It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
> > > ioctl).  There's two things involved here:
> > 
> > > * BTRFS has somewhat odd and inefficient handling of partial extents.
> > > When part of an extent becomes unused (because of a CLONE ioctl, or an
> > > EXTENT_SAME ioctl, or something similar), that part stays allocated
> > > until the whole extent would be unused.
> > > * You're using the default deduplication block size (128k), which is
> > > larger than your filesystem block size (which is at most 64k, most
> > > likely 16k, but might be 4k if it's an old filesystem), so deduplicating
> > > can split extents.
> > 
> > That's a metadata node leaf != fs block size.
> > btrfs fs block size == machine page size currently.
> You're right, I keep forgetting about that (probably because BTRFS is pretty
> much the only modern filesystem that doesn't let you change the block size).
> > 
> > > Because of this, if a duplicate region happens to overlap the front of
> > > an already shared extent, and the end of said shared extent isn't
> > > aligned with the deduplication block size, the EXTENT_SAME call will
> > > deduplicate the first part, creating a new shared extent, but not the
> > > tail end of the existing shared region, and all of that original shared
> > > region will stick around, taking up extra space that it wasn't before.
> > 
> > > Additionally, if only part of an extent is duplicated, then that area of
> > > the extent will stay allocated, because the rest of the extent is still
> > > referenced (so you won't necessarily see any actual space savings).
> > 
> > > You can mitigate this by telling duperemove to use the same block size
> > > as your filesystem using the `-b` option.   Note that using a smaller
> > > block size will also slow down the deduplication process and greatly
> > > increase the size of the hash file.
> > 
> > duperemove -b control "how hash data", not more or less and only support
> > 4KiB..1MiB
> And you can only deduplicate the data at the granularity you hashed it at.
> In particular:
> 
> * The total size of a region being deduplicated has to be an exact multiple
> of the hash block size (what you pass to `-b`).  So for the default 128k
> size, you can only deduplicate regions that are multiples of 128k long
> (128k, 256k, 384k, 512k, etc).   This is a simple limit derived from how
> blocks are matched for deduplication.
> * Because duperemove uses fixed hash blocks (as opposed to using a rolling
> hash window like many file synchronization tools do), the regions being
> deduplicated also have to be exactly aligned to the hash block size.  So,
> with the default 128k size, you can only deduplicate regions starting at 0k,
> 128k, 256k, 384k, 512k, etc, but not ones starting at, for example, 64k into
> the file.
> > 
> > And size of block for dedup will change efficiency of deduplication,
> > when count of hash-block pairs, will change hash file size and time
> > complexity.
> > 
> > Let's assume that: 'A' - 1KiB of data 'AAAA' - 4KiB with repeated pattern.
> > 
> > So, example, you have 2 of 2x4KiB blocks:
> > 1: 'AAAABBBB'
> > 2: 'BBBBAAAA'
> > 
> > With -b 8KiB hash of first block not same as second.
> > But with -b 4KiB duperemove will see both 'AAAA' and 'BBBB'
> > And then that blocks will be deduped.
> This supports what I'm saying though.  Your deduplication granularity is
> bounded by your hash granularity.  If in addition to the above you have a
> file that looks like:
> 
> AABBBAA
> 
> It would not get deduplicated against the first two at either `-b 4k` or `-b
> 8k` despite the middle 4k of the file being an exact duplicate of the final
> 4k of the first file and first 4k of the second one.
> 
> If instead you have:
> 
> AABBBBBB
> 
> And the final 6k is a single on-disk extent, that extent will get split when
> you go to deduplicate against the first two files with a 4k block size
> because only the final 4k can be deduplicated, and the entire 6k original
> extent will stay completely allocated.

It's the extent *ref* (in the subvol) that gets split.  The original
extent *data* (in the extent tree) is never modified, only deleted when
the last ref to any part of the extent data item is removed.  It looks
like there was intent in early btrfs to support splitting the extent data
too, but any code that might actually do that seems to have been removed
(among other things, there are gotchas with compression--you can't simply
truncate a compressed extent without modifying its data).

bees uses 4K block matching to find a common block in both extents, then
searches blocks adjacent to the matching blocks for more duplicates
until a complete extent is found.  This enables bees to ignore the
dedup-block-size/extent-size alignment problem.  This is similar to
a rolling hash window like rsync, but relying on slightly different
assumptions about how duplicate data is distributed through a typical
filesystem.  To get a bigger block size, bees discards block hashes
(e.g. 32K block size = 7 out of 8 4K block hashes discarded) because
bees can find a 32K contiguous duplicate extent with just one 4K hash.

bees replaces the entire extent ref containing a duplicate block with
reflinks to duplicate blocks in other extents.  If some blocks within
an extent are unique, bees creates a duplicate extent containing the
unique data, then dedups the new duplicate blocks over the old ones.
So if you have AABBBBB and AABBBCC, bees will make a copy of CC in a
new extent, then replace AABBBCC with reflinks to AABBB (from AABBBBB)
and the new CC.  This eliminates the entire original AABBBCC extent
from the filesystem.  At the moment bees isn't very smart about that,
which results in increased fragmentation when deduping data with lots
of non-extent-aligned duplication, like VM images and ELF binaries.

In the future bees could combine short extents (not necessarily
duplicates) into larger ones as it goes, making it an integrated
dedup and defrag tool.  This would not really be snapshot-aware per
se--bees would be optimizing the layout of extent data items first,
then rewriting all of the extent ref (subvol/snapshot) trees to point
to the updated extent data items without having to care about whether
the original extent references come from snapshots, clones, or dedup.
I guess you could call that snapshot-agnostic, since a tool could do
this without an understanding of the snapshot concept at all.

Teaching bees that trick is a project I am working on *extremely*
slowly--it's practically a rewrite of bees, and $DAYJOB and home life
keep me stretched pretty thin these days.

> > Even, duperemove have 2 modes of deduping:
> > 1. By extents
> > 2. By blocks
> Yes, you can force it to not collapse runs of duplicate blocks into single
> extents, but that doesn't matter for this at all, you are still limited by
> your hash granularity.


