linux-btrfs.vger.kernel.org archive mirror
* Feature requests: online backup - defrag - change RAID level
@ 2019-09-09  2:55 zedlryqc
  2019-09-09  3:51 ` Qu Wenruo
  0 siblings, 1 reply; 111+ messages in thread
From: zedlryqc @ 2019-09-09  2:55 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone!

I have been programming for a long time (over 20 years), and I am
quite interested in a lot of low-level stuff. In reality I have never
done anything related to kernels or filesystems, but I have done a lot
of assembly, C, OS work, etc.
 
Looking at your project status page (at
https://btrfs.wiki.kernel.org/index.php/Status), I must say that your
priorities don't quite match mine. Of course, opinions usually
differ. It is my opinion that there are some quite essential features
which btrfs is, unfortunately, still missing.

So here is a list of features which I would rate as very important
(for a modern COW filesystem like btrfs), so perhaps you can think
about them at least a little bit.
 
1) Full online backup (or copy, whatever you want to call it)
btrfs backup <filesystem name> <partition name> [-f]
- backs up a btrfs filesystem given by <filesystem name> to a partition
<partition name> (with all subvolumes).
 
- To be performed by creating a new btrfs filesystem in the
destination partition <partition name>, with a new GUID.
- All data from the source filesystem <filesystem name> is then copied
to the destination partition, similar to how RAID1 works.
- The size of the destination partition must be sufficient to hold the
used data from the source filesystem, otherwise the operation fails.
The point is that the destination doesn't have to be as large as the
source, just sufficient to hold the data (of course, many details and
concerns are skipped in this short proposal).
- When the operation completes, the destination partition contains a
fully featured btrfs filesystem that can be mounted and unmounted,
and which is an exact copy of the source filesystem at some point in
time, with all the snapshots and subvolumes of the source filesystem.
- There are two possible implementations of how this operation is
to be performed, depending on whether the destination drive is slower
than the source drive(s) or not (for example, when the destination is
an HDD and the source is an SSD). If the source and the destination
are of similar speed, then a RAID1-like algorithm can be used (all
writes simultaneously go to the source and the destination). This mode
can also be used if the user/admin is willing to tolerate a performance
hit for some relatively short period of time.
The second possible implementation is a bit more complex: it can be
done by creating a temporary snapshot or by buffering all the current
writes until they can be written to the destination drive, but this
implementation is of lower priority (see if you can make the RAID1
implementation work first).
 
2) Sensible defrag
The defrag is currently a joke. If you use defrag then you had better
not use subvolumes/snapshots. That's... very hard to tolerate. Defrag
is quite a necessary feature: it is an operation that should be
performed in many circumstances, and in many cases it is even
automatically initiated. But btrfs defrag is virtually unusable. And
it is unusable where it is most needed, as the presence of subvolumes
will, predictably, increase fragmentation by quite a lot.
 
How to do it:
- The extents must not be unshared, but just shuffled a bit. Unsharing  
the extents is, in most situations, not tolerable.
 
- The defrag should work by doing a full defrag of one 'selected
subvolume' (which can be selected by the user, or it can be guessed
because the user probably wants to defrag the currently mounted
subvolume, or the default subvolume). The other subvolumes should then
share data (shared extents) with the 'selected subvolume' (as much as
possible).
 
- If you want it even more featureful and complicated, then you
could allow the user to specify a list of selected subvolumes, like:
subvol1, subvol2, subvol3, etc., and the defrag algorithm then defrags
subvol1 in full, then subvol2 as much as possible while not changing
subvol1 and at the same time sharing extents with subvol1, then defrags
subvol3 while not changing subvol1 and subvol2, etc.
 
- I think it would be wrong to use a general deduplication algorithm
for this. Instead, the information about the shared extents should be
analyzed given the starting state of the filesystem, and then the
algorithm should produce an optimal solution based on the currently
shared extents.
 
Deduplication is a different task.
 
3) Downgrade to 'single' or 'DUP' (also, general easy way to switch  
between RAID levels)
 
Currently, as far as I can gather, the user has to do a "btrfs balance
start -dconvert=single -mconvert=single", then delete a drive, which is
a somewhat ridiculous sequence of operations.
 
Can you do something like "btrfs delete", but such that it also  
simultaneously converts to 'single', or some other chosen RAID level?
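
For reference, here is roughly what the current two-step workflow looks
like on an assumed two-device RAID1 filesystem mounted at /mnt (device
names are only examples):

	# step 1: convert all chunks to a single-device profile
	btrfs balance start -dconvert=single -mconvert=single /mnt
	# step 2: remove the unwanted device (relocates any data still on it)
	btrfs device remove /dev/sdb /mnt

A hypothetical combined command might look like
'btrfs device remove --convert=single /dev/sdb /mnt' (that flag does
not exist today; it is what this request is about).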
 
## I hope that you will consider my suggestions, and I hope that I'm
helpful (although, I guess, the short time I spent working with btrfs
and writing this mail cannot compare with the amount of work you are
putting into it). Perhaps teams sometimes need a different
perspective, an outsider's perspective, in order to better understand
the situation.
 
So long!


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09  2:55 Feature requests: online backup - defrag - change RAID level zedlryqc
@ 2019-09-09  3:51 ` Qu Wenruo
  2019-09-09 11:25   ` zedlryqc
  0 siblings, 1 reply; 111+ messages in thread
From: Qu Wenruo @ 2019-09-09  3:51 UTC (permalink / raw)
  To: zedlryqc, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4263 bytes --]



On 2019/9/9 10:55 AM, zedlryqc@server53.web-hosting.com wrote:
> Hello everyone!
> 
[...]
>  
> 1) Full online backup (or copy, whatever you want to call it)
> btrfs backup <filesystem name> <partition name> [-f]
> - backups a btrfs filesystem given by <filesystem name> to a partition
> <partition name> (with all subvolumes).

Why not just btrfs send?

Or do you want to keep the whole subvolume structures/layout?

>  
> - To be performed by creating a new btrfs filesystem in the destination
> partition <partition name>, with a new GUID.

I'd say the current send/receive is more flexible.
And you also need to understand that btrfs also integrates volume
management, thus it's not just <partition name>; you also need a RAID
level and things like that.


> - All data from the source filesystem <filesystem name> is than copied
> to the destination partition, similar to how RAID1 works.
> - The size of the destination partition must be sufficient to hold the
> used data from the source filesystem, otherwise the operation fails. The
> point is that the destination doesn't have to be as large as source,
> just sufficient to hold the data (of course, many details and concerns
> are skipped in this short proposal)

All can be done already by send/receive, although at subvolume level.

Please check if send/receive is suitable for your use case.

[...]
>  
> 2) Sensible defrag
> The defrag is currently a joke. If you use defrag than you better not
> use subvolumes/snapshots. That's... very… hard to tolerate. Quite a
> necessary feature. I mean, defrag is an operation that should be
> performed in many circumstances, and in many cases it is even
> automatically initiated. But, btrfs defrag is virtually unusable. And,
> it is unusable where it is most needed, as the presence of subvolumes
> will, predictably, increase fragmentation by quite a lot.
>  
> How to do it:
> - The extents must not be unshared, but just shuffled a bit. Unsharing
> the extents is, in most situations, not tolerable.

I definitely see cases where unsharing extents makes sense, so at least
we should let the user determine what they want.

>  
> - The defrag should work by doing a full defrag of one 'selected
> subvolume' (which can be selected by user, or it can be guessed because
> the user probably wants to defrag the currently mounted subvolume, or
> default subvolume). The other subvolumes should than share data (shared
> extents) with the 'selected subvolume' (as much as possible).

What's wrong with the current file-based defrag?
If you want to defrag a subvolume, just iterate through all its files.

>  
> - I think it would be wrong to use a general deduplication algorithm for
> this. Instead, the information about the shared extents should be
> analyzed given the starting state of the filesystem, and than the
> algorithm should produce an optimal solution based on the currently
> shared extents.

Please be more specific, like giving an example for it.

>  
> Deduplication is a different task.
>  
> 3) Downgrade to 'single' or 'DUP' (also, general easy way to switch
> between RAID levels)
>  
> Currently, as much as I gather, user has to do a "btrfs balance start
> -dconvert=single -mconvert=single
> ", than delete a drive, which is a bit ridiculous sequence of operations.
>  
> Can you do something like "btrfs delete", but such that it also
> simultaneously converts to 'single', or some other chosen RAID level?

That's a shortcut for a chunk profile change.
My first thought is that it could cause more problems than benefits.
(It only benefits profile downgrades, thus it only makes sense for
RAID1->SINGLE, DUP->SINGLE, and RAID10->RAID0, nothing else.)

I still prefer the safer allocate-new-chunk way to convert chunks, even
at a cost of extra IO.

Thanks,
Qu

>  
> ## I hope that you will consider my suggestions, I hope that I'm helpful
> (although, I guess, the short time I spent working with btrfs and
> writing this mail can not compare with the amount of work you are
> putting into it). Perhaps, teams sometimes need a different perspective,
> outsiders perspective, in order to better understand the situation.
>  
> So long!
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09  3:51 ` Qu Wenruo
@ 2019-09-09 11:25   ` zedlryqc
  2019-09-09 12:18     ` Qu Wenruo
  2019-09-10 11:12     ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 111+ messages in thread
From: zedlryqc @ 2019-09-09 11:25 UTC (permalink / raw)
  To: linux-btrfs


Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>> 1) Full online backup (or copy, whatever you want to call it)
>> btrfs backup <filesystem name> <partition name> [-f]
>> - backups a btrfs filesystem given by <filesystem name> to a partition
>> <partition name> (with all subvolumes).
>
> Why not just btrfs send?
>
> Or you want to keep the whole subvolume structures/layout?

Yes, I want to keep the whole subvolume structures/layout. I want to  
keep everything. Usually, when I want to backup a partition, I want to  
keep everything, and I suppose most other people have a similar idea.

> I'd say current send/receive is more flex.

Um, 'flexibility' has nothing to do with it. Send/receive is a
completely different use case.
So, each one has some benefits and some drawbacks, but 'send/receive'
cannot replace 'full online backup'.

Here is where send/receive is lacking:
	- too complicated to do if many subvolumes are involved
	- may require recursive subvolume enumeration in order to emulate  
'full online backup'
	- may require extra storage space
	- is not mountable, not easy to browse the backup contents
	- not easy to recover just a few selected files from backup
There are probably more areas where send/receive is lacking, but I
think I have given a sufficient number of important differences which
show that send/receive cannot successfully replace the functionality
of 'full online backup'.

> And you also needs to understand btrfs also integrates volume
> management, thus it's not just <partition name>, you also needs RAID
> level and things like that.

This is a minor point. So, please, let's not get into too many  
irrelevant details here.

There can be a sensible default of 'single data, DUP metadata', and a
way for the user to override this default, but that feature is
not so important. If the user wants to change the RAID level, he can
easily do it later by mounting the backup.

>
> All can be done already by send/receive, although at subvolume level.

Yeah, maybe I should manually type it all for all subvolumes, one by
one. I must also be careful to do it in the correct order if I want it
not to consume extra space.
And the backup is not mountable.

This proposal (workaround) of yours appears to me to be too complicated,
too error-prone, and missing important features.

But, I just thought, you can actually emulate 'full online backup'
with this send/receive. Here is how.
You write a script which does the following:
	- makes a temporary read-only snapshot of every subvolume
	- uses 'btrfs send' to send all the temporary snapshots, on-the-fly
(maybe via a pipe), in the correct order, to a process running 'btrfs
receive', which should then immediately write it all to the
destination partition. All the buffers can stay in memory.
	- when all the snapshots are received and written to the destination,
fixes the subvolume IDs
	- deletes the temporary snapshots from the source
Of course, this script should then be a part of the standard btrfs
tools; a rough sketch follows.
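
A minimal sketch of the per-subvolume loop (mount points and subvolume
names are placeholders; error handling, parent ordering for incremental
sends, and the subvolume-ID fix-up are left out):

	SRC=/mnt/source                   # source filesystem, mounted read-write
	DST=/mnt/backup                   # freshly created destination filesystem, mounted
	for sub in root home var; do      # placeholder list of subvolumes
		btrfs subvolume snapshot -r "$SRC/$sub" "$SRC/.tmp-$sub"   # frozen view to send
		btrfs send "$SRC/.tmp-$sub" | btrfs receive "$DST"
		btrfs subvolume delete "$SRC/.tmp-$sub"                    # drop the temporary snapshot
	done

Note that the received subvolumes arrive read-only; making them writable
again (e.g. with 'btrfs property set ... ro false') and restoring the
original layout is exactly the extra work the proposed 'btrfs backup'
would avoid.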

> Please check if send/receive is suitable for your use case.

No. Absolutely not.


>> 2) Sensible defrag
>> The defrag is currently a joke.

>> How to do it:
>> - The extents must not be unshared, but just shuffled a bit. Unsharing
>> the extents is, in most situations, not tolerable.

> I definitely see cases unsharing extents makes sense, so at least we
> should let user to determine what they want.

Maybe there are such cases, but I would say that a vast majority of  
users (99,99%) in a vast majority of cases (99,99%) don't want the  
defrag operation to reduce free disk space.

> What's wrong with current file based defrag?
> If you want to defrag a subvolume, just iterate through all files.

I repeat: The defrag should not decrease free space. That's the  
'normal' expectation.

>> - I think it would be wrong to use a general deduplication algorithm for
>> this. Instead, the information about the shared extents should be
>> analyzed given the starting state of the filesystem, and than the
>> algorithm should produce an optimal solution based on the currently
>> shared extents.
>
> Please be more specific, like giving an example for it.

Let's say that there is a file FFF with extents e11, e12, e13, e22,  
e23, e33, e34
- in subvolA the file FFF consists of e11, e12, e13
- in subvolB the file FFF consists of e11, e22, e23
- in subvolC the file FFF consists of e11, e22, e33, e34

After defrag, where 'selected subvolume' is subvolC, the extents are  
ordered on disk as follows:

e11,e22,e33,e34 - e23 - e12,e13

In the list above, the comma denotes neighbouring extents, and the dash
indicates that there can be a gap.
As you can see in the list, the file FFF is fully defragmented in
subvolC, since its extents occupy neighbouring disk sectors.


>> 3) Downgrade to 'single' or 'DUP' (also, general easy way to switch
>> between RAID levels)
>>  Currently, as much as I gather, user has to do a "btrfs balance start
>> -dconvert=single -mconvert=single
>> ", than delete a drive, which is a bit ridiculous sequence of operations.

> That's a shortcut for chunk profile change.
> My first idea of this is, it could cause more problem than benefit.
> (It only benefits profile downgrade, thus only makes sense for
> RAID1->SINGLE, DUP->SINGLE, and RAID10->RAID0, nothing else)

Those listed cases are exactly the ones I judge to be most important.  
Three important cases.

> I still prefer the safer allocate-new-chunk way to convert chunks, even
> at a cost of extra IO.

I don't mind whether it allocates new chunks or not. It is better, in  
my opinion, if new chunks are not allocated, but both ways are  
essentially OK.

What I am complaining about is that at one point in time, after
issuing the command:
	btrfs balance start -dconvert=single -mconvert=single
and before issuing the 'btrfs delete', the system could be in too
fragile a state, with extents unnecessarily spread out over two drives,
which is a completely unnecessary situation, and one that also seems to
me to be dangerous in some cases involving potentially malfunctioning
drives.

Please reconsider.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 11:25   ` zedlryqc
@ 2019-09-09 12:18     ` Qu Wenruo
  2019-09-09 12:28       ` Qu Wenruo
                         ` (2 more replies)
  2019-09-10 11:12     ` Austin S. Hemmelgarn
  1 sibling, 3 replies; 111+ messages in thread
From: Qu Wenruo @ 2019-09-09 12:18 UTC (permalink / raw)
  To: zedlryqc, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 8728 bytes --]



On 2019/9/9 7:25 PM, zedlryqc@server53.web-hosting.com wrote:
> 
> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>> 1) Full online backup (or copy, whatever you want to call it)
>>> btrfs backup <filesystem name> <partition name> [-f]
>>> - backups a btrfs filesystem given by <filesystem name> to a partition
>>> <partition name> (with all subvolumes).
>>
>> Why not just btrfs send?
>>
>> Or you want to keep the whole subvolume structures/layout?
> 
> Yes, I want to keep the whole subvolume structures/layout. I want to
> keep everything. Usually, when I want to backup a partition, I want to
> keep everything, and I suppose most other people have a similar idea.
> 
>> I'd say current send/receive is more flex.
> 
> Um, 'flexibility' has nothing to do with it. Send/receive is a
> completely different use case.
> So, each one has some benefits and some drawbacks, but 'send/receive'
> cannot replace 'full online backup'
> 
> Here is where send/receive is lacking:
>     - too complicated to do if many subvolumes are involved
>     - may require recursive subvolume enumeration in order to emulate
> 'full online backup'
>     - may require extra storage space
>     - is not mountable, not easy to browse the backup contents
>     - not easy to recover just a few selected files from backup
> There's probably more things where send/receive is lacking, but I think
> I have given sufficient number of important differences which show that
> send/receive cannot successfully replace the functionality of 'full
> online backup'.
> 
>> And you also needs to understand btrfs also integrates volume
>> management, thus it's not just <partition name>, you also needs RAID
>> level and things like that.
> 
> This is a minor point. So, please, let's not get into too many
> irrelevant details here.
> 
> There can be a sensible default to 'single data, DUP metadata', and a
> way for a user to override this default, but that feature is
> not-so-important. If the user wants to change the RAID level, he can
> easily do it later by mounting the backup.
> 
>>
>> All can be done already by send/receive, although at subvolume level.
> 
> Yeah, maybe I should manually type it all for all subvolumes, one by
> one. Also must be carefull to do it in the correct order if I want it
> not to consume extra space.
> And the backup is not mountable.
> 
> This proposal (workaround) of yours appears to me as too complicated,
> too error prone,
> missing important features.
> 
> But, I just thought, you can actually emulate 'full online backup' with
> this send/receive. Here is how.
> You do a script which does the following:
>     - makes a temporary snapshot of every subvolume
>     - use 'btrfs send' to send all the temporary snapshots, on-the-fly
> (maybe via pipe), in the correct order, to a proces running a 'brtfs
> receive', which should then immediately write it all to the destination
> partition. All the buffers can stay in-memory.
>     - when all the snapshots are received and written to destination,
> fix subvol IDs
>     - delete temporary snapshots from source
> Of course, this script should then be a part of standard btrfs tools.
> 
>> Please check if send/receive is suitable for your use case.
> 
> No. Absolutely not.
> 
> 
>>> 2) Sensible defrag
>>> The defrag is currently a joke.
> 
>>> How to do it:
>>> - The extents must not be unshared, but just shuffled a bit. Unsharing
>>> the extents is, in most situations, not tolerable.
> 
>> I definitely see cases unsharing extents makes sense, so at least we
>> should let user to determine what they want.
> 
> Maybe there are such cases, but I would say that a vast majority of
> users (99,99%) in a vast majority of cases (99,99%) don't want the
> defrag operation to reduce free disk space.
> 
>> What's wrong with current file based defrag?
>> If you want to defrag a subvolume, just iterate through all files.
> 
> I repeat: The defrag should not decrease free space. That's the 'normal'
> expectation.

Since you're talking about btrfs, it's going to do CoW for metadata no
matter what; as long as you change anything, btrfs will cause extra
space usage.
(Although the final result may not cause extra used disk space, as the
freed space is as large as the newly allocated space, to maintain CoW
the newly allocated space can't overlap with the old data.)

Furthermore, talking about snapshots with space wasted by extent
booking, it's definitely possible the user wants to break the shared
extents:

Subvol 257, inode 257 has the following file extents:
(257 EXTENT_DATA 0)
disk bytenr X len 16M
offset 0 num_bytes 4K  << Only 4K of the whole 16M extent is referenced.

Subvol 258, inode 257 has the following file extents:
(257 EXTENT_DATA 0)
disk bytenr X len 16M
offset 0 num_bytes 4K  << Shared with that one in subv 257
(257 EXTENT_DATA 4K)
disk bytenr Y len 16M
offset 0 num_bytes 4K  << Similar case, only 4K of 16M is used.

In that case, the user definitely wants to defrag the file in subvol 258:
if that extent at bytenr Y can be freed, we can free up 16M and allocate
a new 8K extent for subvol 258, ino 257.
(And the user will also want to defrag the extent in subvol 257, ino 257.)
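
(A simplified, single-file way to see this kind of waste, with an
assumed mount point and without the snapshot part:

	dd if=/dev/zero of=/mnt/big bs=16M count=1 oflag=direct  # typically one large data extent
	truncate -s 4K /mnt/big   # the file now references only 4K of it, but the whole
	                          # extent stays allocated until the last reference to it
	                          # is dropped, which is what a defrag that rewrites the
	                          # remaining 4K would achieve

The snapshot case above is the same effect, just with the references
spread over two subvolumes.)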

That's why knowledge of btrfs technical details can make a difference.
Sometimes you may find an idea brilliant and wonder why btrfs is not
implementing it, but if you understand btrfs to some extent, you will
know the answer yourself.


> 
>>> - I think it would be wrong to use a general deduplication algorithm for
>>> this. Instead, the information about the shared extents should be
>>> analyzed given the starting state of the filesystem, and than the
>>> algorithm should produce an optimal solution based on the currently
>>> shared extents.
>>
>> Please be more specific, like giving an example for it.
> 
> Let's say that there is a file FFF with extents e11, e12, e13, e22, e23,
> e33, e34
> - in subvolA the file FFF consists of e11, e12, e13
> - in subvolB the file FFF consists of e11, e22, e23
> - in subvolC the file FFF consists of e11, e22, e33, e34
> 
> After defrag, where 'selected subvolume' is subvolC, the extents are
> ordered on disk as follows:
> 
> e11,e22,e33,e34 - e23 - e12,e13

File FFF in different subvolumes corresponds to different inodes. They
have no knowledge of the corresponding inodes in other subvolumes.

If FFF in subvol C is e11, e22, e33, e34, then that's it.
I still don't see the point.

And what's the on-disk bytenr of all these extents? Which has larger
bytenr and length?

Please provide a better description like xfs_io -c "fiemap -v" output
before and after.
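
For example, something along these lines for each subvolume's copy of
the file (path is a placeholder):

	xfs_io -c "fiemap -v" /mnt/subvolC/FFF   # lists each extent's file offset and physical block range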

> 
> In the list above, the comma denotes neighbouring extents, the dash
> indicates that there can be a possible gap.
> As you can see in the list, the file FFF is fully defragmented in
> subvolC, since its extents are occupying neighbouring disk sectors.
> 
> 
>>> 3) Downgrade to 'single' or 'DUP' (also, general easy way to switch
>>> between RAID levels)
>>>  Currently, as much as I gather, user has to do a "btrfs balance start
>>> -dconvert=single -mconvert=single
>>> ", than delete a drive, which is a bit ridiculous sequence of
>>> operations.
> 
>> That's a shortcut for chunk profile change.
>> My first idea of this is, it could cause more problem than benefit.
>> (It only benefits profile downgrade, thus only makes sense for
>> RAID1->SINGLE, DUP->SINGLE, and RAID10->RAID0, nothing else)
> 
> Those listed cases are exactly the ones I judge to be most important.
> Three important cases.

I'd argue that since it's a downgrade, it's not that important, as most
users want to replace the missing/bad device and maintain the RAID
profile.

> 
>> I still prefer the safer allocate-new-chunk way to convert chunks, even
>> at a cost of extra IO.
> 
> I don't mind whether it allocates new chunks or not. It is better, in my
> opinion, if new chunks are not allocated, but both ways are essentially OK.
> 
> What I am complaining about is that at one point in time, after issuing
> the command:
>     btrfs balance start -dconvert=single -mconvert=single
> and before issuing the 'btrfs delete', the system could be in a too
> fragile state, with extents unnecesarily spread out over two drives,
> which is both a completely unnecessary operation, and it also seems to
> me that it could be dangerous in some situations involving potentially
> malfunctioning drives.

In that case, you just need to replace that malfunctioning device rather
than fall back to SINGLE.

Thanks,
Qu

> 
> Please reconsider.
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 12:18     ` Qu Wenruo
@ 2019-09-09 12:28       ` Qu Wenruo
  2019-09-09 17:11         ` webmaster
  2019-09-09 15:29       ` Graham Cobb
  2019-09-09 16:38       ` webmaster
  2 siblings, 1 reply; 111+ messages in thread
From: Qu Wenruo @ 2019-09-09 12:28 UTC (permalink / raw)
  To: zedlryqc, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 9858 bytes --]



On 2019/9/9 8:18 PM, Qu Wenruo wrote:
> 
> 
> On 2019/9/9 7:25 PM, zedlryqc@server53.web-hosting.com wrote:
>>
>> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>> 1) Full online backup (or copy, whatever you want to call it)
>>>> btrfs backup <filesystem name> <partition name> [-f]
>>>> - backups a btrfs filesystem given by <filesystem name> to a partition
>>>> <partition name> (with all subvolumes).
>>>
>>> Why not just btrfs send?
>>>
>>> Or you want to keep the whole subvolume structures/layout?
>>
>> Yes, I want to keep the whole subvolume structures/layout. I want to
>> keep everything. Usually, when I want to backup a partition, I want to
>> keep everything, and I suppose most other people have a similar idea.
>>
>>> I'd say current send/receive is more flex.
>>
>> Um, 'flexibility' has nothing to do with it. Send/receive is a
>> completely different use case.
>> So, each one has some benefits and some drawbacks, but 'send/receive'
>> cannot replace 'full online backup'
>>
>> Here is where send/receive is lacking:
>>     - too complicated to do if many subvolumes are involved
>>     - may require recursive subvolume enumeration in order to emulate
>> 'full online backup'
>>     - may require extra storage space
>>     - is not mountable, not easy to browse the backup contents
>>     - not easy to recover just a few selected files from backup
>> There's probably more things where send/receive is lacking, but I think
>> I have given sufficient number of important differences which show that
>> send/receive cannot successfully replace the functionality of 'full
>> online backup'.

Forgot to mention this part.

If your primary objective is to migrate your data to another device
online (mounted, without unmounting the fs), then I could say you can
still add a new device and then remove the old device to do that.

That would be even more efficient than LVM (the non-thin-provisioned
kind), as we only move used space.
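
In commands, roughly (device names and mount point are just examples):

	btrfs device add /dev/sdc /mnt       # the new device joins the mounted filesystem
	btrfs device remove /dev/sdb /mnt    # relocates the used data, then drops the old device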

If your objective is to create a full copy as a backup, then I'd say my
new patchset for the btrfs-image data dump may be your best choice.

The only downside is that you need to at least mount the source fs
read-only.

A true online backup is not that easy, as any write can screw up the
backup process, so it must be done unmounted.

Even btrfs send handles this by forcing the source subvolume to be RO,
so I can't find an easy solution to address that.

Thanks,
Qu

>>
>>> And you also needs to understand btrfs also integrates volume
>>> management, thus it's not just <partition name>, you also needs RAID
>>> level and things like that.
>>
>> This is a minor point. So, please, let's not get into too many
>> irrelevant details here.
>>
>> There can be a sensible default to 'single data, DUP metadata', and a
>> way for a user to override this default, but that feature is
>> not-so-important. If the user wants to change the RAID level, he can
>> easily do it later by mounting the backup.
>>
>>>
>>> All can be done already by send/receive, although at subvolume level.
>>
>> Yeah, maybe I should manually type it all for all subvolumes, one by
>> one. Also must be carefull to do it in the correct order if I want it
>> not to consume extra space.
>> And the backup is not mountable.
>>
>> This proposal (workaround) of yours appears to me as too complicated,
>> too error prone,
>> missing important features.
>>
>> But, I just thought, you can actually emulate 'full online backup' with
>> this send/receive. Here is how.
>> You do a script which does the following:
>>     - makes a temporary snapshot of every subvolume
>>     - use 'btrfs send' to send all the temporary snapshots, on-the-fly
>> (maybe via pipe), in the correct order, to a proces running a 'brtfs
>> receive', which should then immediately write it all to the destination
>> partition. All the buffers can stay in-memory.
>>     - when all the snapshots are received and written to destination,
>> fix subvol IDs
>>     - delete temporary snapshots from source
>> Of course, this script should then be a part of standard btrfs tools.
>>
>>> Please check if send/receive is suitable for your use case.
>>
>> No. Absolutely not.
>>
>>
>>>> 2) Sensible defrag
>>>> The defrag is currently a joke.
>>
>>>> How to do it:
>>>> - The extents must not be unshared, but just shuffled a bit. Unsharing
>>>> the extents is, in most situations, not tolerable.
>>
>>> I definitely see cases unsharing extents makes sense, so at least we
>>> should let user to determine what they want.
>>
>> Maybe there are such cases, but I would say that a vast majority of
>> users (99,99%) in a vast majority of cases (99,99%) don't want the
>> defrag operation to reduce free disk space.
>>
>>> What's wrong with current file based defrag?
>>> If you want to defrag a subvolume, just iterate through all files.
>>
>> I repeat: The defrag should not decrease free space. That's the 'normal'
>> expectation.
> 
> Since you're talking about btrfs, it's going to do CoW for metadata not
> matter whatever, as long as you're going to change anything, btrfs will
> cause extra space usage.
> (Although the final result may not cause extra used disk space as freed
> space is as large as newly allocated space, but to maintain CoW, newly
> allocated space can't overlap with old data)
> 
> Further more, talking about snapshots with space wasted by extent
> booking, it's definitely possible user want to break the shared extents:
> 
> Subvol 257, inode 257 has the following file extents:
> (257 EXTENT_DATA 0)
> disk bytenr X len 16M
> offset 0 num_bytes 4k  << Only 4k is referred in the whole 16M extent.
> 
> Subvol 258, inode 257 has the following file extents:
> (257 EXTENT_DATA 0)
> disk bytenr X len 16M
> offset 0 num_bytes 4K  << Shared with that one in subv 257
> (257 EXTENT_DATA 4K)
> disk bytenr Y len 16M
> offset 0 num_bytes 4K  << Similar case, only 4K of 16M is used.
> 
> In that case, user definitely want to defrag file in subvol 258, as if
> that extent at bytenr Y can be freed, we can free up 16M, and allocate a
> new 8K extent for subvol 258, ino 257.
> (And will also want to defrag the extent in subvol 257 ino 257 too)
> 
> That's why knowledge in btrfs tech details can make a difference.
> Sometimes you may find some ideas are brilliant and why btrfs is not
> implementing it, but if you understand btrfs to some extent, you will
> know the answer by yourself.
> 
> 
>>
>>>> - I think it would be wrong to use a general deduplication algorithm for
>>>> this. Instead, the information about the shared extents should be
>>>> analyzed given the starting state of the filesystem, and than the
>>>> algorithm should produce an optimal solution based on the currently
>>>> shared extents.
>>>
>>> Please be more specific, like giving an example for it.
>>
>> Let's say that there is a file FFF with extents e11, e12, e13, e22, e23,
>> e33, e34
>> - in subvolA the file FFF consists of e11, e12, e13
>> - in subvolB the file FFF consists of e11, e22, e23
>> - in subvolC the file FFF consists of e11, e22, e33, e34
>>
>> After defrag, where 'selected subvolume' is subvolC, the extents are
>> ordered on disk as follows:
>>
>> e11,e22,e33,e34 - e23 - e12,e13
> 
> Inode FFF in different subvolumes are different inodes. They have no
> knowledge of other inodes in other subvolumes.
> 
> If FFF in subvol C is e11, e22, e33, e34, then that's it.
> I didn't see the point still.
> 
> And what's the on-disk bytenr of all these extents? Which has larger
> bytenr and length?
> 
> Please provide a better description like xfs_io -c "fiemap -v" output
> before and after.
> 
>>
>> In the list above, the comma denotes neighbouring extents, the dash
>> indicates that there can be a possible gap.
>> As you can see in the list, the file FFF is fully defragmented in
>> subvolC, since its extents are occupying neighbouring disk sectors.
>>
>>
>>>> 3) Downgrade to 'single' or 'DUP' (also, general easy way to switch
>>>> between RAID levels)
>>>>  Currently, as much as I gather, user has to do a "btrfs balance start
>>>> -dconvert=single -mconvert=single
>>>> ", than delete a drive, which is a bit ridiculous sequence of
>>>> operations.
>>
>>> That's a shortcut for chunk profile change.
>>> My first idea of this is, it could cause more problem than benefit.
>>> (It only benefits profile downgrade, thus only makes sense for
>>> RAID1->SINGLE, DUP->SINGLE, and RAID10->RAID0, nothing else)
>>
>> Those listed cases are exactly the ones I judge to be most important.
>> Three important cases.
> 
> I'd argue it's downgrade, not that important. As most users want to
> replace the missing/bad device and maintain the raid profile.
> 
>>
>>> I still prefer the safer allocate-new-chunk way to convert chunks, even
>>> at a cost of extra IO.
>>
>> I don't mind whether it allocates new chunks or not. It is better, in my
>> opinion, if new chunks are not allocated, but both ways are essentially OK.
>>
>> What I am complaining about is that at one point in time, after issuing
>> the command:
>>     btrfs balance start -dconvert=single -mconvert=single
>> and before issuing the 'btrfs delete', the system could be in a too
>> fragile state, with extents unnecesarily spread out over two drives,
>> which is both a completely unnecessary operation, and it also seems to
>> me that it could be dangerous in some situations involving potentially
>> malfunctioning drives.
> 
> In that case, you just need to replace that malfunctioning device other
> than fall back to SINGLE.
> 
> Thanks,
> Qu
> 
>>
>> Please reconsider.
>>
>>
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 12:18     ` Qu Wenruo
  2019-09-09 12:28       ` Qu Wenruo
@ 2019-09-09 15:29       ` Graham Cobb
  2019-09-09 17:24         ` Remi Gauvin
                           ` (3 more replies)
  2019-09-09 16:38       ` webmaster
  2 siblings, 4 replies; 111+ messages in thread
From: Graham Cobb @ 2019-09-09 15:29 UTC (permalink / raw)
  To: Qu Wenruo, zedlryqc, linux-btrfs

On 09/09/2019 13:18, Qu Wenruo wrote:
> 
> 
> On 2019/9/9 7:25 PM, zedlryqc@server53.web-hosting.com wrote:
>> What I am complaining about is that at one point in time, after issuing
>> the command:
>>     btrfs balance start -dconvert=single -mconvert=single
>> and before issuing the 'btrfs delete', the system could be in a too
>> fragile state, with extents unnecesarily spread out over two drives,
>> which is both a completely unnecessary operation, and it also seems to
>> me that it could be dangerous in some situations involving potentially
>> malfunctioning drives.
> 
> In that case, you just need to replace that malfunctioning device other
> than fall back to SINGLE.

Actually, this case is the (only) one of the three that I think would be
very useful (backup is better handled by having a choice of userspace
tools to choose from - I use btrbk - and does anyone really care about
defrag any more?).

I did, recently, have a case where I had started to move my main data
disk to a raid1 setup but my new disk started reporting errors. I didn't
have a spare disk (and didn't have a spare SCSI slot for another disk
anyway). So, I wanted to stop using the new disk and revert to my former
(m=dup, d=single) setup as quickly as possible.

I spent time trying to find a way to do that balance without risking the
single copy of some of the data being stored on the failing disk between
starting the balance and completing the remove. That has two problems:
obviously having the single copy on the failing disk is bad news but,
also, it increases the time taken for the subsequent remove which has to
copy that data back to the remaining disk (where there used to be a
perfectly good copy which was subsequently thrown away during the balance).

In the end, I took the risk and the time of the two steps. In my case, I
had good backups, and actually most of my data was still in a single
profile on the old disk (because the errors starting happening before I
had done the balance to change the profile of all the old data from
single to raid1).

But a balance -dconvert=single-but-force-it-to-go-on-disk-1 would have
been useful. (Actually a "btrfs device mark-for-removal" command would
be better - allow a failing device to be retained for a while, and used
to provide data, but ignore it when looking to store data).

Graham

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 12:18     ` Qu Wenruo
  2019-09-09 12:28       ` Qu Wenruo
  2019-09-09 15:29       ` Graham Cobb
@ 2019-09-09 16:38       ` webmaster
  2019-09-09 23:44         ` Qu Wenruo
  2 siblings, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-09 16:38 UTC (permalink / raw)
  To: linux-btrfs


Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:

>>>> 2) Sensible defrag
>>>> The defrag is currently a joke.
>>
>> Maybe there are such cases, but I would say that a vast majority of
>> users (99,99%) in a vast majority of cases (99,99%) don't want the
>> defrag operation to reduce free disk space.
>>
>>> What's wrong with current file based defrag?
>>> If you want to defrag a subvolume, just iterate through all files.
>>
>> I repeat: The defrag should not decrease free space. That's the 'normal'
>> expectation.
>
> Since you're talking about btrfs, it's going to do CoW for metadata not
> matter whatever, as long as you're going to change anything, btrfs will
> cause extra space usage.
> (Although the final result may not cause extra used disk space as freed
> space is as large as newly allocated space, but to maintain CoW, newly
> allocated space can't overlap with old data)

It is OK for defrag to temporarily decrease free space while the defrag
operation is in progress. That's normal.

> Further more, talking about snapshots with space wasted by extent
> booking, it's definitely possible user want to break the shared extents:
>
> Subvol 257, inode 257 has the following file extents:
> (257 EXTENT_DATA 0)
> disk bytenr X len 16M
> offset 0 num_bytes 4k  << Only 4k is referred in the whole 16M extent.
>
> Subvol 258, inode 257 has the following file extents:
> (257 EXTENT_DATA 0)
> disk bytenr X len 16M
> offset 0 num_bytes 4K  << Shared with that one in subv 257
> (257 EXTENT_DATA 4K)
> disk bytenr Y len 16M
> offset 0 num_bytes 4K  << Similar case, only 4K of 16M is used.
>
> In that case, user definitely want to defrag file in subvol 258, as if
> that extent at bytenr Y can be freed, we can free up 16M, and allocate a
> new 8K extent for subvol 258, ino 257.
> (And will also want to defrag the extent in subvol 257 ino 257 too)

You are confusing the actual defrag with a separate concern; let's
call it 'reserved space optimization'. It is about partially used
extents. The actual name 'reserved space optimization' doesn't matter;
I just made it up.

'Reserved space optimization' is usually performed as a part of the
defrag operation, but it doesn't have to be, as the actual defrag is
something separate.

Yes, 'reserved space optimization' can break up extents.

'Reserved space optimization' can either decrease or increase the free
space. If the algorithm determines that more space should be reserved,
then free space will decrease. If the algorithm determines that less
space should be reserved, then free space will increase.

The 'reserved space optimization' can be accomplished such that the  
free space does not decrease, if such behavior is needed.

Also, the defrag operation can join some extents. In my original example,
the extents e33 and e34 can be fused into one.

> That's why knowledge in btrfs tech details can make a difference.
> Sometimes you may find some ideas are brilliant and why btrfs is not
> implementing it, but if you understand btrfs to some extent, you will
> know the answer by yourself.

Yes, it is true, but what you are posting so far are all 'red
herring'-type arguments. They are just irrelevant concerns, and you
have got me explaining things as I would to a little baby. I don't
know whether I have stumbled on some rookie member of the btrfs
project, or whether you are just lazy and don't want to think about or
consider my proposals.

When I post an explanation, please try to UNDERSTAND HOW IT CAN WORK
and fill in the missing gaps, because there are tons of them and I
can't explain everything via three e-mail posts. Don't just come up
with some half-baked, forced, illogical reason why things are better
as they are.

>>>> - I think it would be wrong to use a general deduplication algorithm for
>>>> this. Instead, the information about the shared extents should be
>>>> analyzed given the starting state of the filesystem, and than the
>>>> algorithm should produce an optimal solution based on the currently
>>>> shared extents.
>>>
>>> Please be more specific, like giving an example for it.
>>
>> Let's say that there is a file FFF with extents e11, e12, e13, e22, e23,
>> e33, e34
>> - in subvolA the file FFF consists of e11, e12, e13
>> - in subvolB the file FFF consists of e11, e22, e23
>> - in subvolC the file FFF consists of e11, e22, e33, e34
>>
>> After defrag, where 'selected subvolume' is subvolC, the extents are
>> ordered on disk as follows:
>>
>> e11,e22,e33,e34 - e23 - e12,e13
>
> Inode FFF in different subvolumes are different inodes. They have no
> knowledge of other inodes in other subvolumes.

You can easily notice that, if necessary, the defrag algorithm can  
work without this knowledge, that is, without knowledge of other  
versions of FFF.

This time I'm leaving it to you to figure out how.

Another red herring.

> If FFF in subvol C is e11, e22, e33, e34, then that's it.
> I didn't see the point still.

Now I need to explain like I would to a baby.

If the extents e11, e22, e33, e34 are stored in neighbouring sectors,  
then the disk data reads are faster because they become sequential, as  
opposed to spread out.

So, while the file FFF in subvolC still has 4 extents like it had  
before defrag, reading of those 4 extents is much faster than before  
because the read can be sequential.

So, the defrag can actually be performed without fusing any extents.  
It would still have a noticeable performance benefit.

As I have already said, the defrag operation can join (fuse) some
extents. In my original example, the extents e33 and e34 can be fused
into one.

> And what's the on-disk bytenr of all these extents? Which has larger
> bytenr and length?

For the sake of simplicity, let's say that all the extents in the  
example have equal length (so, you can choose ANY size), and are fully  
used.

> Please provide a better description like xfs_io -c "fiemap -v" output
> before and after.

No. My example is simple and clear. Nit-picking like you are doing
is not helpful. Concentrate, think, try to figure it out.

>>> That's a shortcut for chunk profile change.
>>> My first idea of this is, it could cause more problem than benefit.
>>> (It only benefits profile downgrade, thus only makes sense for
>>> RAID1->SINGLE, DUP->SINGLE, and RAID10->RAID0, nothing else)
>>
>> Those listed cases are exactly the ones I judge to be most important.
>> Three important cases.
>
> I'd argue it's downgrade, not that important. As most users want to
> replace the missing/bad device and maintain the raid profile.
>
>> What I am complaining about is that at one point in time, after issuing
>> the command:
>>     btrfs balance start -dconvert=single -mconvert=single
>> and before issuing the 'btrfs delete', the system could be in a too
>> fragile state, with extents unnecesarily spread out over two drives,
>> which is both a completely unnecessary operation, and it also seems to
>> me that it could be dangerous in some situations involving potentially
>> malfunctioning drives.
>
> In that case, you just need to replace that malfunctioning device other
> than fall back to SINGLE.

You are assuming that the user has the time and money to replace the
malfunctioning drive. In A LOT of cases, this is not true.

What if the drive is failing, but the user has some important work to
finish?
He has a presentation to give. He doesn't want the presentation to
be interrupted by a failing disk drive.

What if the user doesn't have any spare SATA cables on hand?

What if the user doesn't have any spare space in the case? What if it
is a laptop computer?

While a user might want to maintain RAID1 long-term, in the short
term he might want to perform a downgrade.





^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 12:28       ` Qu Wenruo
@ 2019-09-09 17:11         ` webmaster
  2019-09-10 17:39           ` Andrei Borzenkov
  0 siblings, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-09 17:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-btrfs


Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:

> On 2019/9/9 8:18 PM, Qu Wenruo wrote:
>>
>>
>> On 2019/9/9 7:25 PM, zedlryqc@server53.web-hosting.com wrote:
>>>
>>> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>>> 1) Full online backup (or copy, whatever you want to call it)
>>>>> btrfs backup <filesystem name> <partition name> [-f]
>>>>> - backups a btrfs filesystem given by <filesystem name> to a partition
>>>>> <partition name> (with all subvolumes).
>>>>
>>>> Why not just btrfs send?
>>>>
>>>> Or you want to keep the whole subvolume structures/layout?
>>>
>>> Yes, I want to keep the whole subvolume structures/layout. I want to
>>> keep everything. Usually, when I want to backup a partition, I want to
>>> keep everything, and I suppose most other people have a similar idea.
>>>
>>>> I'd say current send/receive is more flex.
>>>
>>> Um, 'flexibility' has nothing to do with it. Send/receive is a
>>> completely different use case.
>>> So, each one has some benefits and some drawbacks, but 'send/receive'
>>> cannot replace 'full online backup'
>>>
>>> Here is where send/receive is lacking:
>>>     - too complicated to do if many subvolumes are involved
>>>     - may require recursive subvolume enumeration in order to emulate
>>> 'full online backup'
>>>     - may require extra storage space
>>>     - is not mountable, not easy to browse the backup contents
>>>     - not easy to recover just a few selected files from backup
>>> There's probably more things where send/receive is lacking, but I think
>>> I have given sufficient number of important differences which show that
>>> send/receive cannot successfully replace the functionality of 'full
>>> online backup'.
>
> Forgot to mention this part.
>
> If your primary objective is to migrate your data to another device
> online (mounted, without unmount any of the fs).

This is not the primary objective. The primary objective is to produce  
a full, online, easy-to-use, robust backup. But let's say we need to  
do migration...
>
> Then I could say, you can still add a new device, then remove the old
> device to do that.

If the source filesystem already uses RAID1 then, yes, you could do
it, but it would be too slow, and it would need a lot of user
intervention: so many commands typed, so many ways to do it wrong, to
make a mistake.

Too cumbersome. Too wasteful of time and resources.

> That would be even more efficient than LVM (not thin provisioned one),
> as we only move used space.

In fact, you can do this kind of full online backup with the help of
mdadm RAID, or some other RAID solution. It can already be done, no
need to add 'btrfs backup'.

But, again, it is too cumbersome, too inflexible, with too many
problems; and the user would have to set up a degraded mdadm RAID in
front and run with that degraded mdadm RAID all the time (since btrfs
RAID would be actually protecting the data).

> If your objective is to create a full copy as backup, then I'd say my
> new patchset of btrfs-image data dump may be your best choice.

It should be mountable. It should be performed online. I have never
heard of btrfs-image; I need the docs to see whether this btrfs-image
is good enough.

> The only down side is, you need to at least mount the source fs to RO mode.

No. That's not really an online backup. Not good enough.

> The true on-line backup is not that easy, especially any write can screw
> up your backup process, so it must be done unmounted.

Nope, I disagree.

First, there is the RAID1-like solution, which is easy to perform
(just send all new writes to both the source and the destination). It's
the same thing that mdadm RAID1 would do (as I mentioned a few
paragraphs above).
But this solution may have a performance concern when the
destination drive is too slow.

Fortunately, with btrfs, an online backup is easier than usual. To
produce a frozen snapshot of the entire filesystem, just create a
read-only snapshot of every subvolume (this is not 100% consistent, I
know, but it is good enough).

But I'm just repeating myself, I already wrote this in the first email.

So, in conclusion I disagree that true on-line backup is not easy.

> Even btrfs send handles this by forcing the source subvolume to be RO,
> so I can't find an easy solution to address that.

This is a digression, but I would say that you first make a temporary  
RO snapshot of the source subvolume, then use 'btrfs send' on the  
temporary snapshot, then delete the temporary snapshot.

Oh, my.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 15:29       ` Graham Cobb
@ 2019-09-09 17:24         ` Remi Gauvin
  2019-09-09 19:26         ` webmaster
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 111+ messages in thread
From: Remi Gauvin @ 2019-09-09 17:24 UTC (permalink / raw)
  To: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1057 bytes --]

On 2019-09-09 11:29 a.m., Graham Cobb wrote:

>  and does anyone really care about
> defrag any more?).
> 


Err, yes, yes absolutely.

I don't have any issues with the current btrfs defrag implementation,
but defrag is *vital* for btrfs (and the current one works just as the
OP requested, as far as I can tell, recursively for a subvolume).

Just booting Windows on a BTRFS virtual image, for example, will create
almost 20,000 file fragments. Even on SSDs, you get into problems
trying to work with files that are over 200,000 fragments.

Another huge problem is rsync --inplace, which is a perfect backup
solution for taking advantage of BTRFS snapshots, but it fragments
large files into tiny pieces (and subsequently creates files that are
very slow to read). For some reason, autodefrag doesn't catch that one
either.
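
A manual follow-up pass along these lines can work around that; paths
are only examples:

	rsync --inplace -a source/ /mnt/backup/current/             # update the backup tree in place
	btrfs filesystem defragment -r -t 32M /mnt/backup/current   # then explicitly defragment the rewritten files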

But the wiki could do a better job of trying to explain that the
snapshot duplication of defrag only affects the fragmented portions.
As I understand it, it's really only a problem when using defrag to
change compression.





[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 15:29       ` Graham Cobb
  2019-09-09 17:24         ` Remi Gauvin
@ 2019-09-09 19:26         ` webmaster
  2019-09-10 19:22           ` Austin S. Hemmelgarn
  2019-09-09 23:24         ` Qu Wenruo
  2019-09-09 23:25         ` webmaster
  3 siblings, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-09 19:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-btrfs

This post is a reply to Remi Gauvin's post, but the email got lost so  
I can't reply to him directly.

Remi Gauvin wrote on 2019-09-09 17:24 :
>
> On 2019-09-09 11:29 a.m., Graham Cobb wrote:
>
>>  and does anyone really care about
>> defrag any more?).
>>
>
>
> Err, yes, yes absolutely.
>
> I don't have any issues with the current btrfs defrag implementions, but
> it's *vital* for btrfs. (which works just as the OP requested, as far as
> I can tell, recursively for a subvolume)
>
> Just booting Windows on a BTRFS virtual image, for example, will create
> almost 20,000 file fragments.  Even on SSD's, you get into problems
> trying to work with files that are over 200,000 fragments.
>
> Another huge problem is rsync --inplace.  which is perfect backup
> solution to take advantage of BTRFS snapshots, but fragments larges
> files into tiny pieces (and subsequently creates files that are very
> slow to read.).. for some reason, autodefrag doesn't catch that one either.
>
> But the wiki could do a beter job of trying to explain that the snapshot
> duplication of defrag only affects the fragmented portions.  As I
> understand, it's really only a problem when using defrag to change
> compression.


Ok, a few things.

First, my defrag suggestion doesn't EVER unshare extents. The defrag  
should never unshare, not even a single extent. Why? Because it  
violates the expectation that defrag would not decrease free space.

Defrag may break up extents. Defrag may fuse extents. But it shouldn't
ever unshare extents.

Therefore, I doubt that the current defrag does "just as the OP  
requested". Nonsense. The current implementation does the unsharing  
all the time.

Second, I have never used btrfs defrag in my life, despite managing at
least 10 btrfs filesystems. I can't, because all my btrfs volumes
have a lot of subvolumes, so I'm afraid that defrag will unshare much
more than I can tolerate. In my subvolumes, over 90% of the data is
shared. If all subvolumes were to be unshared, the disk usage would
likely increase tenfold, and that I cannot afford.

I agree that btrfs defrag is vital. But currently, it's unusable for  
many use cases.

Also, I don't quite understand what the poster means by "the snapshot
duplication of defrag only affects the fragmented portions". Possibly
it means, approximately: if a file wasn't modified in the current
(latest) subvolume, it doesn't need to be unshared. But that would
still unshare all the log files, for example, and all files that have
been appended to, etc... that's quite bad. Even if just one byte was
appended to a log file, defrag will unshare the entire file (I
suppose).


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 15:29       ` Graham Cobb
  2019-09-09 17:24         ` Remi Gauvin
  2019-09-09 19:26         ` webmaster
@ 2019-09-09 23:24         ` Qu Wenruo
  2019-09-09 23:25         ` webmaster
  3 siblings, 0 replies; 111+ messages in thread
From: Qu Wenruo @ 2019-09-09 23:24 UTC (permalink / raw)
  To: Graham Cobb, zedlryqc, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2766 bytes --]



On 2019/9/9 下午11:29, Graham Cobb wrote:
> On 09/09/2019 13:18, Qu Wenruo wrote:
>>
>>
>> On 2019/9/9 下午7:25, zedlryqc@server53.web-hosting.com wrote:
>>> What I am complaining about is that at one point in time, after issuing
>>> the command:
>>>     btrfs balance start -dconvert=single -mconvert=single
>>> and before issuing the 'btrfs delete', the system could be in a too
>>> fragile state, with extents unnecesarily spread out over two drives,
>>> which is both a completely unnecessary operation, and it also seems to
>>> me that it could be dangerous in some situations involving potentially
>>> malfunctioning drives.
>>
>> In that case, you just need to replace that malfunctioning device other
>> than fall back to SINGLE.
> 
> Actually, this case is the (only) one of the three that I think would be
> very useful (backup is better handled by having a choice of userspace
> tools to choose from - I use btrbk - and does anyone really care about
> defrag any more?).
> 
> I did, recently, have a case where I had started to move my main data
> disk to a raid1 setup but my new disk started reporting errors. I didn't
> have a spare disk (and didn't have a spare SCSI slot for another disk
> anyway). So, I wanted to stop using the new disk and revert to my former
> (m=dup, d=single) setup as quickly as possible.
> 
> I spent time trying to find a way to do that balance without risking the
> single copy of some of the data being stored on the failing disk between
> starting the balance and completing the remove. That has two problems:
> obviously having the single copy on the failing disk is bad news but,
> also, it increases the time taken for the subsequent remove which has to
> copy that data back to the remaining disk (where there used to be a
> perfectly good copy which was subsequently thrown away during the balance).
> 
> In the end, I took the risk and the time of the two steps. In my case, I
> had good backups, and actually most of my data was still in a single
> profile on the old disk (because the errors starting happening before I
> had done the balance to change the profile of all the old data from
> single to raid1).
> 
> But a balance -dconvert=single-but-force-it-to-go-on-disk-1 would have
> been useful. (Actually a "btrfs device mark-for-removal" command would
> be better - allow a failing device to be retained for a while, and used
> to provide data, but ignore it when looking to store data).

Indeed, it makes sense.

It would be some user-defined chunk allocation behavior; in that case,
we need to think carefully about the interface first.

BTW, have you tried to mark the malfunctioning disk RO and mount it?

Thanks,
Qu
> 
> Graham
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 15:29       ` Graham Cobb
                           ` (2 preceding siblings ...)
  2019-09-09 23:24         ` Qu Wenruo
@ 2019-09-09 23:25         ` webmaster
  3 siblings, 0 replies; 111+ messages in thread
From: webmaster @ 2019-09-09 23:25 UTC (permalink / raw)
  To: Graham Cobb; +Cc: Qu Wenruo, linux-btrfs


Quoting Graham Cobb <g.btrfs@cobb.uk.net>:

> On 09/09/2019 13:18, Qu Wenruo wrote:
>>
>>
>> On 2019/9/9 下午7:25, zedlryqc@server53.web-hosting.com wrote:
>>> What I am complaining about is that at one point in time, after issuing
>>> the command:
>>>     btrfs balance start -dconvert=single -mconvert=single
>>> and before issuing the 'btrfs delete', the system could be in a too
>>> fragile state, with extents unnecesarily spread out over two drives,
>>> which is both a completely unnecessary operation, and it also seems to
>>> me that it could be dangerous in some situations involving potentially
>>> malfunctioning drives.
>>
>> In that case, you just need to replace that malfunctioning device other
>> than fall back to SINGLE.
>
> Actually, this case is the (only) one of the three that I think would be
> very useful (backup is better handled by having a choice of userspace
> tools to choose from - I use btrbk - and does anyone really care about
> defrag any more?).
>
> I did, recently, have a case where I had started to move my main data
> disk to a raid1 setup but my new disk started reporting errors. I didn't
> have a spare disk (and didn't have a spare SCSI slot for another disk
> anyway). So, I wanted to stop using the new disk and revert to my former
> (m=dup, d=single) setup as quickly as possible.
>
> I spent time trying to find a way to do that balance without risking the
> single copy of some of the data being stored on the failing disk between
> starting the balance and completing the remove. That has two problems:
> obviously having the single copy on the failing disk is bad news but,
> also, it increases the time taken for the subsequent remove which has to
> copy that data back to the remaining disk (where there used to be a
> perfectly good copy which was subsequently thrown away during the balance).
>
> In the end, I took the risk and the time of the two steps. In my case, I
> had good backups, and actually most of my data was still in a single
> profile on the old disk (because the errors starting happening before I
> had done the balance to change the profile of all the old data from
> single to raid1).
>
> But a balance -dconvert=single-but-force-it-to-go-on-disk-1 would have
> been useful. (Actually a "btrfs device mark-for-removal" command would
> be better - allow a failing device to be retained for a while, and used
> to provide data, but ignore it when looking to store data).
>
> Graham

Thank you. That is a very nice example/story/use case, and I also
like the proposed solution: btrfs device mark-for-removal.

Of course, you didn't have a spare drive. I guess most users don't
have one; I don't. They buy it when one of the drives in RAID1 fails.

I was thinking of a more general solution, so I proposed the RAID
level change feature, but this suggestion of yours covers most of the
use cases. Maybe the case of a RAID10 level downgrade should also be
considered, for circumstances similar to what you described.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 16:38       ` webmaster
@ 2019-09-09 23:44         ` Qu Wenruo
  2019-09-10  0:00           ` Chris Murphy
  2019-09-10  0:06           ` webmaster
  0 siblings, 2 replies; 111+ messages in thread
From: Qu Wenruo @ 2019-09-09 23:44 UTC (permalink / raw)
  To: webmaster, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4057 bytes --]



On 2019/9/10 上午12:38, webmaster@zedlx.com wrote:
> 
> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
> 
>>>>> 2) Sensible defrag
>>>>> The defrag is currently a joke.
>>>
>>> Maybe there are such cases, but I would say that a vast majority of
>>> users (99,99%) in a vast majority of cases (99,99%) don't want the
>>> defrag operation to reduce free disk space.
>>>
>>>> What's wrong with current file based defrag?
>>>> If you want to defrag a subvolume, just iterate through all files.
>>>
>>> I repeat: The defrag should not decrease free space. That's the 'normal'
>>> expectation.
>>
>> Since you're talking about btrfs, it's going to do CoW for metadata not
>> matter whatever, as long as you're going to change anything, btrfs will
>> cause extra space usage.
>> (Although the final result may not cause extra used disk space as freed
>> space is as large as newly allocated space, but to maintain CoW, newly
>> allocated space can't overlap with old data)
> 
> It is OK for defrag to temporarily decrease free space while defrag
> operation is in progress. That's normal.
> 
>> Further more, talking about snapshots with space wasted by extent
>> booking, it's definitely possible user want to break the shared extents:
>>
>> Subvol 257, inode 257 has the following file extents:
>> (257 EXTENT_DATA 0)
>> disk bytenr X len 16M
>> offset 0 num_bytes 4k  << Only 4k is referred in the whole 16M extent.
>>
>> Subvol 258, inode 257 has the following file extents:
>> (257 EXTENT_DATA 0)
>> disk bytenr X len 16M
>> offset 0 num_bytes 4K  << Shared with that one in subv 257
>> (257 EXTENT_DATA 4K)
>> disk bytenr Y len 16M
>> offset 0 num_bytes 4K  << Similar case, only 4K of 16M is used.
>>
>> In that case, user definitely want to defrag file in subvol 258, as if
>> that extent at bytenr Y can be freed, we can free up 16M, and allocate a
>> new 8K extent for subvol 258, ino 257.
>> (And will also want to defrag the extent in subvol 257 ino 257 too)
> 
> You are confusing the actual defrag with a separate concern, let's call
> it 'reserved space optimization'. It is about partially used extents.
> The actual name 'reserved space optimization' doesn't matter, I just
> made it up.

Then when it's not snapshotted, it's plain defrag.

How do things go from "reserved space optimization" to "plain defrag"
just because of snapshots?

> 
> 'reserved space optimization' is usually performed as a part of the
> defrag operation, but it doesn't have to be, as the actual defrag is
> something separate.
> 
> Yes, 'reserved space optimization' can break up extents.
> 
> 'reserved space optimization' can either decrease or increase the free
> space. If the algorithm determines that more space should be reserved,
> than free space will decrease. If the algorithm determines that less
> space should be reserved, than free space will increase.
> 
> The 'reserved space optimization' can be accomplished such that the free
> space does not decrease, if such behavior is needed.
> 
> Also, the defrag operation can join some extents. In my original example,
> the extents e33 and e34 can be fused into one.

Btrfs defrag works by creating new extents containing the old data.

So if btrfs decides to defrag, no old extents will be used.
It will all be new extents.

That's why your proposal is freaking strange here.

> 
>> That's why knowledge in btrfs tech details can make a difference.
>> Sometimes you may find some ideas are brilliant and why btrfs is not
>> implementing it, but if you understand btrfs to some extent, you will
>> know the answer by yourself.
> 
> Yes, it is true, but what you are posting so far are all 'red
> herring'-type arguments. It's just some irrelevant concerns, and you
> just got me explaining thinks like I would to a little baby. I don't
> know whether I stumbled on some rookie member of btrfs project, or you
> are just lazy and you don't want to think or consider my proposals.

Go check my name in git log.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 23:44         ` Qu Wenruo
@ 2019-09-10  0:00           ` Chris Murphy
  2019-09-10  0:51             ` Qu Wenruo
  2019-09-10  0:06           ` webmaster
  1 sibling, 1 reply; 111+ messages in thread
From: Chris Murphy @ 2019-09-10  0:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: webmaster, Btrfs BTRFS

On Mon, Sep 9, 2019 at 5:44 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2019/9/10 上午12:38, webmaster@zedlx.com wrote:
> > Yes, it is true, but what you are posting so far are all 'red
> > herring'-type arguments. It's just some irrelevant concerns, and you
> > just got me explaining thinks like I would to a little baby. I don't
> > know whether I stumbled on some rookie member of btrfs project, or you
> > are just lazy and you don't want to think or consider my proposals.
>
> Go check my name in git log.
>

Hey Qu, how do I join this rookie club? :-D

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 23:44         ` Qu Wenruo
  2019-09-10  0:00           ` Chris Murphy
@ 2019-09-10  0:06           ` webmaster
  2019-09-10  0:48             ` Qu Wenruo
  1 sibling, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-10  0:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:

> On 2019/9/10 上午12:38, webmaster@zedlx.com wrote:
>>
>> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>
>>>>>> 2) Sensible defrag
>>>>>> The defrag is currently a joke.
>>>>
>>>> Maybe there are such cases, but I would say that a vast majority of
>>>> users (99,99%) in a vast majority of cases (99,99%) don't want the
>>>> defrag operation to reduce free disk space.
>>>>
>>>>> What's wrong with current file based defrag?
>>>>> If you want to defrag a subvolume, just iterate through all files.
>>>>
>>>> I repeat: The defrag should not decrease free space. That's the 'normal'
>>>> expectation.
>>>
>>> Since you're talking about btrfs, it's going to do CoW for metadata not
>>> matter whatever, as long as you're going to change anything, btrfs will
>>> cause extra space usage.
>>> (Although the final result may not cause extra used disk space as freed
>>> space is as large as newly allocated space, but to maintain CoW, newly
>>> allocated space can't overlap with old data)
>>
>> It is OK for defrag to temporarily decrease free space while defrag
>> operation is in progress. That's normal.
>>
>>> Further more, talking about snapshots with space wasted by extent
>>> booking, it's definitely possible user want to break the shared extents:
>>>
>>> Subvol 257, inode 257 has the following file extents:
>>> (257 EXTENT_DATA 0)
>>> disk bytenr X len 16M
>>> offset 0 num_bytes 4k  << Only 4k is referred in the whole 16M extent.
>>>
>>> Subvol 258, inode 257 has the following file extents:
>>> (257 EXTENT_DATA 0)
>>> disk bytenr X len 16M
>>> offset 0 num_bytes 4K  << Shared with that one in subv 257
>>> (257 EXTENT_DATA 4K)
>>> disk bytenr Y len 16M
>>> offset 0 num_bytes 4K  << Similar case, only 4K of 16M is used.
>>>
>>> In that case, user definitely want to defrag file in subvol 258, as if
>>> that extent at bytenr Y can be freed, we can free up 16M, and allocate a
>>> new 8K extent for subvol 258, ino 257.
>>> (And will also want to defrag the extent in subvol 257 ino 257 too)
>>
>> You are confusing the actual defrag with a separate concern, let's call
>> it 'reserved space optimization'. It is about partially used extents.
>> The actual name 'reserved space optimization' doesn't matter, I just
>> made it up.
>
> Then when it's not snapshotted, it's plain defrag.
>
> How things go from "reserved space optimization" to "plain defrag" just
> because snapshots?

I'm not sure that I'm still following you here.

I'm just saying that when you have some unused space within an extent
and you want the defrag to free it up, that is OK, but such a thing is
not the main focus of the defrag operation. So you are giving me an
edge case here which is only half-relevant and can easily be solved.
The extent just needs to be split up into pieces; it's nothing special.

>> 'reserved space optimization' is usually performed as a part of the
>> defrag operation, but it doesn't have to be, as the actual defrag is
>> something separate.
>>
>> Yes, 'reserved space optimization' can break up extents.
>>
>> 'reserved space optimization' can either decrease or increase the free
>> space. If the algorithm determines that more space should be reserved,
>> than free space will decrease. If the algorithm determines that less
>> space should be reserved, than free space will increase.
>>
>> The 'reserved space optimization' can be accomplished such that the free
>> space does not decrease, if such behavior is needed.
>>
>> Also, the defrag operation can join some extents. In my original example,
>> the extents e33 and e34 can be fused into one.
>
> Btrfs defrag works by creating new extents containing the old data.
>
> So if btrfs decides to defrag, no old extents will be used.
> It will all be new extents.
>
> That's why your proposal is freaking strange here.

Ok, but: can the NEW extents still be shared? If you had an extent E88  
shared by 4 files in different subvolumes, can it be copied to another  
place and still be shared by the original 4 files? I guess that the  
answer is YES. And, that's the only requirement for a good defrag  
algorithm that doesn't shrink free space.

Perhaps the metadata extents need to be unshared. That is OK. But I
guess that after a typical defrag, the sharing ratio in metadata
wouldn't change much.

>>> That's why knowledge in btrfs tech details can make a difference.
>>> Sometimes you may find some ideas are brilliant and why btrfs is not
>>> implementing it, but if you understand btrfs to some extent, you will
>>> know the answer by yourself.
>>
>> Yes, it is true, but what you are posting so far are all 'red
>> herring'-type arguments. It's just some irrelevant concerns, and you
>> just got me explaining thinks like I would to a little baby. I don't
>> know whether I stumbled on some rookie member of btrfs project, or you
>> are just lazy and you don't want to think or consider my proposals.
>
> Go check my name in git log.

I didn't check yet. Ok, let's just try to communicate here, I'm dead serious.

I can't understand a defrag that substantially decreases free space. I
mean, each such defrag is a lottery, because you might end up with a
practically unusable file system if the partition fills up.

CURRENT DEFRAG IS A LOTTERY!

How bad is that?


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  0:06           ` webmaster
@ 2019-09-10  0:48             ` Qu Wenruo
  2019-09-10  1:24               ` webmaster
  2019-09-11  0:26               ` webmaster
  0 siblings, 2 replies; 111+ messages in thread
From: Qu Wenruo @ 2019-09-10  0:48 UTC (permalink / raw)
  To: webmaster; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 8151 bytes --]



On 2019/9/10 上午8:06, webmaster@zedlx.com wrote:
> 
> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
> 
>> On 2019/9/10 上午12:38, webmaster@zedlx.com wrote:
>>>
>>> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>
>>>>>>> 2) Sensible defrag
>>>>>>> The defrag is currently a joke.
>>>>>
>>>>> Maybe there are such cases, but I would say that a vast majority of
>>>>> users (99,99%) in a vast majority of cases (99,99%) don't want the
>>>>> defrag operation to reduce free disk space.
>>>>>
>>>>>> What's wrong with current file based defrag?
>>>>>> If you want to defrag a subvolume, just iterate through all files.
>>>>>
>>>>> I repeat: The defrag should not decrease free space. That's the
>>>>> 'normal'
>>>>> expectation.
>>>>
>>>> Since you're talking about btrfs, it's going to do CoW for metadata not
>>>> matter whatever, as long as you're going to change anything, btrfs will
>>>> cause extra space usage.
>>>> (Although the final result may not cause extra used disk space as freed
>>>> space is as large as newly allocated space, but to maintain CoW, newly
>>>> allocated space can't overlap with old data)
>>>
>>> It is OK for defrag to temporarily decrease free space while defrag
>>> operation is in progress. That's normal.
>>>
>>>> Further more, talking about snapshots with space wasted by extent
>>>> booking, it's definitely possible user want to break the shared
>>>> extents:
>>>>
>>>> Subvol 257, inode 257 has the following file extents:
>>>> (257 EXTENT_DATA 0)
>>>> disk bytenr X len 16M
>>>> offset 0 num_bytes 4k  << Only 4k is referred in the whole 16M extent.
>>>>
>>>> Subvol 258, inode 257 has the following file extents:
>>>> (257 EXTENT_DATA 0)
>>>> disk bytenr X len 16M
>>>> offset 0 num_bytes 4K  << Shared with that one in subv 257
>>>> (257 EXTENT_DATA 4K)
>>>> disk bytenr Y len 16M
>>>> offset 0 num_bytes 4K  << Similar case, only 4K of 16M is used.
>>>>
>>>> In that case, user definitely want to defrag file in subvol 258, as if
>>>> that extent at bytenr Y can be freed, we can free up 16M, and
>>>> allocate a
>>>> new 8K extent for subvol 258, ino 257.
>>>> (And will also want to defrag the extent in subvol 257 ino 257 too)
>>>
>>> You are confusing the actual defrag with a separate concern, let's call
>>> it 'reserved space optimization'. It is about partially used extents.
>>> The actual name 'reserved space optimization' doesn't matter, I just
>>> made it up.
>>
>> Then when it's not snapshotted, it's plain defrag.
>>
>> How things go from "reserved space optimization" to "plain defrag" just
>> because snapshots?
> 
> I'm not sure that I'm still following you here.
> 
> I'm just saying that when you have some unused space within an extent
> and you want the defrag to free it up, that is OK, but such thing is not
> the main focus of the defrag operation. So you are giving me some edge
> case here which is half-relevant and it can be easily solved. The extent
> just needs to be split up into pieces, it's nothing special.
> 
>>> 'reserved space optimization' is usually performed as a part of the
>>> defrag operation, but it doesn't have to be, as the actual defrag is
>>> something separate.
>>>
>>> Yes, 'reserved space optimization' can break up extents.
>>>
>>> 'reserved space optimization' can either decrease or increase the free
>>> space. If the algorithm determines that more space should be reserved,
>>> than free space will decrease. If the algorithm determines that less
>>> space should be reserved, than free space will increase.
>>>
>>> The 'reserved space optimization' can be accomplished such that the free
>>> space does not decrease, if such behavior is needed.
>>>
>>> Also, the defrag operation can join some extents. In my original
>>> example,
>>> the extents e33 and e34 can be fused into one.
>>
>> Btrfs defrag works by creating new extents containing the old data.
>>
>> So if btrfs decides to defrag, no old extents will be used.
>> It will all be new extents.
>>
>> That's why your proposal is freaking strange here.
> 
> Ok, but: can the NEW extents still be shared?

They can only be shared by reflink, not automatically; so if btrfs
decides to defrag, the new extents will not be shared at all.

> If you had an extent E88
> shared by 4 files in different subvolumes, can it be copied to another
> place and still be shared by the original 4 files?

Not for current btrfs.

> I guess that the
> answer is YES. And, that's the only requirement for a good defrag
> algorithm that doesn't shrink free space.

We may go that direction.

The biggest burden here is that btrfs needs to do an expensive
full-backref walk to determine how many files are referring to this
extent, and then change them all to refer to the new extent.

It's feasible if the extent is not shared by many, e.g. if the extent
is only shared by ~10 or ~50 subvolumes/files.

But what will happen if it's shared by 1000 subvolumes? That would be
a performance burden. And trust me, we have already experienced such a
disaster in qgroup; that's why we want to avoid such cases.

Another problem is: what if some of the subvolumes are read-only,
should we touch them or not? (I guess not.)
Then the defrag will not be so complete; badly fragmented extents will
still be in RO subvols.

So the devil is still in the details, again and again.

> 
> Perhaps the metadata extents need to be unshared. That is OK. But I
> guess that after a typical defrag, the sharing ratio in metadata woudn't
> change much.

Metadata (tree blocks) in btrfs always gets unshared as soon as you
modify the tree.
But indeed, the ratio isn't that high.

> 
>>>> That's why knowledge in btrfs tech details can make a difference.
>>>> Sometimes you may find some ideas are brilliant and why btrfs is not
>>>> implementing it, but if you understand btrfs to some extent, you will
>>>> know the answer by yourself.
>>>
>>> Yes, it is true, but what you are posting so far are all 'red
>>> herring'-type arguments. It's just some irrelevant concerns, and you
>>> just got me explaining thinks like I would to a little baby. I don't
>>> know whether I stumbled on some rookie member of btrfs project, or you
>>> are just lazy and you don't want to think or consider my proposals.
>>
>> Go check my name in git log.
> 
> I didn't check yet. Ok, let's just try to communicate here, I'm dead
> serious.
> 
> I can't understand a defrag that substantially decreases free space. I
> mean, each such defrag is a lottery, because you might end up with
> practically unusable file system if the partition fills up.
> 
> CURRENT DEFRAG IS A LOTTERY!
> 
> How bad is that?
>

Now you see why btrfs defrag has a problem.

On one hand, guys like you don't want to unshare extents. I understand,
and it makes sense to some extent. It also used to be the default
behavior.

On the other hand, btrfs has to CoW extents to do defrag, and we have
extreme cases where we want to defrag shared extents even if it's going
to decrease free space.

And I have to admit, my memory made the discussion a little off-topic,
as I still remember that some older kernels don't touch shared extents
at all.

So here is what we could do (from easy to hard):
- Introduce an interface to allow defrag not to touch shared extents.
  It shouldn't be that difficult compared to the other work we are
  going to do.
  At least the user has a choice.

- Introduce different levels for defrag.
  Allow btrfs to do some calculation and apply a space usage policy to
  determine if it's a good idea to defrag some shared extents.
  E.g. in my extreme case, unsharing the extent would make it possible
  to defrag the other subvolume and free a huge amount of space.
  A compromise: let the user choose whether they want to sacrifice
  some space.

- Ultimate super-duper cross-subvolume defrag.
  Defrag could also automatically change all the referencers.
  That's why we call it ultimate super-duper, but as I already
  mentioned it's a big performance problem, and if RO subvolumes are
  involved it gets super tricky.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  0:00           ` Chris Murphy
@ 2019-09-10  0:51             ` Qu Wenruo
  0 siblings, 0 replies; 111+ messages in thread
From: Qu Wenruo @ 2019-09-10  0:51 UTC (permalink / raw)
  To: Chris Murphy; +Cc: webmaster, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 766 bytes --]



On 2019/9/10 上午8:00, Chris Murphy wrote:
> On Mon, Sep 9, 2019 at 5:44 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2019/9/10 上午12:38, webmaster@zedlx.com wrote:
>>> Yes, it is true, but what you are posting so far are all 'red
>>> herring'-type arguments. It's just some irrelevant concerns, and you
>>> just got me explaining thinks like I would to a little baby. I don't
>>> know whether I stumbled on some rookie member of btrfs project, or you
>>> are just lazy and you don't want to think or consider my proposals.
>>
>> Go check my name in git log.
>>
> 
> Hey Qu, how do I join this rookie club? :-D

By reading the funny code. :)

And try to write the funny code.
(Then you have double fun!)

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  0:48             ` Qu Wenruo
@ 2019-09-10  1:24               ` webmaster
  2019-09-10  1:48                 ` Qu Wenruo
  2019-09-11  0:26               ` webmaster
  1 sibling, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-10  1:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:

>>> Btrfs defrag works by creating new extents containing the old data.
>>>
>>> So if btrfs decides to defrag, no old extents will be used.
>>> It will all be new extents.
>>>
>>> That's why your proposal is freaking strange here.
>>
>> Ok, but: can the NEW extents still be shared?
>
> Can only be shared by reflink.
> Not automatically, so if btrfs decides to defrag, it will not be shared
> at all.
>
>> If you had an extent E88
>> shared by 4 files in different subvolumes, can it be copied to another
>> place and still be shared by the original 4 files?
>
> Not for current btrfs.
>
>> I guess that the
>> answer is YES. And, that's the only requirement for a good defrag
>> algorithm that doesn't shrink free space.
>
> We may go that direction.
>
> The biggest burden here is, btrfs needs to do expensive full-backref
> walk to determine how many files are referring to this extent.
> And then change them all to refer to the new extent.

YES! That! Exactly THAT. That is what needs to be done.

I mean, you just create a (perhaps associative) array which links an
extent (the array index contains the extent ID) to all the files that
reference that extent.

To initialize it, you do one single walk through the entire b-tree.

Then the data you require can be retrieved in an instant.
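
Here is a minimal sketch of that idea in Python (a toy in-memory model
with made-up item tuples, nothing resembling real btrfs structures):
one pass over all file extent items builds the map, and every later
lookup is a plain dictionary access.

from collections import defaultdict

# Toy stand-in for file extent items: (subvolume id, inode, extent id).
file_extent_items = [
    (257, 257, "X"),
    (258, 257, "X"),
    (258, 257, "Y"),
]

def build_extent_index(items):
    """Single pass over all items -> extent id mapped to its referencers."""
    index = defaultdict(list)
    for subvol, inode, extent in items:
        index[extent].append((subvol, inode))
    return index

index = build_extent_index(file_extent_items)
print(index["X"])   # [(257, 257), (258, 257)] -- both referencers in O(1)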

> It's feasible if the extent is not shared by many.
> E.g the extent only get shared by ~10 or ~50 subvolumes/files.
>
> But what will happen if it's shared by 1000 subvolumes? That would be a
> performance burden.
> And trust me, we have already experienced such disaster in qgroup,
> that's why we want to avoid such case.

Um, I don't quite get where this 'performance burden' is coming from.
If you mean that moving a single extent requires rewriting a lot of
b-trees, then perhaps it could be solved by moving extents in bigger
batches. So, for example, you move (create new) extents, but you do
that for 100 megabytes of extents at a time, and only then update the
b-trees. There would then be far fewer b-tree writes to disk.
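
A rough sketch of that batching idea (hypothetical Python pseudocode;
extent.length, relocate() and update_btrees() are stand-ins for work
the kernel would do, not real interfaces): relocations are accumulated
until roughly 100 MB has been copied, and only then is one metadata
update issued for the whole batch.

BATCH_BYTES = 100 * 1024 * 1024   # flush metadata updates every ~100 MB

def defrag_in_batches(extents, relocate, update_btrees):
    """Copy extents one by one, but rewrite the b-tree references per batch."""
    batch, batch_bytes = [], 0
    for extent in extents:
        new_location = relocate(extent)        # write the defragmented copy
        batch.append((extent, new_location))
        batch_bytes += extent.length
        if batch_bytes >= BATCH_BYTES:
            update_btrees(batch)               # one metadata pass per batch
            batch, batch_bytes = [], 0
    if batch:
        update_btrees(batch)                   # flush the final partial batch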

Also, if the defrag detects 1000 subvolumes, it can warn the user.

By the way, isn't the current recommendation to stay below 100
subvolumes? So if defrag can do 100 subvolumes, that is great. The
defrag doesn't need to do 1000. If there are 1000 subvolumes, then the
user should delete most of them before doing defrag.

> Another problem is, what if some of the subvolumes are read-only, should
> we touch it or not? (I guess not)

I guess YES. Except if the user overrides it with some switch.

> Then the defrag will be not so complete. Bad fragmented extents are
> still in RO subvols.

Let the user choose!

> So the devil is still in the detail, again and again.

Ok, let's flesh out some details.

>> I can't understand a defrag that substantially decreases free space. I
>> mean, each such defrag is a lottery, because you might end up with
>> practically unusable file system if the partition fills up.
>>
>> CURRENT DEFRAG IS A LOTTERY!
>>
>> How bad is that?
>>
>
> Now you see why btrfs defrag has problem.
>
> On one hand, guys like you don't want to unshare extents. I understand
> and it makes sense to some extents. And used to be the default behavior.
>
> On the other hand, btrfs has to CoW extents to do defrag, and we have
> extreme cases where we want to defrag shared extents even it's going to
> decrease free space.
>
> And I have to admit, my memory made the discussion a little off-topic,
> as I still remember some older kernel doesn't touch shared extents at all.
>
> So here what we could do is: (From easy to hard)
> - Introduce an interface to allow defrag not to touch shared extents
>   it shouldn't be that difficult compared to other work we are going
>   to do.
>   At least, user has their choice.

That defrag wouldn't accomplish much. You can call it defrag, but it is
more like nothing happens.

> - Introduce different levels for defrag
>   Allow btrfs to do some calculation and space usage policy to
>   determine if it's a good idea to defrag some shared extents.
>   E.g. my extreme case, unshare the extent would make it possible to
>   defrag the other subvolume to free a huge amount of space.
>   A compromise, let user to choose if they want to sacrifice some space.

Meh. You can always defrag one chosen subvolume perfectly, without  
unsharing any file extents.
So, since it can be done perfectly without unsharing, why unshare at all?

> - Ultimate super-duper cross subvolume defrag
>   Defrag could also automatically change all the referencers.
>   That's why we call it ultimate super duper, but as I already mentioned
>   it's a big performance problem, and if Ro subvolume is involved, it'll
>   go super tricky.

Yes, that is what's needed. I don't really see where the big problem  
is. I mean, it is just a defrag, like any other. Nothing special.
The usual defrag algorithm is somewhat complicated, but I don't see  
why this one is much worse.

OK, if RO subvolumes are tricky, then exclude them for the time being.
So later, after many years, maybe someone will add the code for this
tricky RO case.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  1:24               ` webmaster
@ 2019-09-10  1:48                 ` Qu Wenruo
  2019-09-10  3:32                   ` webmaster
  2019-09-10 23:14                   ` webmaster
  0 siblings, 2 replies; 111+ messages in thread
From: Qu Wenruo @ 2019-09-10  1:48 UTC (permalink / raw)
  To: webmaster; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 7482 bytes --]



On 2019/9/10 上午9:24, webmaster@zedlx.com wrote:
> 
> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
> 
>>>> Btrfs defrag works by creating new extents containing the old data.
>>>>
>>>> So if btrfs decides to defrag, no old extents will be used.
>>>> It will all be new extents.
>>>>
>>>> That's why your proposal is freaking strange here.
>>>
>>> Ok, but: can the NEW extents still be shared?
>>
>> Can only be shared by reflink.
>> Not automatically, so if btrfs decides to defrag, it will not be shared
>> at all.
>>
>>> If you had an extent E88
>>> shared by 4 files in different subvolumes, can it be copied to another
>>> place and still be shared by the original 4 files?
>>
>> Not for current btrfs.
>>
>>> I guess that the
>>> answer is YES. And, that's the only requirement for a good defrag
>>> algorithm that doesn't shrink free space.
>>
>> We may go that direction.
>>
>> The biggest burden here is, btrfs needs to do expensive full-backref
>> walk to determine how many files are referring to this extent.
>> And then change them all to refer to the new extent.
> 
> YES! That! Exactly THAT. That is what needs to be done.
> 
> I mean, you just create an (perhaps associative) array which links an
> extent (the array index contains the extent ID) to all the files that
> reference that extent.

You have fallen exactly into the pitfall of btrfs backref walks.

For btrfs, a backref walk is definitely not easy work.
btrfs uses hidden backrefs; that means that in most cases an extent
shared by 1000 snapshots may well have only a single ref recorded in
the extent tree (which shows the backrefs), for the initial subvolume.

For btrfs, you need to walk up the tree to find out how it's shared.

It has to be done like that; that's why we call it a backref-*walk*.

E.g
          A (subvol 257)     B (Subvol 258, snapshot of 257)
          |    \        /    |
          |        X         |
          |    /        \    |
          C                  D
         / \                / \
        E   F              G   H

In the extent tree, E is only referred to by subvol 257,
while C has two referencers, 257 and 258.

So in reality, you need to:
1) Do a tree search from subvol 257.
   You get a path, E -> C -> A.
2) Check each node to see if it's shared.
   E is only referred to by C, no extra referencer.
   C is referred to by two tree blocks, A and B.
   A is referred to by subvol 257.
   B is referred to by subvol 258.
   So E is shared by 257 and 258.

Now you see how things go mad: for each extent you must walk that way
to determine its real owners, not to mention that we can have at most
8 levels, and tree blocks at levels 0~7 can all be shared.

If it's shared by 1000 subvolumes, I hope you have a good day then.
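
To make the cost of that walk concrete, here is a toy Python model of
the picture above (in-memory dictionaries only, not real btrfs code):
resolving the owners of E means following every upward reference until
the subvolume roots are reached, and the work grows with the number of
sharers.

# Upward references for the example: child node -> parent tree blocks.
parents = {
    "E": ["C"], "F": ["C"], "G": ["D"], "H": ["D"],
    "C": ["A", "B"], "D": ["A", "B"],
    "A": [], "B": [],
}
roots = {"A": 257, "B": 258}    # root tree block -> owning subvolume

def owners(node):
    """Walk up from one node and collect every owning subvolume."""
    found, stack, seen = set(), [node], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if n in roots:
            found.add(roots[n])
        stack.extend(parents[n])
    return found

print(owners("E"))   # {257, 258} -- E is reachable from both subvolume roots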

> 
> To initialize it, you do one single walk through the entire b-tree.
> 
> Than the data you require can be retrieved in an instant.

In an instant? Think again after reading the backref walk explanation above.

> 
>> It's feasible if the extent is not shared by many.
>> E.g the extent only get shared by ~10 or ~50 subvolumes/files.
>>
>> But what will happen if it's shared by 1000 subvolumes? That would be a
>> performance burden.
>> And trust me, we have already experienced such disaster in qgroup,
>> that's why we want to avoid such case.
> 
> Um, I don't quite get where this 'performance burden' is comming from.

That's why I'd say you need to understand btrfs tech details.

> If you mean that moving a single extent requires rewriting a lot of
> b-trees, than perhaps it could be solved by moving extents in bigger
> batches. So, fo example, you move(create new) extents, but you do that
> for 100 megabytes of extents at the time, then you update the b-trees.
> So then, there would be much less b-tree writes to disk.
> 
> Also, if the defrag detects 1000 subvolumes, it can warn the user.
> 
> By the way, isn't the current recommendation to stay below 100
> subvolumes?. So if defrag can do 100 subvolumes, that is great. The
> defrag doesn't need to do 1000. If there are 1000 subvolumes, than the
> user should delete most of them before doing defrag.
> 
>> Another problem is, what if some of the subvolumes are read-only, should
>> we touch it or not? (I guess not)
> 
> I guess YES. Except if the user overrides it with some switch.
> 
>> Then the defrag will be not so complete. Bad fragmented extents are
>> still in RO subvols.
> 
> Let the user choose!
> 
>> So the devil is still in the detail, again and again.
> 
> Ok, let's flesh out some details.
> 
>>> I can't understand a defrag that substantially decreases free space. I
>>> mean, each such defrag is a lottery, because you might end up with
>>> practically unusable file system if the partition fills up.
>>>
>>> CURRENT DEFRAG IS A LOTTERY!
>>>
>>> How bad is that?
>>>
>>
>> Now you see why btrfs defrag has problem.
>>
>> On one hand, guys like you don't want to unshare extents. I understand
>> and it makes sense to some extents. And used to be the default behavior.
>>
>> On the other hand, btrfs has to CoW extents to do defrag, and we have
>> extreme cases where we want to defrag shared extents even it's going to
>> decrease free space.
>>
>> And I have to admit, my memory made the discussion a little off-topic,
>> as I still remember some older kernel doesn't touch shared extents at
>> all.
>>
>> So here what we could do is: (From easy to hard)
>> - Introduce an interface to allow defrag not to touch shared extents
>>   it shouldn't be that difficult compared to other work we are going
>>   to do.
>>   At least, user has their choice.
> 
> That defrag wouldn't acomplish much. You can call it defrag, but it is
> more like nothing happens.

If a subvolume is not shared by snapshots or reflinks at all, I'd say
that's exactly what the user wants.

> 
>> - Introduce different levels for defrag
>>   Allow btrfs to do some calculation and space usage policy to
>>   determine if it's a good idea to defrag some shared extents.
>>   E.g. my extreme case, unshare the extent would make it possible to
>>   defrag the other subvolume to free a huge amount of space.
>>   A compromise, let user to choose if they want to sacrifice some space.
> 
> Meh. You can always defrag one chosen subvolume perfectly, without
> unsharing any file extents.

If the subvolume is shared by another snapshot, you always need to face
the decision of whether to unshare.
It's unavoidable.

The only question is whether it's worth unsharing.

> So, since it can be done perfectly without unsharing, why unshare at all?

No, you can't.

Go check my initial "red-herring" case.

> 
>> - Ultimate super-duper cross subvolume defrag
>>   Defrag could also automatically change all the referencers.
>>   That's why we call it ultimate super duper, but as I already mentioned
>>   it's a big performance problem, and if Ro subvolume is involved, it'll
>>   go super tricky.
> 
> Yes, that is what's needed. I don't really see where the big problem is.
> I mean, it is just a defrag, like any other. Nothing special.
> The usual defrag algorithm is somewhat complicated, but I don't see why
> this one is much worse.
> 
> OK, if RO subvolumes are tricky, than exclude them for the time being.
> So later, after many years, maybe someone will add the code for this
> tricky RO case.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  1:48                 ` Qu Wenruo
@ 2019-09-10  3:32                   ` webmaster
  2019-09-10 14:14                     ` Nikolay Borisov
  2019-09-10 22:48                     ` webmaster
  2019-09-10 23:14                   ` webmaster
  1 sibling, 2 replies; 111+ messages in thread
From: webmaster @ 2019-09-10  3:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:

> On 2019/9/10 上午9:24, webmaster@zedlx.com wrote:
>>
>> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>
>>>>> Btrfs defrag works by creating new extents containing the old data.
>>>>>
>>>>> So if btrfs decides to defrag, no old extents will be used.
>>>>> It will all be new extents.
>>>>>
>>>>> That's why your proposal is freaking strange here.
>>>>
>>>> Ok, but: can the NEW extents still be shared?
>>>
>>> Can only be shared by reflink.
>>> Not automatically, so if btrfs decides to defrag, it will not be shared
>>> at all.
>>>
>>>> If you had an extent E88
>>>> shared by 4 files in different subvolumes, can it be copied to another
>>>> place and still be shared by the original 4 files?
>>>
>>> Not for current btrfs.
>>>
>>>> I guess that the
>>>> answer is YES. And, that's the only requirement for a good defrag
>>>> algorithm that doesn't shrink free space.
>>>
>>> We may go that direction.
>>>
>>> The biggest burden here is, btrfs needs to do expensive full-backref
>>> walk to determine how many files are referring to this extent.
>>> And then change them all to refer to the new extent.
>>
>> YES! That! Exactly THAT. That is what needs to be done.
>>
>> I mean, you just create an (perhaps associative) array which links an
>> extent (the array index contains the extent ID) to all the files that
>> reference that extent.
>
> You're exactly in the pitfall of btrfs backref walk.
>
> For btrfs, it's definitely not an easy work to do backref walk.
> btrfs uses hidden backref, that means, under most case, one extent
> shared by 1000 snapshots, in extent tree (shows the backref) it can
> completely be possible to only have one ref, for the initial subvolume.
>
> For btrfs, you need to walk up the tree to find how it's shared.
>
> It has to be done like that, that's why we call it backref-*walk*.
>
> E.g
>           A (subvol 257)     B (Subvol 258, snapshot of 257)
>           |    \        /    |
>           |        X         |
>           |    /        \    |
>           C                  D
>          / \                / \
>         E   F              G   H
>
> In extent tree, E is only referred by subvol 257.
> While C has two referencers, 257 and 258.
>
> So in reality, you need to:
> 1) Do a tree search from subvol 257
>    You got a path, E -> C -> A
> 2) Check each node to see if it's shared.
>    E is only referred by C, no extra referencer.
>    C is refered by two new tree blocks, A and B.
>    A is refered by subvol 257.
>    B is refered by subvol 258.
>    So E is shared by 257 and 258.
>
> Now, you see how things would go mad, for each extent you must go that
> way to determine the real owner of each extent, not to mention we can
> have at most 8 levels, tree blocks at level 0~7 can all be shared.
>
> If it's shared by 1000 subvolumes, hope you had a good day then.

Ok, let's do just this issue for the time being. One issue at a time.  
It will be easier.

The solution is to temporarily create a copy of the entire  
backref-tree in memory. To create this copy, you just do a preorder  
depth-first traversal following only forward references.

So this preorder depth-first traversal would visit the nodes in the  
following order:
A,C,E,F,D,G,H,B

Oh, it is not a tree, it is a DAG in that example of yours. OK,
preorder is possible on a DAG, too. But how did you get a DAG;
shouldn't it all be trees?

When you have the entire backref-tree (backref-DAG?) in memory, doing
a backref-walk is a piece of cake.

Of course, this in-memory backref tree has to be kept in sync with the
filesystem, that is, it has to be updated whenever there is a write to
disk. That's not so hard.
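
A minimal sketch of that in-memory structure, using the same toy model
of Qu's example above (keeping it in sync with a live filesystem is
exactly the hard part being glossed over here): one preorder
depth-first pass over the forward references builds the reverse
child-to-parents map, after which any backref query is just a
dictionary lookup.

from collections import defaultdict

# Forward references from the example above: node -> children.
children = {
    "A": ["C", "D"], "B": ["C", "D"],
    "C": ["E", "F"], "D": ["G", "H"],
    "E": [], "F": [], "G": [], "H": [],
}

def build_backrefs(roots):
    """Preorder depth-first traversal recording each forward edge in reverse."""
    backrefs = defaultdict(list)
    visited = set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for child in children[node]:
            backrefs[child].append(node)   # record the back reference
            visit(child)                   # descend only on the first visit

    for root in roots:
        visit(root)
    return backrefs

backrefs = build_backrefs(["A", "B"])      # visit order: A,C,E,F,D,G,H,B
print(backrefs["C"])    # ['A', 'B'] -- C's referencers, now a single lookup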




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 11:25   ` zedlryqc
  2019-09-09 12:18     ` Qu Wenruo
@ 2019-09-10 11:12     ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-10 11:12 UTC (permalink / raw)
  To: zedlryqc, linux-btrfs

On 2019-09-09 07:25, zedlryqc@server53.web-hosting.com wrote:
> 
> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>> 1) Full online backup (or copy, whatever you want to call it)
>>> btrfs backup <filesystem name> <partition name> [-f]
>>> - backups a btrfs filesystem given by <filesystem name> to a partition
>>> <partition name> (with all subvolumes).
>>
>> Why not just btrfs send?
>>
>> Or you want to keep the whole subvolume structures/layout?
> 
> Yes, I want to keep the whole subvolume structures/layout. I want to 
> keep everything. Usually, when I want to backup a partition, I want to 
> keep everything, and I suppose most other people have a similar idea.\
This: https://github.com/Ferroin/btrfs-subv-backup may be of interest to 
you.  It's been a while since I updated it, but it still works perfectly 
fine.

It won't let you do an exact block-level backup, and it won't
preserve reflinks, but it will let you store subvolume info in an
otherwise 'normal' backup through tools like Borg, Bacula, or Amanda.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  3:32                   ` webmaster
@ 2019-09-10 14:14                     ` Nikolay Borisov
  2019-09-10 22:35                       ` webmaster
  2019-09-10 22:48                     ` webmaster
  1 sibling, 1 reply; 111+ messages in thread
From: Nikolay Borisov @ 2019-09-10 14:14 UTC (permalink / raw)
  To: webmaster; +Cc: linux-btrfs



On 10.09.19 г. 6:32 ч., webmaster@zedlx.com wrote:
> 
> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
> 
>> On 2019/9/10 上午9:24, webmaster@zedlx.com wrote:
>>>
>>> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>
>>>>>> Btrfs defrag works by creating new extents containing the old data.
>>>>>>
>>>>>> So if btrfs decides to defrag, no old extents will be used.
>>>>>> It will all be new extents.
>>>>>>
>>>>>> That's why your proposal is freaking strange here.
>>>>>
>>>>> Ok, but: can the NEW extents still be shared?
>>>>
>>>> Can only be shared by reflink.
>>>> Not automatically, so if btrfs decides to defrag, it will not be shared
>>>> at all.
>>>>
>>>>> If you had an extent E88
>>>>> shared by 4 files in different subvolumes, can it be copied to another
>>>>> place and still be shared by the original 4 files?
>>>>
>>>> Not for current btrfs.
>>>>
>>>>> I guess that the
>>>>> answer is YES. And, that's the only requirement for a good defrag
>>>>> algorithm that doesn't shrink free space.
>>>>
>>>> We may go that direction.
>>>>
>>>> The biggest burden here is, btrfs needs to do expensive full-backref
>>>> walk to determine how many files are referring to this extent.
>>>> And then change them all to refer to the new extent.
>>>
>>> YES! That! Exactly THAT. That is what needs to be done.
>>>
>>> I mean, you just create an (perhaps associative) array which links an
>>> extent (the array index contains the extent ID) to all the files that
>>> reference that extent.
>>
>> You're exactly in the pitfall of btrfs backref walk.
>>
>> For btrfs, it's definitely not an easy work to do backref walk.
>> btrfs uses hidden backref, that means, under most case, one extent
>> shared by 1000 snapshots, in extent tree (shows the backref) it can
>> completely be possible to only have one ref, for the initial subvolume.
>>
>> For btrfs, you need to walk up the tree to find how it's shared.
>>
>> It has to be done like that, that's why we call it backref-*walk*.
>>
>> E.g
>>           A (subvol 257)     B (Subvol 258, snapshot of 257)
>>           |    \        /    |
>>           |        X         |
>>           |    /        \    |
>>           C                  D
>>          / \                / \
>>         E   F              G   H
>>
>> In extent tree, E is only referred by subvol 257.
>> While C has two referencers, 257 and 258.
>>
>> So in reality, you need to:
>> 1) Do a tree search from subvol 257
>>    You got a path, E -> C -> A
>> 2) Check each node to see if it's shared.
>>    E is only referred by C, no extra referencer.
>>    C is refered by two new tree blocks, A and B.
>>    A is refered by subvol 257.
>>    B is refered by subvol 258.
>>    So E is shared by 257 and 258.
>>
>> Now, you see how things would go mad, for each extent you must go that
>> way to determine the real owner of each extent, not to mention we can
>> have at most 8 levels, tree blocks at level 0~7 can all be shared.
>>
>> If it's shared by 1000 subvolumes, hope you had a good day then.
> 
> Ok, let's do just this issue for the time being. One issue at a time. It
> will be easier.
> 
> The solution is to temporarily create a copy of the entire backref-tree
> in memory. To create this copy, you just do a preorder depth-first
> traversal following only forward references.
> 
> So this preorder depth-first traversal would visit the nodes in the
> following order:
> A,C,E,F,D,G,H,B
> 
> Oh, it is not a tree, it is a DAG in that example of yours. OK, preorder
> is possible on DAG, too. But how did you get a DAG, shouldn't it be all
> trees?
> 
> When you have the entire backref-tree (backref-DAG?) in memory, doing a
> backref-walk is a piece of cake.
> 
> Of course, this in-memory backref tree has to be kept in sync with the
> filesystem, that is it has to be updated whenever there is a write to
> disk. That's not so hard.

Great, now that you have devised a solution and have plenty of
experience writing code, why not try to contribute to btrfs?


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 17:11         ` webmaster
@ 2019-09-10 17:39           ` Andrei Borzenkov
  2019-09-10 22:41             ` webmaster
  0 siblings, 1 reply; 111+ messages in thread
From: Andrei Borzenkov @ 2019-09-10 17:39 UTC (permalink / raw)
  To: webmaster, linux-btrfs

09.09.2019 20:11, webmaster@zedlx.com пишет:
...
>>
>> Forgot to mention this part.
>>
>> If your primary objective is to migrate your data to another device
>> online (mounted, without unmount any of the fs).
> 
> This is not the primary objective. The primary objective is to produce a
> full, online, easy-to-use, robust backup. But let's say we need to do
> migration...
>>
>> Then I could say, you can still add a new device, then remove the old
>> device to do that.
> 
> If the source filesystem already uses RAID1, then, yes, you could do it,

You could do it with any profile.

> but it would be too slow, it would need a lot of user intervention, so
> many commands typed, so many ways to do it wrong, to make a mistake.
> 

It requires exactly two commands: one to add the new device (btrfs
device add), another to remove the old one (btrfs device remove).

> Too cumbersome. Too wastefull of time and resources.
> 

Do you mean your imaginary full backup will not read the full
filesystem? Otherwise, how can it take less time and fewer resources?

>> That would be even more efficient than LVM (not thin provisioned one),
>> as we only move used space.
> 
> In fact, you can do this kind of full-online-backup with the help of
> mdadm RAID, or some other RAID solution. It can already be done, no need
> to add 'btrfs backup'.
> 
> But, again, to cumbersome, too inflexible, too many problems, and, the
> user would have to setup a downgraded mdadm RAID in front and run with a
> degraded mdadm RAID all the time (since btrfs RAID would be actually
> protecting the data).
> 
>> If your objective is to create a full copy as backup, then I'd say my
>> new patchset of btrfs-image data dump may be your best choice.
> 
> It should be mountable. It should be performed online. Never heard of
> btrfs-image, i need the docs to see whether this btrfs-image is good
> enough.
> 
>> The only down side is, you need to at least mount the source fs to RO
>> mode.
> 
> No. That's not really an online backup. Not good enough.
> 
>> The true on-line backup is not that easy, especially any write can screw
>> up your backup process, so it must be done unmounted.
> 
> Nope, I disagree.
> 
> First, there is the RAID1-alike solution, which is easy to perform (just
> send all new writes to both source and destination). It's the same thing
> that mdadm RAID1 would do (like I mentioned a few paragraphs above).
> But, this solution may have a performance concern, when the destination
> drive is too slow.
> 
> Fortunately, with btrfs, an online backup is easier than usual. To
> produce a frozen snapshot of the entire filesystem, just create a
> read-only snapshot of every subvolume (this is not 100% consistent, I
> know, but it is good enough).
> 
> But I'm just repeating myself, I already wrote this in the first email.
> 
> So, in conclusion I disagree that true on-line backup is not easy.
> 
>> Even btrfs send handles this by forcing the source subvolume to be RO,
>> so I can't find an easy solution to address that.
> 
> This is a digression, but I would say that you first make a temporary RO
> snapshot of the source subvolume, then use 'btrfs send' on the temporary
> snapshot, then delete the temporary snapshot.
> 
> Oh, my.
> 
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-09 19:26         ` webmaster
@ 2019-09-10 19:22           ` Austin S. Hemmelgarn
  2019-09-10 23:32             ` webmaster
  2019-09-10 23:58             ` webmaster
  0 siblings, 2 replies; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-10 19:22 UTC (permalink / raw)
  To: webmaster, linux-btrfs

On 2019-09-09 15:26, webmaster@zedlx.com wrote:
> This post is a reply to Remi Gauvin's post, but the email got lost so I 
> can't reply to him directly.
> 
> Remi Gauvin wrote on 2019-09-09 17:24 :
>>
>> On 2019-09-09 11:29 a.m., Graham Cobb wrote:
>>
>>>  and does anyone really care about
>>> defrag any more?).
>>>
>>
>>
>> Err, yes, yes absolutely.
>>
>> I don't have any issues with the current btrfs defrag implementations, but
>> it's *vital* for btrfs. (which works just as the OP requested, as far as
>> I can tell, recursively for a subvolume)
>>
>> Just booting Windows on a BTRFS virtual image, for example, will create
>> almost 20,000 file fragments.  Even on SSD's, you get into problems
>> trying to work with files that are over 200,000 fragments.
>>
>> Another huge problem is rsync --inplace, which is a perfect backup
>> solution to take advantage of BTRFS snapshots, but fragments large
>> files into tiny pieces (and subsequently creates files that are very
>> slow to read)... for some reason, autodefrag doesn't catch that one
>> either.
>>
>> But the wiki could do a better job of trying to explain that the snapshot
>> duplication of defrag only affects the fragmented portions.  As I
>> understand, it's really only a problem when using defrag to change
>> compression.
> 
> 
> Ok, a few things.
> 
> First, my defrag suggestion doesn't EVER unshare extents. The defrag 
> should never unshare, not even a single extent. Why? Because unsharing 
> violates the expectation that defrag will not decrease free space.
No, it should by default not unshare, but still allow the possibility of 
unsharing extents.  Sometimes completely removing all fragmentation is 
more important than space usage.
> 
> Defrag may break up extents. Defrag may fuse extents. But it shouldn't 
> ever unshare extents.
Actually, spitting or merging extents will unshare them in a large 
majority of cases.
> 
> Therefore, I doubt that the current defrag does "just as the OP 
> requested". Nonsense. The current implementation does the unsharing all 
> the time.
> 
> Second, I have never used btrfs defrag in my life, despite managing at least 
> 10 btrfs filesystems. I can't. Because all my btrfs volumes have a lot of 
> subvolumes, so I'm afraid that defrag will unshare much more than I can 
> tolerate. In my subvolumes, over 90% of data is shared. If all 
> subvolumes were to be unshared, the disk usage would likely increase 
> tenfold, and that I cannot afford.
> 
> I agree that btrfs defrag is vital. But currently, it's unusable for 
> many use cases.
> 
> Also, I don't quite understand what the poster means by "the snapshot 
> duplication of defrag only affects the fragmented portions". Possibly it 
> means approximately: if a file wasn't modified in the current (latest) 
> subvolume, it doesn't need to be unshared. But, that would still unshare 
> all the log files, for example, even all files that have been appended, 
> etc... that's quite bad. Even if just one byte was appended to a log 
> file, then defrag will unshare the entire file (I suppose).
> 
What it means is that defrag will only ever touch a file if that file 
has extents that require defragmentation, and will then only touch 
extents that are smaller than the target extent size (32M by default, 
configurable at run-time with the `-t` option for the defrag command) 
and possibly those directly adjacent to such extents (because it might 
merge the small extents into larger neighbors, which will in turn 
rewrite the larger extent too).

This, in turn, leads to a couple of interesting behaviors:

* If you have a subvolume with snapshots, it may or may not break 
reflinks between that subvolume and its snapshots, but will not break 
any of the reflinks between the snapshots themselves.
* When dealing with append-only files that are significantly larger than 
the target extent size and are defragmented regularly, only extents 
near the end of the file are likely to be unshared by the operation.
* If you fully defragment a subvolume, then snapshot it, then defrag it 
again, the second defrag will not unshare anything unless you were 
writing to the subvolume or snapshot while the second defrag was running.
* There's almost no net benefit to not defragmenting when dealing with 
very large files that mostly see internal rewrites (VM disk images, 
large databases, etc) because every internal rewrite will implicitly 
unshare extents anyway.
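
As a rough sketch of the selection rule described above (a toy Python
model, not the actual kernel code; only the 32M default and the `-t`
option are real):

# Toy model of which extents a defrag pass will touch: extents smaller
# than the target size, plus the directly adjacent neighbors they might
# be merged into. Sizes are in bytes.

TARGET_EXTENT_SIZE = 32 * 1024 * 1024  # the 32M default, overridable with -t

def extents_to_rewrite(extent_sizes, target=TARGET_EXTENT_SIZE):
    touched = set()
    for i, size in enumerate(extent_sizes):
        if size >= target:
            continue
        touched.add(i)
        # a small extent may get merged into a neighbor, which rewrites
        # (and thus potentially unshares) that neighbor as well
        for j in (i - 1, i + 1):
            if 0 <= j < len(extent_sizes):
                touched.add(j)
    return sorted(touched)

# A file with one small extent in the middle: only that extent and its
# neighbors are candidates, the rest of the file is left alone.
print(extents_to_rewrite([64 << 20, 128 << 10, 64 << 20, 64 << 20]))  # [0, 1, 2]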


* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10 14:14                     ` Nikolay Borisov
@ 2019-09-10 22:35                       ` webmaster
  2019-09-11  6:40                         ` Nikolay Borisov
  0 siblings, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-10 22:35 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs


Quoting Nikolay Borisov <n.borisov.lkml@gmail.com>:


>>>
>>> You're exactly in the pitfall of btrfs backref walk.
>>>
>>> For btrfs, it's definitely not an easy work to do backref walk.
>>> btrfs uses hidden backref, that means, under most case, one extent
>>> shared by 1000 snapshots, in extent tree (shows the backref) it can
>>> completely be possible to only have one ref, for the initial subvolume.
>>>
>>> For btrfs, you need to walk up the tree to find how it's shared.
>>>
>>> It has to be done like that, that's why we call it backref-*walk*.
>>>
>>> E.g
>>>           A (subvol 257)     B (Subvol 258, snapshot of 257)
>>>           |    \        /    |
>>>           |        X         |
>>>           |    /        \    |
>>>           C                  D
>>>          / \                / \
>>>         E   F              G   H
>>>
>>> In extent tree, E is only referred by subvol 257.
>>> While C has two referencers, 257 and 258.
>>>
>>> So in reality, you need to:
>>> 1) Do a tree search from subvol 257
>>>    You got a path, E -> C -> A
>>> 2) Check each node to see if it's shared.
>>>    E is only referred by C, no extra referencer.
>>>    C is refered by two new tree blocks, A and B.
>>>    A is refered by subvol 257.
>>>    B is refered by subvol 258.
>>>    So E is shared by 257 and 258.
>>>
>>> Now, you see how things would go mad, for each extent you must go that
>>> way to determine the real owner of each extent, not to mention we can
>>> have at most 8 levels, tree blocks at level 0~7 can all be shared.
>>>
>>> If it's shared by 1000 subvolumes, hope you had a good day then.
>>
>> Ok, let's do just this issue for the time being. One issue at a time. It
>> will be easier.
>>
>> The solution is to temporarily create a copy of the entire backref-tree
>> in memory. To create this copy, you just do a preorder depth-first
>> traversal following only forward references.
>>
>> So this preorder depth-first traversal would visit the nodes in the
>> following order:
>> A,C,E,F,D,G,H,B
>>
>> Oh, it is not a tree, it is a DAG in that example of yours. OK, preorder
>> is possible on DAG, too. But how did you get a DAG, shouldn't it be all
>> trees?
>>
>> When you have the entire backref-tree (backref-DAG?) in memory, doing a
>> backref-walk is a piece of cake.
>>
>> Of course, this in-memory backref tree has to be kept in sync with the
>> filesystem, that is it has to be updated whenever there is a write to
>> disk. That's not so hard.
>
> Great, now that you have devised a solution and have plenty of
> experience writing code why not try and contribute to btrfs?

First, that is exactly what I'm doing. I'm contributing to the discussion  
on the most needed features of btrfs. I'm helping you to get on the right  
track and waste less time on unimportant stuff.

You might appreciate my help, or not, but I am trying to help.

What you probably wanted to say is that you would like me to contribute  
by writing code, pro bono. Unfortunately, I work for money, as do  
99% of the population. Why not contribute for free? For the same  
reason why the rest of the population doesn't work for free. And, I'm  
not going from door to door and bugging everyone with "why don't you  
work for free", "why don't you help this noble cause..." blah. Makes  
no sense to me.




* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10 17:39           ` Andrei Borzenkov
@ 2019-09-10 22:41             ` webmaster
  0 siblings, 0 replies; 111+ messages in thread
From: webmaster @ 2019-09-10 22:41 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: linux-btrfs


Quoting Andrei Borzenkov <arvidjaar@gmail.com>:

> 09.09.2019 20:11, webmaster@zedlx.com wrote:
> ...
>>>
>>> Forgot to mention this part.
>>>
>>> If your primary objective is to migrate your data to another device
>>> online (mounted, without unmount any of the fs).
>>
>> This is not the primary objective. The primary objective is to produce a
>> full, online, easy-to-use, robust backup. But let's say we need to do
>> migration...
>>>
>>> Then I could say, you can still add a new device, then remove the old
>>> device to do that.
>>
>> If the source filesystem already uses RAID1, then, yes, you could do it,
>
> You could do it with any profile.
>
>> but it would be too slow, it would need a lot of user intervention, so
>> many commands typed, so many ways to do it wrong, to make a mistake.
>>
>
> It requires exactly two commands - one to add new device, another to
> remove old device.
>

Yes, sorry I got a bit confused.

The point is that migration is not the objective. The objective could,  
possibly, be restated as: make another copy of the filesystem.  
Migration is something different.

>> Too cumbersome. Too wasteful of time and resources.
>>
>
> Do you mean your imaginary full backup will not read full filesystem?
> Otherwise how can it take less time and resources?

My imaginary full backup needs to read and write the entire filesystem.
It can't take less than that.





* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  3:32                   ` webmaster
  2019-09-10 14:14                     ` Nikolay Borisov
@ 2019-09-10 22:48                     ` webmaster
  1 sibling, 0 replies; 111+ messages in thread
From: webmaster @ 2019-09-10 22:48 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


Quoting webmaster@zedlx.com:

> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>
>> On 2019/9/10 9:24 AM, webmaster@zedlx.com wrote:
>>>
>>> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>
>>>>>> Btrfs defrag works by creating new extents containing the old data.
>>>>>>
>>>>>> So if btrfs decides to defrag, no old extents will be used.
>>>>>> It will all be new extents.
>>>>>>
>>>>>> That's why your proposal is freaking strange here.
>>>>>
>>>>> Ok, but: can the NEW extents still be shared?
>>>>
>>>> Can only be shared by reflink.
>>>> Not automatically, so if btrfs decides to defrag, it will not be shared
>>>> at all.
>>>>
>>>>> If you had an extent E88
>>>>> shared by 4 files in different subvolumes, can it be copied to another
>>>>> place and still be shared by the original 4 files?
>>>>
>>>> Not for current btrfs.
>>>>
>>>>> I guess that the
>>>>> answer is YES. And, that's the only requirement for a good defrag
>>>>> algorithm that doesn't shrink free space.
>>>>
>>>> We may go that direction.
>>>>
>>>> The biggest burden here is, btrfs needs to do expensive full-backref
>>>> walk to determine how many files are referring to this extent.
>>>> And then change them all to refer to the new extent.
>>>
>>> YES! That! Exactly THAT. That is what needs to be done.
>>>
>>> I mean, you just create an (perhaps associative) array which links an
>>> extent (the array index contains the extent ID) to all the files that
>>> reference that extent.
>>
>> You're exactly in the pitfall of btrfs backref walk.
>>
>> For btrfs, it's definitely not an easy work to do backref walk.
>> btrfs uses hidden backref, that means, under most case, one extent
>> shared by 1000 snapshots, in extent tree (shows the backref) it can
>> completely be possible to only have one ref, for the initial subvolume.
>>
>> For btrfs, you need to walk up the tree to find how it's shared.
>>
>> It has to be done like that, that's why we call it backref-*walk*.
>>
>> E.g
>>          A (subvol 257)     B (Subvol 258, snapshot of 257)
>>          |    \        /    |
>>          |        X         |
>>          |    /        \    |
>>          C                  D
>>         / \                / \
>>        E   F              G   H
>>
>> In extent tree, E is only referred by subvol 257.
>> While C has two referencers, 257 and 258.
>>
>> So in reality, you need to:
>> 1) Do a tree search from subvol 257
>>   You got a path, E -> C -> A
>> 2) Check each node to see if it's shared.
>>   E is only referred by C, no extra referencer.
>>   C is refered by two new tree blocks, A and B.
>>   A is refered by subvol 257.
>>   B is refered by subvol 258.
>>   So E is shared by 257 and 258.
>>
>> Now, you see how things would go mad, for each extent you must go that
>> way to determine the real owner of each extent, not to mention we can
>> have at most 8 levels, tree blocks at level 0~7 can all be shared.
>>
>> If it's shared by 1000 subvolumes, hope you had a good day then.
>
> Ok, let's do just this issue for the time being. One issue at a  
> time. It will be easier.
>
> The solution is to temporarily create a copy of the entire  
> backref-tree in memory. To create this copy, you just do a preorder  
> depth-first traversal following only forward references.
>
> So this preorder depth-first traversal would visit the nodes in the  
> following order:
> A,C,E,F,D,G,H,B
>
> Oh, it is not a tree, it is a DAG in that example of yours. OK,  
> preorder is possible on DAG, too. But how did you get a DAG,  
> shouldn't it be all trees?
>
> When you have the entire backref-tree (backref-DAG?) in memory,  
> doing a backref-walk is a piece of cake.
>
> Of course, this in-memory backref tree has to be kept in sync with  
> the filesystem, that is it has to be updated whenever there is a  
> write to disk. That's not so hard.

Oh, I get why you have a DAG there. Because there are multiple trees,  
one for each subvolume. Each subvolume is a tree, but when you combine  
all subvolumes it is not a tree anymore.

So, I guess this solves this big performance-problem.

I would make this backref-tree an associative array. So, for your  
example, it would contain:

backref['A'] = {subvol 257}
backref['C'] = {'A','B'}
backref['E'] = {'C'}
backref['F'] = {'C'}
backref['D'] = {'A','B'}
backref['G'] = {'D'}
backref['H'] = {'D'}
backref['B'] = {subvol 258}
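
For illustration, here is a minimal sketch (plain Python, obviously not
btrfs code) of how such a map could be built by a single preorder walk
over the forward references; the node names are the ones from the example:

forward = {                       # forward references from Qu's example
    'A': ['C', 'D'],              # root of subvol 257
    'B': ['C', 'D'],              # root of subvol 258 (snapshot of 257)
    'C': ['E', 'F'],
    'D': ['G', 'H'],
    'E': [], 'F': [], 'G': [], 'H': [],
}
roots = {'A': 'subvol 257', 'B': 'subvol 258'}

def build_backrefs(forward, roots):
    backref = {node: set() for node in forward}
    for root, owner in roots.items():
        backref[root].add(owner)          # a root is owned by its subvolume
        stack, seen = [root], set()
        while stack:                      # preorder depth-first traversal
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            for child in forward[node]:
                backref[child].add(node)  # record the reverse edge
                stack.append(child)
    return backref

# build_backrefs(forward, roots)['C'] == {'A', 'B'}, ['E'] == {'C'}, etc.,
# matching the listing above.

Keeping such a map in sync would of course mean updating it on every
metadata write; that is the part I'm hand-waving here.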



* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  1:48                 ` Qu Wenruo
  2019-09-10  3:32                   ` webmaster
@ 2019-09-10 23:14                   ` webmaster
  1 sibling, 0 replies; 111+ messages in thread
From: webmaster @ 2019-09-10 23:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:

>>> So here what we could do is: (From easy to hard)
>>> - Introduce an interface to allow defrag not to touch shared extents
>>>   it shouldn't be that difficult compared to other work we are going
>>>   to do.
>>>   At least, user has their choice.
>>
>> That defrag wouldn't accomplish much. You can call it defrag, but it is
>> more like nothing happens.
>
> If one subvolume is not shared by snapshots or reflinks at all, I'd say
> that's exactly what user want.

If one subvolume is not shared by snapshots, the super-duper defrag  
would produce the same result concerning that subvolume.

Therefore, it is a waste of time to consider this case separately and  
to write code that covers just this case.

>>> - Introduce different levels for defrag
>>>   Allow btrfs to do some calculation and space usage policy to
>>>   determine if it's a good idea to defrag some shared extents.
>>>   E.g. my extreme case, unshare the extent would make it possible to
>>>   defrag the other subvolume to free a huge amount of space.
>>>   A compromise, let user to choose if they want to sacrifice some space.
>>
>> Meh. You can always defrag one chosen subvolume perfectly, without
>> unsharing any file extents.
>
> If the subvolume is shared by another snapshot, you always need to face
> the decision whether to unshare.
> It's unavoidable.

In my opinion, unsharing is a very bad thing to do. If the user orders  
it, then OK, but I think that it is rarely required.

Unsharing can be done manually by just copying the data to another  
place (partition). So, if someone really wants to unshare, he can  
always easily do it.

When you unshare, it is hard to go back. Unsharing is a one-way road.  
When you unshare, you lose free space. Therefore, the defrag should  
not unshare.

In my view, the only real decision that needs to be left to the user  
is: what to defrag?

In terms of full or partial defrag:
* Everything
     - rarely; waste of time and resources, and it wears out SSDs
     - perhaps this shouldn't be allowed at all
* 2% of most fragmented files (2% of total space used, by size in bytes)
     - good idea for daily or weekly defrag
     - good default
* Let the user choose between 0.01% and 10%  (by size in bytes)
     - the best

Options by scope:
   - One file (when necessary)
   - One subvolume (when necessary)
   - A list of subvolumes (with priority from first to last; the first  
one on the list would be defragmented best)
   - All subvolumes
   - All subvolumes, with one exclusion list, and one priority list
   - option to include or exclude RO subvolumes - as you said, this is  
probably the hardest and implementation should be postponed
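
To make that option space concrete, here is a purely hypothetical sketch  
(Python, invented names; no such interface exists in btrfs today):

from dataclasses import dataclass, field
from typing import List

@dataclass
class DefragPolicy:                       # hypothetical, for discussion only
    # fraction of used bytes to defragment, most-fragmented files first;
    # 0.02 (2%) as the default, user-selectable between 0.0001 and 0.10
    fraction_by_size: float = 0.02
    files: List[str] = field(default_factory=list)        # single files, when necessary
    subvolumes: List[str] = field(default_factory=list)   # priority order, first defragmented best
    all_subvolumes: bool = False
    excluded_subvolumes: List[str] = field(default_factory=list)
    include_readonly: bool = False        # hardest part, postpone implementation

# e.g. DefragPolicy(fraction_by_size=0.05, subvolumes=['/@', '/@home'])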

Therefore, making a super-duper defrag which can defrag one file  
(without unsharing!!!) is a good starting point, instead of wasting  
time on your proposal "Introduce different levels for defrag".

>> So, since it can be done perfectly without unsharing, why unshare at all?
>
> No, you can't.
>
> Go check my initial "red-herring" case.

I might check it, but I think that you can't be right. You are  
thinking too low-level. If you can split extents and fuse extents and  
create new extents that are shared by multiple files, then what you  
are saying is simply not possible. The operations I listed are  
sufficient to produce a perfect full defrag. Always.





* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10 19:22           ` Austin S. Hemmelgarn
@ 2019-09-10 23:32             ` webmaster
  2019-09-11 12:02               ` Austin S. Hemmelgarn
  2019-09-10 23:58             ` webmaster
  1 sibling, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-10 23:32 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:


>> Defrag may break up extents. Defrag may fuse extents. But it  
>> shouln't ever unshare extents.

> Actually, spitting or merging extents will unshare them in a large  
> majority of cases.

Ok, this point seems to be repeated over and over without any proof,  
and it is illogical to me.

About merging extents: a defrag should merge extents ONLY when both  
extents are shared by the same files (and when those extents are  
neighbours in both files). In other words, defrag should always merge  
without unsharing. Let's call that operation "fusing extents", so that  
there are no more misunderstandings.

=== I CHALLENGE you and anyone else on this mailing list: ===

  - Show me an example where splitting an extent requires unsharing,  
and this split is needed to defrag.

Make it clear, write it yourself, I don't want any machine-made outputs.





* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10 19:22           ` Austin S. Hemmelgarn
  2019-09-10 23:32             ` webmaster
@ 2019-09-10 23:58             ` webmaster
  1 sibling, 0 replies; 111+ messages in thread
From: webmaster @ 2019-09-10 23:58 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

>> Also, I don't quite understand what the poster means by "the  
>> snapshot duplication of defrag only affects the fragmented  
>> portions". Possibly it means approximately: if a file wasn't  
>> modified in the current (latest) subvolume, it doesn't need to be  
>> unshared. But, that would still unshare all the log files, for  
>> example, even all files that have been appended, etc... that's  
>> quite bad. Even if just one byte was appended to a log file, then  
>> defrag will unshare the entire file (I suppose).
>>
> What it means is that defrag will only ever touch a file if that  
> file has extents that require defragmentation, and will then only  
> touch extents that are smaller than the target extent size (32M by  
> default, configurable at run-time with the `-t` option for the  
> defrag command) and possibly those directly adjacent to such extents  
> (because it might merge the small extents into larger neighbors,  
> which will in turn rewrite the larger extent too).

Umm... it seems to me that it's quite a poor defrag you got there.

> * There's almost no net benefit to not defragmenting when dealing  
> with very large files that mostly see internal rewrites (VM disk  
> images, large databases, etc) because every internal rewrite will  
> implicitly unshare extents anyway.

Ok, so if you have a database, and then you snapshot its subvolume,  
you might be in trouble because of all the in-place writes that  
databases do, right?

It would almost be better if you could, manually, order the database  
file to be unshared and defragmented. So, that would be the use-case  
for defrag-unsharing. Interesting. Ok, I would agree with that. So,  
there needs to be the operation called defrag-unshare, but that has  
nothing to do with the real defrag.

I mean, this defrag-unsharing is just a glorified copy operation, but  
there are a few twists, because it must be consistent, as opposed to  
online copy, which would fail the consistency criteria.

But, you and other developers here seem to be confusing this  
defrag-unshare with the real defrag. I bet you haven't even considered  
what it means to "defrag without unsharing" in terms of: what the final  
result of such defrag should be, when it is done perfectly.





* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10  0:48             ` Qu Wenruo
  2019-09-10  1:24               ` webmaster
@ 2019-09-11  0:26               ` webmaster
  2019-09-11  0:36                 ` webmaster
  2019-09-11  1:00                 ` webmaster
  1 sibling, 2 replies; 111+ messages in thread
From: webmaster @ 2019-09-11  0:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:


> - Introduce different levels for defrag
>   Allow btrfs to do some calculation and space usage policy to
>   determine if it's a good idea to defrag some shared extents.
>   E.g. my extreme case, unshare the extent would make it possible to
>   defrag the other subvolume to free a huge amount of space.
>   A compromise, let user to choose if they want to sacrifice some space.

Ok, I noticed a few things coming up frequently in this discussion, so  
I think we should clear up the meaning of a few words before continuing,  
because the same words are used with two different meanings.

First, the term "defrag" can have 2 different meanings in filesystems  
with shared extents:

A) "plain defrag" or "unsharing-defrag"
   a file is unshare-defragmented if ALL of the following are met:
    1) all extents are written on disk in neighbouring sectors in the  
ascending order
    2) none of its extents are shared
    3) it doesn't have too many small extents

B) "sharing-defrag"
    a file is share-defragmented if ALL of the following are met:
    1) all extents are written on disk in neighbouring sectors in the  
ascending order
    2) all pairs of *adjacent* extents meet ONE of the following criteria
       2.1) both extents are sufficiently large
       2.2) the two extents have mismatching sharing groups (they are  
shared by different sets of files)

So, it might be, in some cases, a good idea to "plain defrag" some  
files like databases. But, this is a completely separate concern and  
a completely different feature. So "plain defrag" is a very different  
thing from "sharing-defrag".

Why there needs to be a "sharing defrag"
   - it is a defrag operation that can be run without concern
   - it can be run every day
   - eventually, it needs to be run to prevent degradation of performance
   - all other filesystems have this kind of defrag
   - the user suffers no permanent 'penalties' (like loss of free space)

The "unsharing-defrag" is something completely different, another  
feature for another discussion. I don't want to discuss it, because  
everything will just get confusing.

So, please, let's keep "unsharing-defrag" out of this discussion,  
because it has nothing to do with the thing I'm talking about.

So the "sharing defrag" is what I mean when I say defrag. That's the  
everyday defrag.








* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11  0:26               ` webmaster
@ 2019-09-11  0:36                 ` webmaster
  2019-09-11  1:00                 ` webmaster
  1 sibling, 0 replies; 111+ messages in thread
From: webmaster @ 2019-09-11  0:36 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

I made a small mistake: in 2), replace the word "ONE" with "AT LEAST ONE".

B) "sharing-defrag"
    a file is share-defragmented if ALL of the following are met:
    1) all extents are written on disk in neighbouring sectors in the  
ascending order
    2) all pairs of *adjacent* extents meet AT LEAST ONE of the  
following criteria
       2.1) both extents are sufficiently large
       2.2) the two extents have mismatching sharing groups (they are  
shared by different sets of files)
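
Expressed as a quick Python sketch (mine, just to pin the definition
down; "sharers" stands for the set of files referencing an extent):

def is_share_defragmented(extents, min_extent_size):
    # extents: list of (disk_start, length, sharers) in file order,
    # where sharers is the frozenset of files referencing that extent
    for (s0, l0, g0), (s1, l1, g1) in zip(extents, extents[1:]):
        # 1) extents must occupy neighbouring sectors, in ascending order
        if s1 != s0 + l0:
            return False
        # 2) each adjacent pair must satisfy 2.1 (both sufficiently large)
        #    or 2.2 (mismatching sharing groups); otherwise they should
        #    have been fused into one extent
        if not ((l0 >= min_extent_size and l1 >= min_extent_size)
                or g0 != g1):
            return False
    return True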



* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11  0:26               ` webmaster
  2019-09-11  0:36                 ` webmaster
@ 2019-09-11  1:00                 ` webmaster
  1 sibling, 0 replies; 111+ messages in thread
From: webmaster @ 2019-09-11  1:00 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


Quoting myself webmaster@zedlx.com:

> B) "sharing-defrag"
>    a file is share-defragmented if ALL of the following are met:
>    1) all extents are written on disk in neighbouring sectors in the  
> ascending order
>    2) all pairs of *adjacent* extents meet AT LEAST ONE of the  
> following criteria
>       2.1) both extents are sufficiently large
>       2.2) the two extents have mismatching sharing groups (they are  
> shared by different sets of files)

If it is hard to understand what is meant by those rules, consider  
the following example (copy-pasted from the previous post):

Let's say that there is a file FFF with extents e11, e12, e13, e22,
e23, e33, e34
- in subvolA the file FFF consists of e11, e12, e13
- in subvolB the file FFF consists of e11, e22, e23
- in subvolC the file FFF consists of e11, e22, e33, e34

After defrag, where 'selected subvolume' is subvolC, the extents are
ordered on disk as follows:

e11,e22,e33,e34 - e23 - e12,e13

In the list above, the comma denotes neighbouring extents, the dash
indicates that there can be a possible gap.

As you can see in the list, the file FFF is fully sharing-defragmented in
subvolC, since its extents occupy neighbouring disk sectors, except for  
extents e33,e34, which can be fused if one of them is too small. If that  
is not the case (e33 and e34 are sufficiently large), then the perfect  
solution is given.
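
The layout above can be produced mechanically; a toy sketch (Python,
illustration only, using the extent names from the example):

files = {                                  # which extents each subvolume's FFF uses
    'subvolA': ['e11', 'e12', 'e13'],
    'subvolB': ['e11', 'e22', 'e23'],
    'subvolC': ['e11', 'e22', 'e33', 'e34'],
}

def layout(files, selected):
    placed, runs = set(), []
    # the selected subvolume's copy goes first, fully contiguous
    first = list(files[selected])
    placed.update(first)
    runs.append(first)
    # every other subvolume contributes the extents not yet placed, as its
    # own run; the gaps between runs are the dashes above, so the order of
    # the trailing runs is not important
    for subvol, extents in files.items():
        rest = [e for e in extents if e not in placed]
        if rest:
            placed.update(rest)
            runs.append(rest)
    return runs

print(layout(files, 'subvolC'))
# [['e11', 'e22', 'e33', 'e34'], ['e12', 'e13'], ['e23']]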



* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10 22:35                       ` webmaster
@ 2019-09-11  6:40                         ` Nikolay Borisov
  0 siblings, 0 replies; 111+ messages in thread
From: Nikolay Borisov @ 2019-09-11  6:40 UTC (permalink / raw)
  To: webmaster; +Cc: linux-btrfs



On 11.09.19 at 1:35, webmaster@zedlx.com wrote:
> 
> Quoting Nikolay Borisov <n.borisov.lkml@gmail.com>:
> 
> 
>>>>
>>>> You're exactly in the pitfall of btrfs backref walk.
>>>>
>>>> For btrfs, it's definitely not an easy work to do backref walk.
>>>> btrfs uses hidden backref, that means, under most case, one extent
>>>> shared by 1000 snapshots, in extent tree (shows the backref) it can
>>>> completely be possible to only have one ref, for the initial subvolume.
>>>>
>>>> For btrfs, you need to walk up the tree to find how it's shared.
>>>>
>>>> It has to be done like that, that's why we call it backref-*walk*.
>>>>
>>>> E.g
>>>>           A (subvol 257)     B (Subvol 258, snapshot of 257)
>>>>           |    \        /    |
>>>>           |        X         |
>>>>           |    /        \    |
>>>>           C                  D
>>>>          / \                / \
>>>>         E   F              G   H
>>>>
>>>> In extent tree, E is only referred by subvol 257.
>>>> While C has two referencers, 257 and 258.
>>>>
>>>> So in reality, you need to:
>>>> 1) Do a tree search from subvol 257
>>>>    You got a path, E -> C -> A
>>>> 2) Check each node to see if it's shared.
>>>>    E is only referred by C, no extra referencer.
>>>>    C is refered by two new tree blocks, A and B.
>>>>    A is refered by subvol 257.
>>>>    B is refered by subvol 258.
>>>>    So E is shared by 257 and 258.
>>>>
>>>> Now, you see how things would go mad, for each extent you must go that
>>>> way to determine the real owner of each extent, not to mention we can
>>>> have at most 8 levels, tree blocks at level 0~7 can all be shared.
>>>>
>>>> If it's shared by 1000 subvolumes, hope you had a good day then.
>>>
>>> Ok, let's do just this issue for the time being. One issue at a time. It
>>> will be easier.
>>>
>>> The solution is to temporarily create a copy of the entire backref-tree
>>> in memory. To create this copy, you just do a preorder depth-first
>>> traversal following only forward references.
>>>
>>> So this preorder depth-first traversal would visit the nodes in the
>>> following order:
>>> A,C,E,F,D,G,H,B
>>>
>>> Oh, it is not a tree, it is a DAG in that example of yours. OK, preorder
>>> is possible on DAG, too. But how did you get a DAG, shouldn't it be all
>>> trees?
>>>
>>> When you have the entire backref-tree (backref-DAG?) in memory, doing a
>>> backref-walk is a piece of cake.
>>>
>>> Of course, this in-memory backref tree has to be kept in sync with the
>>> filesystem, that is it has to be updated whenever there is a write to
>>> disk. That's not so hard.
>>
>> Great, now that you have devised a solution and have plenty of
>> experience writing code why not try and contribute to btrfs?
> 
> First, that is what I'm just doing. I'm contributing to discussion on
> most needed features of btrfs. I'm helping you to get on the right track

"most needed" according to your needs. Again, "right track" according to
you.

> and waste less time on unimportant stuff.

Who decides whether something is important or unimportant?

> 
> You might appreciate my help, or not, but I am trying to help.
> 
> What you probaby wanted to say is that you would like me to contribute
> by writing code, pro bono. Unfortunately, I work for money as does the
> 99% of the population. Why not contribute for free? For the same reason
> why the rest of the population doesn't work for free. And, I'm not going
> from door to door and buggin everyone with "why don't you work for
> free", "why don't you help this noble cause..." blah. Makes no sense to me.

Correct, this boils down to the fact that there are essentially 2 ways to
get something done in open source - code or money. So far you haven't
provided either. By the same token - why should anyone do, in essence,
pro-bono work for features *you* are specifically interested in? Let's not
kid ourselves - that's what this is all about.

So while the discussion you spurred could be somewhat beneficial, the
style you use is definitely not. It's, at the very least, grating and
somewhat aggressive. You haven't done any technical work on btrfs, yet
when more knowledgeable people, who have spent years(!!!) working on the
code base, tell you that there are technical reasons why things are the
way they are, you are quick to dismiss their opinion. Of course they
might very well be wrong - prove it to them, ideally with code.

Given this how do you expect people to be interested in engaging in a
meaningful conversation with you?




* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-10 23:32             ` webmaster
@ 2019-09-11 12:02               ` Austin S. Hemmelgarn
  2019-09-11 16:26                 ` Zygo Blaxell
  2019-09-11 17:20                 ` webmaster
  0 siblings, 2 replies; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-11 12:02 UTC (permalink / raw)
  To: webmaster; +Cc: linux-btrfs

On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
> 
>>> Defrag may break up extents. Defrag may fuse extents. But it shouln't 
>>> ever unshare extents.
> 
>> Actually, spitting or merging extents will unshare them in a large 
>> majority of cases.
> 
> Ok, this point seems to be repeated over and over without any proof, and 
> it is illogical to me.
> 
> About merging extents: a defrag should merge extents ONLY when both 
> extents are shared by the same files (and when those extents are 
> neighbours in both files). In other words, defrag should always merge 
> without unsharing. Let's call that operation "fusing extents", so that 
> there are no more misunderstandings.
And I reiterate: defrag only operates on the file it's passed in.  It 
needs to for efficiency reasons (we had a reflink aware defrag for a 
while a few years back, it got removed because performance limitations 
meant it was unusable in the cases where you actually needed it). 
Defrag doesn't even know that there are reflinks to the extents it's 
operating on.

Now factor in that _any_ write will result in unsharing the region being 
written to, rounded to the nearest full filesystem block in both 
directions (this is mandatory, it's a side effect of the copy-on-write 
nature of BTRFS, and is why files that experience heavy internal 
rewrites get fragmented very heavily and very quickly on BTRFS).

Given this, defrag isn't willfully unsharing anything, it's just a 
side-effect of how it works (since it's rewriting the block layout of 
the file in-place).
> 
> === I CHALLENGE you and anyone else on this mailing list: ===
> 
>   - Show me an exaple where splitting an extent requires unsharing, and 
> this split is needed to defrag.
> 
> Make it clear, write it yourself, I don't want any machine-made outputs.
> 
Start with the above comment about all writes unsharing the region being 
written to.

Now, extrapolating from there:

Assume you have two files, A and B, each consisting of 64 filesystem 
blocks in single shared extent.  Now assume somebody writes a few bytes 
to the middle of file B, right around the boundary between blocks 31 and 
32, and that you get similar writes to file A straddling blocks 14-15 
and 47-48.

After all of that, file A will be 5 extents:

* A reflink to blocks 0-13 of the original extent.
* A single isolated extent consisting of the new blocks 14-15
* A reflink to blocks 16-46 of the original extent.
* A single isolated extent consisting of the new blocks 47-48
* A reflink to blocks 49-63 of the original extent.

And file B will be 3 extents:

* A reflink to blocks 0-30 of the original extent.
* A single isolated extent consisting of the new blocks 31-32.
* A reflink to blocks 33-63 of the original extent.

Note that there are a total of four contiguous sequences of blocks that 
are common between both files:

* 0-13
* 16-30
* 33-46
* 49-63

There is no way to completely defragment either file without splitting 
the original extent (which is still there, just not fully referenced by 
either file) unless you rewrite the whole file to a new single extent 
(which would, of course, completely unshare the whole file).  In fact, 
if you want to ensure that those shared regions stay reflinked, there's 
no way to defragment either file without _increasing_ the number of 
extents in that file (either file would need 7 extents to properly share 
only those 4 regions), and even then only one of the files could be 
fully defragmented.

Such a situation generally won't happen if you're just dealing with 
read-only snapshots, but is not unusual when dealing with regular files 
that are reflinked (which is not an uncommon situation on some systems, 
as a lot of people have `cp` aliased to reflink things whenever possible).


* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 12:02               ` Austin S. Hemmelgarn
@ 2019-09-11 16:26                 ` Zygo Blaxell
  2019-09-11 17:20                 ` webmaster
  1 sibling, 0 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-11 16:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: webmaster, linux-btrfs


On Wed, Sep 11, 2019 at 08:02:40AM -0400, Austin S. Hemmelgarn wrote:
> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> > 
> > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > 
> > 
> > > > Defrag may break up extents. Defrag may fuse extents. But it
> > > > shouln't ever unshare extents.
> > 
> > > Actually, spitting or merging extents will unshare them in a large
> > > majority of cases.
> > 
> > Ok, this point seems to be repeated over and over without any proof, and
> > it is illogical to me.
> > 
> > About merging extents: a defrag should merge extents ONLY when both
> > extents are shared by the same files (and when those extents are
> > neighbours in both files). In other words, defrag should always merge
> > without unsharing. Let's call that operation "fusing extents", so that
> > there are no more misunderstandings.
> And I reiterate: defrag only operates on the file it's passed in.  It needs
> to for efficiency reasons (we had a reflink aware defrag for a while a few
> years back, it got removed because performance limitations meant it was
> unusable in the cases where you actually needed it). Defrag doesn't even
> know that there are reflinks to the extents it's operating on.
> 
> Now factor in that _any_ write will result in unsharing the region being
> written to, rounded to the nearest full filesystem block in both directions
> (this is mandatory, it's a side effect of the copy-on-write nature of BTRFS,
> and is why files that experience heavy internal rewrites get fragmented very
> heavily and very quickly on BTRFS).
> 
> Given this, defrag isn't willfully unsharing anything, it's just a
> side-effect of how it works (since it's rewriting the block layout of the
> file in-place).
> > 
> > === I CHALLENGE you and anyone else on this mailing list: ===
> > 
> >   - Show me an exaple where splitting an extent requires unsharing, and
> > this split is needed to defrag.
> >
> > Make it clear, write it yourself, I don't want any machine-made outputs.
> > 
> Start with the above comment about all writes unsharing the region being
> written to.
> 
> Now, extrapolating from there:
> 
> Assume you have two files, A and B, each consisting of 64 filesystem blocks
> in single shared extent.  Now assume somebody writes a few bytes to the
> middle of file B, right around the boundary between blocks 31 and 32, and
> that you get similar writes to file A straddling blocks 14-15 and 47-48.
> 
> After all of that, file A will be 5 extents:
> 
> * A reflink to blocks 0-13 of the original extent.
> * A single isolated extent consisting of the new blocks 14-15
> * A reflink to blocks 16-46 of the original extent.
> * A single isolated extent consisting of the new blocks 47-48
> * A reflink to blocks 49-63 of the original extent.
> 
> And file B will be 3 extents:
> 
> * A reflink to blocks 0-30 of the original extent.
> * A single isolated extent consisting of the new blocks 31-32.
> * A reflink to blocks 33-63 of the original extent.
> 
> Note that there are a total of four contiguous sequences of blocks that are
> common between both files:
> 
> * 0-13
> * 16-30
> * 33-46
> * 49-63
> 
> There is no way to completely defragment either file without splitting the
> original extent (which is still there, just not fully referenced by either
> file) unless you rewrite the whole file to a new single extent (which would,
> of course, completely unshare the whole file).  In fact, if you want to
> ensure that those shared regions stay reflinked, there's no way to
> defragment either file without _increasing_ the number of extents in that
> file (either file would need 7 extents to properly share only those 4
> regions), and even then only one of the files could be fully defragmented.

Arguably, the kernel's defrag ioctl should go ahead and do the extent
relocation and update all the reflinks at once, using the file given
in the argument as the "canonical" block order, i.e. the fd and offset
range you pass in is checked, and if it's not physically contiguous,
the extents in the range are copied to a single contiguous extent, then
all the other references to the old extent(s) within the offset range are
rewritten to point to the new extent, then the old extent is discarded.

It is possible to do this from userspace now using a mix of data copies
and dedupe, but it's much more efficient to use the facilities available
in the kernel:  in particular, the kernel can lock the extent in question
while all of this is going on, and the kernel can update shared snapshot
metadata pages directly instead of duplicating them and doing identical
updates on each copy.

This sort of extent reference manipulation, particularly of extents
referenced by readonly snapshots, used to break a lot of things (btrfs
send in particular) but the same issues came up again for dedupe,
and they now seem to be fixed as of 5.3 or so.  Maybe it's time to try
shared-extent-aware defrag again.

In practice, such an improved defrag ioctl would probably need some
more limit parameters, e.g.  "just skip over any extent with more than
1000 references" or "do a two-pass algorithm and relocate data only if
every reference to the data is logically contiguous" to avoid getting
bogged down on extents which require more iops to defrag in the present
than can possibly be saved by using the defrag result in the future.
That makes the defrag API even uglier, with even more magic baked-in
behavior to get in the way of users who know what they're doing, but
some stranger on a mailing list requested it, so why not...  :-P
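
A very rough sketch of what I mean by those limit parameters (a Python
pseudo-model, not a real ioctl interface; the contiguity check is just one
possible reading of the two-pass idea):

MAX_REFS = 1000   # "skip any extent with more than 1000 references"

def should_relocate(refs, require_contiguous=False):
    # refs: list of (file_id, logical_offset, extent_offset, length)
    # references to one physical extent
    if len(refs) > MAX_REFS:
        # defragging this extent costs more iops now than it can ever
        # save later, so leave it alone
        return False
    if require_contiguous:
        # two-pass variant: relocate only if every referrer uses the same
        # part of the extent, so a single new contiguous extent can stand
        # in for the old one everywhere
        spans = {(eoff, length) for _f, _loff, eoff, length in refs}
        return len(spans) == 1
    return True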

> Such a situation generally won't happen if you're just dealing with
> read-only snapshots, but is not unusual when dealing with regular files that
> are reflinked (which is not an uncommon situation on some systems, as a lot
> of people have `cp` aliased to reflink things whenever possible).



* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 12:02               ` Austin S. Hemmelgarn
  2019-09-11 16:26                 ` Zygo Blaxell
@ 2019-09-11 17:20                 ` webmaster
  2019-09-11 18:19                   ` Austin S. Hemmelgarn
  2019-09-11 21:37                   ` Zygo Blaxell
  1 sibling, 2 replies; 111+ messages in thread
From: webmaster @ 2019-09-11 17:20 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>

>>
>> === I CHALLENGE you and anyone else on this mailing list: ===
>>
>>  - Show me an exaple where splitting an extent requires unsharing,  
>> and this split is needed to defrag.
>>
>> Make it clear, write it yourself, I don't want any machine-made outputs.
>>
> Start with the above comment about all writes unsharing the region  
> being written to.
>
> Now, extrapolating from there:
>
> Assume you have two files, A and B, each consisting of 64 filesystem  
> blocks in single shared extent.  Now assume somebody writes a few  
> bytes to the middle of file B, right around the boundary between  
> blocks 31 and 32, and that you get similar writes to file A  
> straddling blocks 14-15 and 47-48.
>
> After all of that, file A will be 5 extents:
>
> * A reflink to blocks 0-13 of the original extent.
> * A single isolated extent consisting of the new blocks 14-15
> * A reflink to blocks 16-46 of the original extent.
> * A single isolated extent consisting of the new blocks 47-48
> * A reflink to blocks 49-63 of the original extent.
>
> And file B will be 3 extents:
>
> * A reflink to blocks 0-30 of the original extent.
> * A single isolated extent consisting of the new blocks 31-32.
> * A reflink to blocks 32-63 of the original extent.
>
> Note that there are a total of four contiguous sequences of blocks  
> that are common between both files:
>
> * 0-13
> * 16-30
> * 32-46
> * 49-63
>
> There is no way to completely defragment either file without  
> splitting the original extent (which is still there, just not fully  
> referenced by either file) unless you rewrite the whole file to a  
> new single extent (which would, of course, completely unshare the  
> whole file).  In fact, if you want to ensure that those shared  
> regions stay reflinked, there's no way to defragment either file  
> without _increasing_ the number of extents in that file (either file  
> would need 7 extents to properly share only those 4 regions), and  
> even then only one of the files could be fully defragmented.
>
> Such a situation generally won't happen if you're just dealing with  
> read-only snapshots, but is not unusual when dealing with regular  
> files that are reflinked (which is not an uncommon situation on some  
> systems, as a lot of people have `cp` aliased to reflink things  
> whenever possible).

Well, thank you very much for writing this example. Your example is  
certainly not minimal, as it seems to me that one write to the file A  
and one write to file B would be sufficient to prove your point, so  
there we have one extra write in the example, but that's OK.

Your example proves that I was wrong. I admit: it is impossible to  
perfectly defrag one subvolume (in the way I imagined it should be  
done).
Why? Because, as in your example, there can be files within a SINGLE  
subvolume which share their extents with each other. I didn't consider  
such a case.

On the other hand, I judge this issue to be mostly irrelevant. Why?  
Because most of the file sharing will be between subvolumes, not  
within a subvolume. When a user creates a reflink to a file in the  
same subvolume, he is willingly denying himself the assurance of a  
perfect defrag. Because, as your example proves, if there are a few  
writes to BOTH files, it gets impossible to defrag perfectly. So, if  
the user creates such reflinks, it's his own wish and his own fault.

Such situations will occur only in some specific circumstances:
a) when the user is reflinking manually
b) when a file is copied from one subvolume into a different file in a  
different subvolume.

The situation a) is unusual in normal use of the filesystem. Even when  
it occurs, it is the explicit command given by the user, so he should  
be willing to accept all the consequences, even the bad ones like  
imperfect defrag.

The situation b) is possible, but as far as I know copies are  
currently not done that way in btrfs. There should probably be the  
option to reflink-copy files from another subvolume; that would be good.

But anyway, it doesn't matter, because most of the sharing will be  
between subvolumes, not within a subvolume. So, if there is some  
in-subvolume sharing, the defrag won't be 100% perfect, but that's a  
minor point. Unimportant.

>> About merging extents: a defrag should merge extents ONLY when both  
>> extents are shared by the same files (and when those extents are  
>> neighbours in both files). In other words, defrag should always  
>> merge without unsharing. Let's call that operation "fusing  
>> extents", so that there are no more misunderstandings.

> And I reiterate: defrag only operates on the file it's passed in.   
> It needs to for efficiency reasons (we had a reflink aware defrag  
> for a while a few years back, it got removed because performance  
> limitations meant it was unusable in the cases where you actually  
> needed it). Defrag doesn't even know that there are reflinks to the  
> extents it's operating on.

If the defrag doesn't know about all reflinks, that's bad in my view.  
That is a bad defrag. If you had a reflink-aware defrag, and it was  
slow, maybe that happened because the implementation was bad. Because,  
I don't see any reason why it should be slow. So, you will have to  
explain to me what was causing these performance problems.

> Given this, defrag isn't willfully unsharing anything, it's just a  
> side-effect of how it works (since it's rewriting the block layout  
> of the file in-place).

The current defrag has to unshare because, as you said, it is  
unaware of the full reflink structure. If it doesn't know about all  
reflinks, it has to unshare, there is no way around that.

> Now factor in that _any_ write will result in unsharing the region  
> being written to, rounded to the nearest full filesystem block in  
> both directions (this is mandatory, it's a side effect of the  
> copy-on-write nature of BTRFS, and is why files that experience  
> heavy internal rewrites get fragmented very heavily and very quickly  
> on BTRFS).

You mean: when defrag performs a write, the new data is unshared  
because every write is unshared? Really?

Consider there is an extent E55 shared by two files A and B. The  
defrag has to move E55 to another location. In order to do that,  
defrag creates a new extent E70. It makes it belong to file A by  
changing the reflink of extent E55 in file A to point to E70.

Now, to retain the original sharing structure, the defrag has to  
change the reflink of extent E55 in file B to point to E70. You are  
telling me this is not possible? Bullshit!
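
In toy terms, what I am describing is nothing more than this (a plain
Python model; E55 and E70 are from the example above, the other extent
names are made-up placeholders):

extents = {'E55': {'disk_start': 1000, 'length': 8}}
files = {
    'A': ['E10', 'E55', 'E30'],   # E10/E30/E90 are placeholder neighbors
    'B': ['E55', 'E90'],
}

def relocate(extents, files, old, new, new_disk_start):
    # 1) allocate the new extent and copy the data there (copy not modeled)
    extents[new] = {'disk_start': new_disk_start,
                    'length': extents[old]['length']}
    # 2) repoint *every* reflink that referenced the old extent,
    #    so the sharing structure is preserved
    for name, refs in files.items():
        files[name] = [new if e == old else e for e in refs]
    # 3) the old extent is now unreferenced and can be freed
    del extents[old]

relocate(extents, files, 'E55', 'E70', new_disk_start=5000)
print(files)   # {'A': ['E10', 'E70', 'E30'], 'B': ['E70', 'E90']}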

Please explain to me how this 'defrag has to unshare' story of yours  
isn't an intentional attempt to mislead me.




* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 17:20                 ` webmaster
@ 2019-09-11 18:19                   ` Austin S. Hemmelgarn
  2019-09-11 20:01                     ` webmaster
  2019-09-11 21:37                     ` webmaster
  2019-09-11 21:37                   ` Zygo Blaxell
  1 sibling, 2 replies; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-11 18:19 UTC (permalink / raw)
  To: webmaster; +Cc: linux-btrfs

On 2019-09-11 13:20, webmaster@zedlx.com wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>
>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>
> 
>>>
>>> === I CHALLENGE you and anyone else on this mailing list: ===
>>>
>>>  - Show me an exaple where splitting an extent requires unsharing, 
>>> and this split is needed to defrag.
>>>
>>> Make it clear, write it yourself, I don't want any machine-made outputs.
>>>
>> Start with the above comment about all writes unsharing the region 
>> being written to.
>>
>> Now, extrapolating from there:
>>
>> Assume you have two files, A and B, each consisting of 64 filesystem 
>> blocks in single shared extent.  Now assume somebody writes a few 
>> bytes to the middle of file B, right around the boundary between 
>> blocks 31 and 32, and that you get similar writes to file A straddling 
>> blocks 14-15 and 47-48.
>>
>> After all of that, file A will be 5 extents:
>>
>> * A reflink to blocks 0-13 of the original extent.
>> * A single isolated extent consisting of the new blocks 14-15
>> * A reflink to blocks 16-46 of the original extent.
>> * A single isolated extent consisting of the new blocks 47-48
>> * A reflink to blocks 49-63 of the original extent.
>>
>> And file B will be 3 extents:
>>
>> * A reflink to blocks 0-30 of the original extent.
>> * A single isolated extent consisting of the new blocks 31-32.
>> * A reflink to blocks 33-63 of the original extent.
>>
>> Note that there are a total of four contiguous sequences of blocks 
>> that are common between both files:
>>
>> * 0-13
>> * 16-30
>> * 33-46
>> * 49-63
>>
>> There is no way to completely defragment either file without splitting 
>> the original extent (which is still there, just not fully referenced 
>> by either file) unless you rewrite the whole file to a new single 
>> extent (which would, of course, completely unshare the whole file).  
>> In fact, if you want to ensure that those shared regions stay 
>> reflinked, there's no way to defragment either file without 
>> _increasing_ the number of extents in that file (either file would 
>> need 7 extents to properly share only those 4 regions), and even then 
>> only one of the files could be fully defragmented.
>>
>> Such a situation generally won't happen if you're just dealing with 
>> read-only snapshots, but is not unusual when dealing with regular 
>> files that are reflinked (which is not an uncommon situation on some 
>> systems, as a lot of people have `cp` aliased to reflink things 
>> whenever possible).
> 
> Well, thank you very much for writing this example. Your example is 
> certainly not minimal, as it seems to me that one write to the file A 
> and one write to file B would be sufficient to prove your point, so 
> there we have one extra write in the example, but that's OK.
> 
> Your example proves that I was wrong. I admit: it is impossible to 
> perfectly defrag one subvolume (in the way I imagined it should be done).
> Why? Because, as in your example, there can be files within a SINGLE 
> subvolume which share their extents with each other. I didn't consider 
> such a case.
> 
> On the other hand, I judge this issue to be mostly irrelevant. Why? 
> Because most of the file sharing will be between subvolumes, not within 
> a subvolume.
Not necessarily. Even ignoring the case of data deduplication (which 
needs to be considered if you care at all about enterprise usage, and is 
part of the whole point of using a CoW filesystem), there are existing 
applications that actively use reflinks, either directly or indirectly 
(via things like the `copy_file_range` system call), and the number of 
such applications is growing.

> When a user creates a reflink to a file in the same 
> subvolume, he is willingly denying himself the assurance of a perfect 
> defrag. Because, as your example proves, if there are a few writes to 
> BOTH files, it gets impossible to defrag perfectly. So, if the user 
> creates such reflinks, it's his own whish and his own fault.
The same argument can be made about snapshots.  It's an invalid argument 
in both cases though because it's not always the user who's creating the 
reflinks or snapshots.
> 
> Such situations will occur only in some specific circumstances:
> a) when the user is reflinking manually
> b) when a file is copied from one subvolume into a different file in a 
> different subvolume.
> 
> The situation a) is unusual in normal use of the filesystem. Even when 
> it occurs, it is the explicit command given by the user, so he should be 
> willing to accept all the consequences, even the bad ones like imperfect 
> defrag.
> 
> The situation b) is possible, but as far as I know copies are currently 
> not done that way in btrfs. There should probably be the option to 
> reflink-copy files fron another subvolume, that would be good.
> 
> But anyway, it doesn't matter. Because most of the sharing will be 
> between subvolumes, not within subvolume. So, if there is some 
> in-subvolume sharing, the defrag wont be 100% perfect, that a minor 
> point. Unimportant.
You're focusing too much on your own use case here.  Not everybody uses 
snapshots, and there are many people who are using reflinks very 
actively within subvolumes, either for deduplication or because it saves 
time and space when dealing with multiple copies of mostly identical 
trees of files.
> 
>>> About merging extents: a defrag should merge extents ONLY when both 
>>> extents are shared by the same files (and when those extents are 
>>> neighbours in both files). In other words, defrag should always merge 
>>> without unsharing. Let's call that operation "fusing extents", so 
>>> that there are no more misunderstandings.
> 
>> And I reiterate: defrag only operates on the file it's passed in.  It 
>> needs to for efficiency reasons (we had a reflink aware defrag for a 
>> while a few years back, it got removed because performance limitations 
>> meant it was unusable in the cases where you actually needed it). 
>> Defrag doesn't even know that there are reflinks to the extents it's 
>> operating on.
> 
> If the defrag doesn't know about all reflinks, that's bad in my view. 
> That is a bad defrag. If you had a reflink-aware defrag, and it was 
> slow, maybe that happened because the implementation was bad. Because, I 
> don't see any reason why it should be slow. So, you will have to explain 
> to me what was causing these performance problems.
> 
>> Given this, defrag isn't willfully unsharing anything, it's just a 
>> side-effect of how it works (since it's rewriting the block layout of 
>> the file in-place).
> 
> The current defrag has to unshare because, as you said, because it is 
> unaware of the full reflink structure. If it doesn't know about all 
> reflinks, it has to unshare, there is no way around that.
> 
>> Now factor in that _any_ write will result in unsharing the region 
>> being written to, rounded to the nearest full filesystem block in both 
>> directions (this is mandatory, it's a side effect of the copy-on-write 
>> nature of BTRFS, and is why files that experience heavy internal 
>> rewrites get fragmented very heavily and very quickly on BTRFS).
> 
> You mean: when defrag performs a write, the new data is unshared because 
> every write is unshared? Really?
> 
> Consider there is an extent E55 shared by two files A and B. The defrag 
> has to move E55 to another location. In order to do that, defrag creates 
> a new extent E70. It makes it belong to file A by changing the reflink 
> of extent E55 in file A to point to E70.
> 
> Now, to retain the original sharing structure, the defrag has to change 
> the reflink of extent E55 in file B to point to E70. You are telling me 
> this is not possible? Bullshit!
> 
> Please explain to me how this 'defrag has to unshare' story of yours 
> isn't an intentional attempt to mislead me.
As mentioned in the previous email, we actually did have a (mostly) 
working reflink-aware defrag a few years back.  It got removed because 
it had serious performance issues.  Note that we're not talking a few 
seconds of extra time to defrag a full tree here, we're talking 
double-digit _minutes_ of extra time to defrag a moderate sized (low 
triple digit GB) subvolume with dozens of snapshots, _if you were lucky_ 
(if you weren't, you would be looking at potentially multiple _hours_ of 
runtime for the defrag).  The performance scaled inversely proportionate 
to the number of reflinks involved and the total amount of data in the 
subvolume being defragmented, and was pretty bad even in the case of 
only a couple of snapshots.

Ultimately, there are a couple of issues at play here:

* Online defrag has to maintain consistency during operation.  The 
current implementation does this by rewriting the regions being 
defragmented (which causes them to become a single new extent (most of 
the time)), which avoids a whole lot of otherwise complicated logic 
required to make sure things happen correctly, and also means that only 
the file being operated on is impacted and only the parts being modified 
need to be protected against concurrent writes.  Properly handling 
reflinks means that _every_ file that shares some part of an extent with 
the file being operated on needs to have the reflinked regions locked 
for the defrag operation, which has a huge impact on performance. Using 
your example, the update to E55 in both files A and B has to happen as 
part of the same commit, which can contain no other writes in that 
region of the file, otherwise you run the risk of losing writes to file 
B that occur while file A is being defragmented.  It's not horrible when 
it's just a small region in two files, but it becomes a big issue when 
dealing with lots of files and/or particularly large extents (extents in 
BTRFS can get into the GB range in terms of size when dealing with 
really big files).

* Reflinks can reference partial extents.  This means, ultimately, that 
you may end up having to split extents in odd ways during defrag if you 
want to preserve reflinks, and might have to split extents _elsewhere_ 
that are only tangentially related to the region being defragmented. 
See the example in my previous email for a case like this, maintaining 
the shared regions as being shared when you defragment either file to a 
single extent will require splitting extents in the other file (in 
either case, whichever file you don't defragment to a single extent will 
end up having 7 extents if you try to force the one that's been 
defragmented to be the canonical version).  Once you consider that a 
given extent can have multiple ranges reflinked from multiple other 
locations, it gets even more complicated.

* If you choose to just not handle the above point by not letting defrag 
split extents, you put a hard lower limit on the amount of fragmentation 
present in a file if you want to preserve reflinks.  IOW, you can't 
defragment files past a certain point.  If we go this way, neither of 
the two files in the example from my previous email could be 
defragmented any further than they already are, because doing so would 
require splitting extents.

* Determining all the reflinks to a given region of a given extent is 
not a cheap operation, and the information may immediately be stale 
(because an operation right after you fetch the info might change 
things).  We could work around this by locking the extent somehow, but 
doing so would be expensive because you would have to hold the lock for 
the entire defrag operation.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 18:19                   ` Austin S. Hemmelgarn
@ 2019-09-11 20:01                     ` webmaster
  2019-09-11 21:42                       ` Zygo Blaxell
  2019-09-11 21:37                     ` webmaster
  1 sibling, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-11 20:01 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>
>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>
>>
>>>>
>>>> === I CHALLENGE you and anyone else on this mailing list: ===
>>>>
>>>>  - Show me an example where splitting an extent requires 
>>>> unsharing, and this split is needed to defrag.
>>>>
>>>> Make it clear, write it yourself, I don't want any machine-made outputs.
>>>>
>>> Start with the above comment about all writes unsharing the region  
>>> being written to.
>>>
>>> Now, extrapolating from there:
>>>
>>> Assume you have two files, A and B, each consisting of 64  
>>> filesystem blocks in single shared extent.  Now assume somebody  
>>> writes a few bytes to the middle of file B, right around the  
>>> boundary between blocks 31 and 32, and that you get similar writes  
>>> to file A straddling blocks 14-15 and 47-48.
>>>
>>> After all of that, file A will be 5 extents:
>>>
>>> * A reflink to blocks 0-13 of the original extent.
>>> * A single isolated extent consisting of the new blocks 14-15
>>> * A reflink to blocks 16-46 of the original extent.
>>> * A single isolated extent consisting of the new blocks 47-48
>>> * A reflink to blocks 49-63 of the original extent.
>>>
>>> And file B will be 3 extents:
>>>
>>> * A reflink to blocks 0-30 of the original extent.
>>> * A single isolated extent consisting of the new blocks 31-32.
>>> * A reflink to blocks 33-63 of the original extent.
>>>
>>> Note that there are a total of four contiguous sequences of blocks  
>>> that are common between both files:
>>>
>>> * 0-13
>>> * 16-30
>>> * 33-46
>>> * 49-63
>>>
>>> There is no way to completely defragment either file without  
>>> splitting the original extent (which is still there, just not  
>>> fully referenced by either file) unless you rewrite the whole file  
>>> to a new single extent (which would, of course, completely unshare  
>>> the whole file).  In fact, if you want to ensure that those shared  
>>> regions stay reflinked, there's no way to defragment either file  
>>> without _increasing_ the number of extents in that file (either  
>>> file would need 7 extents to properly share only those 4 regions),  
>>> and even then only one of the files could be fully defragmented.
>>>
>>> Such a situation generally won't happen if you're just dealing  
>>> with read-only snapshots, but is not unusual when dealing with  
>>> regular files that are reflinked (which is not an uncommon  
>>> situation on some systems, as a lot of people have `cp` aliased to  
>>> reflink things whenever possible).
>>
>> Well, thank you very much for writing this example. Your example is  
>> certainly not minimal, as it seems to me that one write to the file  
>> A and one write to file B would be sufficient to prove your point,  
>> so there we have one extra write in the example, but that's OK.
>>
>> Your example proves that I was wrong. I admit: it is impossible to  
>> perfectly defrag one subvolume (in the way I imagined it should be  
>> done).
>> Why? Because, as in your example, there can be files within a  
>> SINGLE subvolume which share their extents with each other. I  
>> didn't consider such a case.
>>
>> On the other hand, I judge this issue to be mostly irrelevant. Why?  
>> Because most of the file sharing will be between subvolumes, not  
>> within a subvolume.

> Not necessarily. Even ignoring the case of data deduplication (which  
> needs to be considered if you care at all about enterprise usage,  
> and is part of the whole point of using a CoW filesystem), there are  
> existing applications that actively use reflinks, either directly or  
> indirectly (via things like the `copy_file_range` system call), and  
> the number of such applications is growing.

The same argument goes here: If data-deduplication was performed, then  
the user has specifically requested it.
Therefore, since it was user's will, the defrag has to honor it, and  
so the defrag must not unshare deduplicated extents because the user  
wants them shared. This might prevent a perfect defrag, but that is  
exactly what the user has requested, either directly or indirectly, by  
some policy he has chosen.

If an application actively creates reflinked-copies, then we can  
assume it does so according to user's will, therefore it is also a  
command by user and defrag should honor it by not unsharing and by  
being imperfect.

Now, you might point out that, in case of data-deduplication, we now  
have a case where most sharing might be within-subvolume, invalidating  
my assertion that most sharing will be between-subvolumes. But this is  
an invalid (more precisely, irrelevant) argument. Why? Because the 
defrag operation has to focus on doing what it can do, while honoring  
user's will. All within-subvolume sharing is user-requested, therefore  
it cannot be part of the argument to unshare.

You can't both perfectly defrag and honor deduplication. Therefore,  
the defrag has to do the best possible thing while still honoring  
user's will. <<<!!! So, the fact that the deduplication was performed  
is actually the reason FOR not unsharing, not against it, as you made  
it look in that paragraph. !!!>>>

If the system unshares automatically after deduplication, then the  
user will need to run deduplication again. Ridiculous!
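
For concreteness, this is roughly how a dedupe tool asks the kernel to
create that sharing in the first place, via the FIDEDUPERANGE ioctl.
The file names and the 1 MiB range are made up, and the sketch keeps
error handling to a minimum:

/* Rough sketch of how a dedupe tool asks the kernel to share a range
 * between two files via the FIDEDUPERANGE ioctl.  File names and the
 * range are placeholders for the example.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("file-a.bin", O_RDONLY);
    int dst = open("file-b.bin", O_RDWR);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    size_t len = 1024 * 1024;   /* range assumed identical in both files */

    struct file_dedupe_range *arg =
        calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
    arg->src_offset = 0;
    arg->src_length = len;
    arg->dest_count = 1;
    arg->info[0].dest_fd = dst;
    arg->info[0].dest_offset = 0;

    /* The kernel verifies the ranges are identical, then makes the
     * destination range reference the source extent (sharing). */
    if (ioctl(src, FIDEDUPERANGE, arg) < 0)
        perror("FIDEDUPERANGE");
    else
        printf("status %d, %llu bytes now shared\n",
               arg->info[0].status,
               (unsigned long long)arg->info[0].bytes_deduped);

    free(arg);
    close(src);
    close(dst);
    return 0;
}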

>> When a user creates a reflink to a file in the same subvolume, he  
>> is willingly denying himself the assurance of a perfect defrag.  
>> Because, as your example proves, if there are a few writes to BOTH  
>> files, it gets impossible to defrag perfectly. So, if the user  
>> creates such reflinks, it's his own wish and his own fault.

> The same argument can be made about snapshots.  It's an invalid  
> argument in both cases though because it's not always the user who's  
> creating the reflinks or snapshots.

Um, I don't agree.

1) Actually, it is always the user who is creating reflinks, and  
snapshots, too. Ultimately, it's always the user who does absolutely  
everything, because a computer is supposed to be under his full  
control. But, in the case of reflink-copies, this is even more true
because reflinks are not an essential feature for normal OS operation,  
at least as far as today's OSes go. Every OS has to copy files around.  
Every OS requires the copy operation. No current OS requires the  
reflinked-copy operation in order to function.

2) A user can make any number of snapshots and subvolumes, but he can  
at any time select one subvolume as a focus of the defrag operation,  
and that subvolume can be perfectly defragmented without any unsharing  
(except that the internal-reflinked files won't be perfectly  
defragmented).
Therefore, the snapshotting operation can never jeopardize a perfect 
defrag. The user can make many snapshots without any fears (I'd say a  
total of 100 snapshots at any point in time is a good and reasonable  
limit).

>> Such situations will occur only in some specific circumstances:
>> a) when the user is reflinking manually
>> b) when a file is copied from one subvolume into a different file  
>> in a different subvolume.
>>
>> The situation a) is unusual in normal use of the filesystem. Even  
>> when it occurs, it is the explicit command given by the user, so he  
>> should be willing to accept all the consequences, even the bad ones  
>> like imperfect defrag.
>>
>> The situation b) is possible, but as far as I know copies are  
>> currently not done that way in btrfs. There should probably be the  
>> option to reflink-copy files from another subvolume, that would be 
>> good.
>>
>> But anyway, it doesn't matter. Because most of the sharing will be  
>> between subvolumes, not within subvolume. So, if there is some  
>> in-subvolume sharing, the defrag won't be 100% perfect, that's a minor 
>> point. Unimportant.

> You're focusing too much on your own use case here.

It's so easy to say that. But you really don't know. You might be 
wrong. I might be the objective one, and you might be giving me some 
groupthink-induced, badly thought-out conclusions from years ago, 
which were never rechecked because that's so hard to do. And then 
everybody just repeats it and it becomes the truth. As Goebbels said, 
if you repeat anything enough times, it becomes the truth.

> Not everybody uses snapshots, and there are many people who are  
> using reflinks very actively within subvolumes, either for  
> deduplication or because it saves time and space when dealing with  
> multiple copies of mostly identical trees of files.

Yes, I guess there are many such users. Doesn't matter. What you are  
proposing is that the defrag should break all their reflinks and  
deduplicated data they painstakingly created. Come on!

Or, maybe the defrag should unshare to gain performance? Yes, but only 
WHEN THE USER REQUESTS IT. So the defrag can unshare, but only by 
request. Since this means that the user is reversing his previous 
command to not unshare, this has to be explicitly requested by the 
user, not part of the default defrag operation.


> As mentioned in the previous email, we actually did have a (mostly)  
> working reflink-aware defrag a few years back.  It got removed  
> because it had serious performance issues.  Note that we're not  
> talking a few seconds of extra time to defrag a full tree here,  
> we're talking double-digit _minutes_ of extra time to defrag a  
> moderate sized (low triple digit GB) subvolume with dozens of  
> snapshots, _if you were lucky_ (if you weren't, you would be looking  
> at potentially multiple _hours_ of runtime for the defrag).  The  
> performance scaled inversely proportionate to the number of reflinks  
> involved and the total amount of data in the subvolume being  
> defragmented, and was pretty bad even in the case of only a couple  
> of snapshots.
>
> Ultimately, there are a couple of issues at play here:

I'll reply to this in another post. This one is getting a bit too long.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 17:20                 ` webmaster
  2019-09-11 18:19                   ` Austin S. Hemmelgarn
@ 2019-09-11 21:37                   ` Zygo Blaxell
  2019-09-11 23:21                     ` webmaster
  1 sibling, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-11 21:37 UTC (permalink / raw)
  To: webmaster; +Cc: Austin S. Hemmelgarn, linux-btrfs

On Wed, Sep 11, 2019 at 01:20:53PM -0400, webmaster@zedlx.com wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
> > On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> > > 
> > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > 
> 
> > > 
> > > === I CHALLENGE you and anyone else on this mailing list: ===
> > > 
> > >  - Show me an example where splitting an extent requires unsharing,
> > > and this split is needed to defrag.
> > > 
> > > Make it clear, write it yourself, I don't want any machine-made outputs.
> > > 
> > Start with the above comment about all writes unsharing the region being
> > written to.
> > 
> > Now, extrapolating from there:
> > 
> > Assume you have two files, A and B, each consisting of 64 filesystem
> > blocks in single shared extent.  Now assume somebody writes a few bytes
> > to the middle of file B, right around the boundary between blocks 31 and
> > 32, and that you get similar writes to file A straddling blocks 14-15
> > and 47-48.
> > 
> > After all of that, file A will be 5 extents:
> > 
> > * A reflink to blocks 0-13 of the original extent.
> > * A single isolated extent consisting of the new blocks 14-15
> > * A reflink to blocks 16-46 of the original extent.
> > * A single isolated extent consisting of the new blocks 47-48
> > * A reflink to blocks 49-63 of the original extent.
> > 
> > And file B will be 3 extents:
> > 
> > * A reflink to blocks 0-30 of the original extent.
> > * A single isolated extent consisting of the new blocks 31-32.
> > * A reflink to blocks 33-63 of the original extent.
> > 
> > Note that there are a total of four contiguous sequences of blocks that
> > are common between both files:
> > 
> > * 0-13
> > * 16-30
> > * 33-46
> > * 49-63
> > 
> > There is no way to completely defragment either file without splitting
> > the original extent (which is still there, just not fully referenced by
> > either file) unless you rewrite the whole file to a new single extent
> > (which would, of course, completely unshare the whole file).  In fact,
> > if you want to ensure that those shared regions stay reflinked, there's
> > no way to defragment either file without _increasing_ the number of
> > extents in that file (either file would need 7 extents to properly share
> > only those 4 regions), and even then only one of the files could be
> > fully defragmented.
> > 
> > Such a situation generally won't happen if you're just dealing with
> > read-only snapshots, but is not unusual when dealing with regular files
> > that are reflinked (which is not an uncommon situation on some systems,
> > as a lot of people have `cp` aliased to reflink things whenever
> > possible).
> 
> Well, thank you very much for writing this example. Your example is
> certainly not minimal, as it seems to me that one write to the file A and
> one write to file B would be sufficient to prove your point, so there we
> have one extra write in the example, but that's OK.
> 
> Your example proves that I was wrong. I admit: it is impossible to perfectly
> defrag one subvolume (in the way I imagined it should be done).
> Why? Because, as in your example, there can be files within a SINGLE
> subvolume which share their extents with each other. I didn't consider such
> a case.
> 
> On the other hand, I judge this issue to be mostly irrelevant. Why? Because
> most of the file sharing will be between subvolumes, not within a subvolume.
> When a user creates a reflink to a file in the same subvolume, he is
> willingly denying himself the assurance of a perfect defrag. Because, as
> your example proves, if there are a few writes to BOTH files, it gets
> impossible to defrag perfectly. So, if the user creates such reflinks, it's
> his own wish and his own fault.
> 
> Such situations will occur only in some specific circumstances:
> a) when the user is reflinking manually
> b) when a file is copied from one subvolume into a different file in a
> different subvolume.
> 
> The situation a) is unusual in normal use of the filesystem. Even when it
> occurs, it is the explicit command given by the user, so he should be
> willing to accept all the consequences, even the bad ones like imperfect
> defrag.
> 
> The situation b) is possible, but as far as I know copies are currently not
> done that way in btrfs. There should probably be the option to reflink-copy
> files from another subvolume, that would be good.

Reflink copies across subvolumes have been working for years.  They are
an important component that makes dedupe work when snapshots are present.

> But anyway, it doesn't matter. Because most of the sharing will be between
> subvolumes, not within subvolume. 

Heh.  I'd like you to meet one of my medium-sized filesystems:

	Physical size:  8TB
	Logical size:  16TB
	Average references per extent:  2.03 (not counting snapshots)
	Workload:  CI build server, VM host

That's a filesystem where over half of the logical data is reflinks to the
other physical data, and 94% of that data is in a single subvol.  7.5TB of
data is unique, the remaining 500GB is referenced an average of 17 times.

We use ordinary applications to make ordinary copies of files, and do
tarball unpacks and source checkouts with reckless abandon, all day long.
Dedupe turns the copies into reflinks as we go, so every copy becomes
a reflink no matter how it was created.

For the VM filesystem image files, it's not uncommon to see a high
reflink rate within a single file as well as reflinks to other files
(like the binary files in the build directories that the VM images are
constructed from).  Those reference counts can go into the millions.

> So, if there is some in-subvolume sharing,
> the defrag won't be 100% perfect, that's a minor point. Unimportant.

It's not unimportant; however, the implementation does have to take this
into account, and make sure that defrag can efficiently skip extents that
are too expensive to relocate.  If we plan to read an extent fewer than
100 times, it makes no sense to update 20000 references to it--we spend
less total time just doing the 100 slower reads.  If the numbers are
reversed then it's better to defrag the extent--100 reference updates
are easily outweighed by 20000 faster reads.  The kernel doesn't have
enough information to make good decisions about this.
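
To put rough numbers on that trade-off, here is a toy cost model; the
per-operation constants are invented purely for illustration, not
measurements:

/* Toy cost model for "is defragmenting this extent worth it?".
 * The cost constants are invented for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

#define COST_REF_UPDATE   1.0  /* relative cost of updating one reference */
#define COST_FRAG_READ    5.0  /* extra cost per read of a fragmented extent */

static bool worth_defragmenting(unsigned long refs, unsigned long expected_reads)
{
    double relocate_cost = refs * COST_REF_UPDATE;
    double saved_read_cost = expected_reads * COST_FRAG_READ;
    return saved_read_cost > relocate_cost;
}

int main(void)
{
    /* 20000 references, 100 expected reads: leave it where it is. */
    printf("%d\n", worth_defragmenting(20000, 100));
    /* 100 references, 20000 expected reads: relocate it. */
    printf("%d\n", worth_defragmenting(100, 20000));
    return 0;
}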

Dedupe has a similar problem--it's rarely worth doing a GB of IO to
save 4K of space, so in practical implementations, a lot of duplicate
blocks have to remain duplicate.

There are some ways to make the kernel dedupe and defrag API process
each reference a little more efficiently, but none will get around this
basic physical problem:  some extents are just better off where they are.

Userspace has access to some extra data from the user, e.g.  "which
snapshots should have their references excluded from defrag because
the entire snapshot will be deleted in a few minutes."  That will allow
better defrag cost-benefit decisions than any in-kernel implementation
can make by itself.

'btrfs fi defrag' is just one possible userspace implementation, which
implements the "throw entire files at the legacy kernel defrag API one
at a time" algorithm.  Unfortunately, nobody seems to have implemented
any other algorithms yet, other than a few toy proof-of-concept demos.
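
For reference, the legacy per-file API that 'btrfs fi defrag' wraps looks
roughly like this; the path and extent threshold are placeholders, and
this is a sketch rather than the actual btrfs-progs code:

/* Rough sketch of the legacy per-file defrag API: one
 * BTRFS_IOC_DEFRAG_RANGE ioctl per file, with no knowledge of any
 * other file that shares the extents.
 */
#include <fcntl.h>
#include <linux/btrfs.h>
#include <linux/types.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/path/to/one/file", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct btrfs_ioctl_defrag_range_args args;
    memset(&args, 0, sizeof(args));
    args.start = 0;
    args.len = (__u64)-1;            /* whole file */
    args.extent_thresh = 256 * 1024; /* skip extents already >= 256 KiB */

    /* Rewrites the file's fragmented ranges in place; any extent that
     * was shared with another file gets unshared as a side effect. */
    if (ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args) < 0)
        perror("BTRFS_IOC_DEFRAG_RANGE");

    close(fd);
    return 0;
}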

> > > About merging extents: a defrag should merge extents ONLY when both
> > > extents are shared by the same files (and when those extents are
> > > neighbours in both files). In other words, defrag should always
> > > merge without unsharing. Let's call that operation "fusing extents",
> > > so that there are no more misunderstandings.
> 
> > And I reiterate: defrag only operates on the file it's passed in.  It
> > needs to for efficiency reasons (we had a reflink aware defrag for a
> > while a few years back, it got removed because performance limitations
> > meant it was unusable in the cases where you actually needed it). Defrag
> > doesn't even know that there are reflinks to the extents it's operating
> > on.
> 
> If the defrag doesn't know about all reflinks, that's bad in my view. That
> is a bad defrag. If you had a reflink-aware defrag, and it was slow, maybe
> that happened because the implementation was bad. Because, I don't see any
> reason why it should be slow. So, you will have to explain to me what was
> causing these performance problems.
> 
> > Given this, defrag isn't willfully unsharing anything, it's just a
> > side-effect of how it works (since it's rewriting the block layout of
> > the file in-place).
> 
> The current defrag has to unshare because, as you said, because it is
> unaware of the full reflink structure. If it doesn't know about all
> reflinks, it has to unshare, there is no way around that.
> 
> > Now factor in that _any_ write will result in unsharing the region being
> > written to, rounded to the nearest full filesystem block in both
> > directions (this is mandatory, it's a side effect of the copy-on-write
> > nature of BTRFS, and is why files that experience heavy internal
> > rewrites get fragmented very heavily and very quickly on BTRFS).
> 
> You mean: when defrag performs a write, the new data is unshared because
> every write is unshared? Really?
> 
> Consider there is an extent E55 shared by two files A and B. The defrag has
> to move E55 to another location. In order to do that, defrag creates a new
> extent E70. It makes it belong to file A by changing the reflink of extent
> E55 in file A to point to E70.
> 
> Now, to retain the original sharing structure, the defrag has to change the
> reflink of extent E55 in file B to point to E70. You are telling me this is
> not possible? Bullshit!

This is already possible today and userspace tools can do it--not as
efficiently as possible, but without requiring more than 128M of temporary
space.  'btrfs fi defrag' is not one of those tools.

> Please explain to me how this 'defrag has to unshare' story of yours isn't
> an intentional attempt to mislead me.

Austin is talking about the btrfs we have, not the btrfs we want.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 18:19                   ` Austin S. Hemmelgarn
  2019-09-11 20:01                     ` webmaster
@ 2019-09-11 21:37                     ` webmaster
  2019-09-12 11:31                       ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-11 21:37 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>
>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>

>>> Given this, defrag isn't willfully unsharing anything, it's just a  
>>> side-effect of how it works (since it's rewriting the block layout  
>>> of the file in-place).
>>
>> The current defrag has to unshare because, as you said, because it  
>> is unaware of the full reflink structure. If it doesn't know about  
>> all reflinks, it has to unshare, there is no way around that.
>>
>>> Now factor in that _any_ write will result in unsharing the region  
>>> being written to, rounded to the nearest full filesystem block in  
>>> both directions (this is mandatory, it's a side effect of the  
>>> copy-on-write nature of BTRFS, and is why files that experience  
>>> heavy internal rewrites get fragmented very heavily and very  
>>> quickly on BTRFS).
>>
>> You mean: when defrag performs a write, the new data is unshared  
>> because every write is unshared? Really?
>>
>> Consider there is an extent E55 shared by two files A and B. The  
>> defrag has to move E55 to another location. In order to do that,  
>> defrag creates a new extent E70. It makes it belong to file A by  
>> changing the reflink of extent E55 in file A to point to E70.
>>
>> Now, to retain the original sharing structure, the defrag has to  
>> change the reflink of extent E55 in file B to point to E70. You are  
>> telling me this is not possible? Bullshit!
>>
>> Please explain to me how this 'defrag has to unshare' story of  
>> yours isn't an intentional attempt to mislead me.

> As mentioned in the previous email, we actually did have a (mostly)  
> working reflink-aware defrag a few years back.  It got removed  
> because it had serious performance issues.  Note that we're not  
> talking a few seconds of extra time to defrag a full tree here,  
> we're talking double-digit _minutes_ of extra time to defrag a  
> moderate sized (low triple digit GB) subvolume with dozens of  
> snapshots, _if you were lucky_ (if you weren't, you would be looking  
> at potentially multiple _hours_ of runtime for the defrag).  The  
> performance scaled inversely proportionate to the number of reflinks  
> involved and the total amount of data in the subvolume being  
> defragmented, and was pretty bad even in the case of only a couple  
> of snapshots.

You cannot ever make the worst program, because an even worse program  
can be made by slowing down the original by a factor of 2.
So, you had a badly implemented defrag. At least you got some  
experience. Let's see what went wrong.

> Ultimately, there are a couple of issues at play here:
>
> * Online defrag has to maintain consistency during operation.  The  
> current implementation does this by rewriting the regions being  
> defragmented (which causes them to become a single new extent (most  
> of the time)), which avoids a whole lot of otherwise complicated  
> logic required to make sure things happen correctly, and also means  
> that only the file being operated on is impacted and only the parts  
> being modified need to be protected against concurrent writes.   
> Properly handling reflinks means that _every_ file that shares some  
> part of an extent with the file being operated on needs to have the  
> reflinked regions locked for the defrag operation, which has a huge  
> impact on performance. Using your example, the update to E55 in both  
> files A and B has to happen as part of the same commit, which can  
> contain no other writes in that region of the file, otherwise you  
> run the risk of losing writes to file B that occur while file A is  
> being defragmented.

Nah. I think there is a workaround. You can first (atomically) update 
A, then whatever, then you can update B later. I know, you're yelling 
"what if E55 gets updated in B". Doesn't matter. The defrag continues 
later by searching for the reflink to E55 in B. Then it checks the data 
contained in E55. If the data matches E70, then it can safely 
update the reflink in B. Or the defrag can just verify that neither 
E55 nor E70 has been written to in the meantime. That means they 
still have the same data.

> It's not horrible when it's just a small region in two files, but it  
> becomes a big issue when dealing with lots of files and/or  
> particularly large extents (extents in BTRFS can get into the GB  
> range in terms of size when dealing with really big files).

You must just split large extents in a smart way. So, in the  
beginning, the defrag can split large extents (2GB) into smaller ones  
(32MB) to facilitate more responsive and easier defrag.

If you have lots of files, update them one by one. It is possible. Or 
you can update in big batches. Whatever is faster.

The point is that the defrag can keep a buffer of "pending 
operations". Pending operations are those that should be performed in  
order to keep the original sharing structure. If the defrag gets  
interrupted, then files in "pending operations" will be unshared. But  
this should really be some important and urgent interrupt, as the  
"pending operations" buffer needs at most a second or two to complete  
its operations.
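
To illustrate the bookkeeping I have in mind (purely a toy in-memory
model, not btrfs code), a pending operation could record the generations
of the old and new extents and only be applied if neither has changed:

/* Toy in-memory model of the "pending operations" idea: after an extent
 * is relocated for file A, the remaining reference updates (file B, ...)
 * are queued and applied later, but only if the old and new extents are
 * still unmodified.  This shows the bookkeeping only; a real
 * implementation would live in the kernel with proper locking.
 */
#include <stdbool.h>
#include <stdio.h>

struct extent {
    unsigned long id;
    unsigned long generation;   /* bumped on every write to the extent */
};

struct pending_op {
    struct extent *old_extent;  /* e.g. E55 */
    struct extent *new_extent;  /* e.g. E70, the relocated copy */
    unsigned long old_gen;      /* generations recorded when queued */
    unsigned long new_gen;
    const char *file;           /* reference that still points at old_extent */
};

/* Apply one queued reference update if it is still safe to do so. */
static bool apply_pending(struct pending_op *op)
{
    if (op->old_extent->generation != op->old_gen ||
        op->new_extent->generation != op->new_gen) {
        printf("%s: data changed since queuing, dropping op (unshared)\n",
               op->file);
        return false;
    }
    printf("%s: reference switched from E%lu to E%lu, sharing preserved\n",
           op->file, op->old_extent->id, op->new_extent->id);
    return true;
}

int main(void)
{
    struct extent e55 = { 55, 7 }, e70 = { 70, 1 };

    /* Queue the update of file B's reference while relocating for file A. */
    struct pending_op op = { &e55, &e70, e55.generation, e70.generation, "B" };

    /* Case 1: no writes happened in between, the update is applied. */
    apply_pending(&op);

    /* Case 2: a later write to E55 bumps its generation, so a stale
     * queued update is skipped instead of corrupting file B. */
    e55.generation++;
    apply_pending(&op);
    return 0;
}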

> * Reflinks can reference partial extents.  This means, ultimately,  
> that you may end up having to split extents in odd ways during  
> defrag if you want to preserve reflinks, and might have to split  
> extents _elsewhere_ that are only tangentially related to the region  
> being defragmented. See the example in my previous email for a case  
> like this, maintaining the shared regions as being shared when you  
> defragment either file to a single extent will require splitting  
> extents in the other file (in either case, whichever file you don't  
> defragment to a single extent will end up having 7 extents if you  
> try to force the one that's been defragmented to be the canonical  
> version).  Once you consider that a given extent can have multiple  
> ranges reflinked from multiple other locations, it gets even more  
> complicated.

I think that this problem can be solved, and that it can be solved  
perfectly (the result is a perfectly-defragmented file). But, if it is  
so hard to do, just skip those problematic extents in the initial version 
of defrag.

Ultimately, in the super-duper defrag, those partially-referenced  
extents should be split up by defrag.

> * If you choose to just not handle the above point by not letting  
> defrag split extents, you put a hard lower limit on the amount of  
> fragmentation present in a file if you want to preserve reflinks.   
> IOW, you can't defragment files past a certain point.  If we go this  
> way, neither of the two files in the example from my previous email  
> could be defragmented any further than they already are, because  
> doing so would require splitting extents.

Oh, you're reading my thoughts. That's good.

Initial implementation of defrag might be not-so-perfect. It would  
still be better than the current defrag.

This is not a one-way street. Handling of partially-used extents can  
be improved in later versions.

> * Determining all the reflinks to a given region of a given extent  
> is not a cheap operation, and the information may immediately be  
> stale (because an operation right after you fetch the info might  
> change things).  We could work around this by locking the extent  
> somehow, but doing so would be expensive because you would have to  
> hold the lock for the entire defrag operation.

No. DO NOT LOCK TO RETRIEVE REFLINKS.

Instead, you have to create a hook in every function that updates the  
reflink structure or extents (for example, the write-to-file operation). 
So, when a reflink gets changed, the defrag is immediately notified  
about this. That way the defrag can keep its data about reflinks  
in-sync with the filesystem.

Also note, this defrag should run as a part of the kernel, not in  
userspace. Defrag-from-userspace is a nightmare. Defrag has to  
serialize its operations properly, and it must have knowledge of all  
other operations in progress. So, it can only operate efficiently as  
part of the kernel.
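
As a toy illustration of the hook idea (just the shape of the mechanism,
not actual btrfs code):

/* Toy illustration of the "hook" idea: code paths that change reflinks
 * call a notifier, and a running defrag registers a callback so its view
 * of the reflink structure never goes stale.
 */
#include <stdio.h>

typedef void (*reflink_hook_fn)(unsigned long extent_id, unsigned long inode);

static reflink_hook_fn reflink_hook;    /* at most one listener in this toy */

static void register_reflink_hook(reflink_hook_fn fn)
{
    reflink_hook = fn;
}

/* Every operation that adds/removes/moves a reference would call this. */
static void reflink_changed(unsigned long extent_id, unsigned long inode)
{
    if (reflink_hook)
        reflink_hook(extent_id, inode);
}

/* The defrag's callback: refresh its in-memory backlink data. */
static void defrag_on_reflink_change(unsigned long extent_id, unsigned long inode)
{
    printf("defrag: refreshing backlinks of extent %lu (inode %lu changed)\n",
           extent_id, inode);
}

int main(void)
{
    register_reflink_hook(defrag_on_reflink_change);
    /* Simulate a write that rewrites part of extent 55 in inode 1001. */
    reflink_changed(55, 1001);
    return 0;
}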




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 20:01                     ` webmaster
@ 2019-09-11 21:42                       ` Zygo Blaxell
  2019-09-13  1:33                         ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-11 21:42 UTC (permalink / raw)
  To: webmaster; +Cc: Austin S. Hemmelgarn, linux-btrfs

On Wed, Sep 11, 2019 at 04:01:01PM -0400, webmaster@zedlx.com wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
> > On 2019-09-11 13:20, webmaster@zedlx.com wrote:
> > > 
> > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > 
> > > > On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> > > > > 
> > > > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > > > 
> > > 
> > > > > 
> > > > > === I CHALLENGE you and anyone else on this mailing list: ===
> > > > > 
> > > > >  - Show me an example where splitting an extent requires
> > > > > unsharing, and this split is needed to defrag.
> > > > > 
> > > > > Make it clear, write it yourself, I don't want any machine-made outputs.
> > > > > 
> > > > Start with the above comment about all writes unsharing the
> > > > region being written to.
> > > > 
> > > > Now, extrapolating from there:
> > > > 
> > > > Assume you have two files, A and B, each consisting of 64
> > > > filesystem blocks in single shared extent.  Now assume somebody
> > > > writes a few bytes to the middle of file B, right around the
> > > > boundary between blocks 31 and 32, and that you get similar
> > > > writes to file A straddling blocks 14-15 and 47-48.
> > > > 
> > > > After all of that, file A will be 5 extents:
> > > > 
> > > > * A reflink to blocks 0-13 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 14-15
> > > > * A reflink to blocks 16-46 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 47-48
> > > > * A reflink to blocks 49-63 of the original extent.
> > > > 
> > > > And file B will be 3 extents:
> > > > 
> > > > * A reflink to blocks 0-30 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 31-32.
> > > > * A reflink to blocks 33-63 of the original extent.
> > > > 
> > > > Note that there are a total of four contiguous sequences of
> > > > blocks that are common between both files:
> > > > 
> > > > * 0-13
> > > > * 16-30
> > > > * 33-46
> > > > * 49-63
> > > > 
> > > > There is no way to completely defragment either file without
> > > > splitting the original extent (which is still there, just not
> > > > fully referenced by either file) unless you rewrite the whole
> > > > file to a new single extent (which would, of course, completely
> > > > unshare the whole file).  In fact, if you want to ensure that
> > > > those shared regions stay reflinked, there's no way to
> > > > defragment either file without _increasing_ the number of
> > > > extents in that file (either file would need 7 extents to
> > > > properly share only those 4 regions), and even then only one of
> > > > the files could be fully defragmented.
> > > > 
> > > > Such a situation generally won't happen if you're just dealing
> > > > with read-only snapshots, but is not unusual when dealing with
> > > > regular files that are reflinked (which is not an uncommon
> > > > situation on some systems, as a lot of people have `cp` aliased
> > > > to reflink things whenever possible).
> > > 
> > > Well, thank you very much for writing this example. Your example is
> > > certainly not minimal, as it seems to me that one write to the file
> > > A and one write to file B would be sufficient to prove your point,
> > > so there we have one extra write in the example, but that's OK.
> > > 
> > > Your example proves that I was wrong. I admit: it is impossible to
> > > perfectly defrag one subvolume (in the way I imagined it should be
> > > done).
> > > Why? Because, as in your example, there can be files within a SINGLE
> > > subvolume which share their extents with each other. I didn't
> > > consider such a case.
> > > 
> > > On the other hand, I judge this issue to be mostly irrelevant. Why?
> > > Because most of the file sharing will be between subvolumes, not
> > > within a subvolume.
> 
> > Not necessarily. Even ignoring the case of data deduplication (which
> > needs to be considered if you care at all about enterprise usage, and is
> > part of the whole point of using a CoW filesystem), there are existing
> > applications that actively use reflinks, either directly or indirectly
> > (via things like the `copy_file_range` system call), and the number of
> > such applications is growing.
> 
> The same argument goes here: If data-deduplication was performed, then the
> user has specifically requested it.
> Therefore, since it was user's will, the defrag has to honor it, and so the
> defrag must not unshare deduplicated extents because the user wants them
> shared. This might prevent a perfect defrag, but that is exactly what the
> user has requested, either directly or indirectly, by some policy he has
> chosen.
> 
> If an application actively creates reflinked-copies, then we can assume it
> does so according to user's will, therefore it is also a command by user and
> defrag should honor it by not unsharing and by being imperfect.
> 
> Now, you might point out that, in case of data-deduplication, we now have a
> case where most sharing might be within-subvolume, invalidating my assertion
> that most sharing will be between-subvolumes. But this is an invalid (more
> precisely, irrelevant) argument. Why? Because the defrag operation has to
> focus on doing what it can do, while honoring user's will. All
> within-subvolume sharing is user-requested, therefore it cannot be part of
> the argument to unshare.
> 
> You can't both perfectly defrag and honor deduplication. Therefore, the
> defrag has to do the best possible thing while still honoring user's will.
> <<<!!! So, the fact that the deduplication was performed is actually the
> reason FOR not unsharing, not against it, as you made it look in that
> paragraph. !!!>>>

IMHO the current kernel 'defrag' API shouldn't be used any more.  We need
a tool that handles dedupe and defrag at the same time, for precisely
this reason:  currently the two operations have no knowledge of each
other and duplicate or reverse each other's work.  You don't need to defrag
an extent if you can find a duplicate, and you don't want to use fragmented
extents as dedupe sources.

> If the system unshares automatically after deduplication, then the user will
> need to run deduplication again. Ridiculous!
> 
> > > When a user creates a reflink to a file in the same subvolume, he is
> > > willingly denying himself the assurance of a perfect defrag.
> > > Because, as your example proves, if there are a few writes to BOTH
> > > files, it gets impossible to defrag perfectly. So, if the user
> > > creates such reflinks, it's his own wish and his own fault.
> 
> > The same argument can be made about snapshots.  It's an invalid argument
> > in both cases though because it's not always the user who's creating the
> > reflinks or snapshots.
> 
> Um, I don't agree.
> 
> 1) Actually, it is always the user who is creating reflinks, and snapshots,
> too. Ultimately, it's always the user who does absolutely everything,
> because a computer is supposed to be under his full control. But, in the
> case of reflink-copies, this is even more true
> because reflinks are not an essential feature for normal OS operation, at
> least as far as today's OSes go. Every OS has to copy files around. Every OS
> requires the copy operation. No current OS requires the reflinked-copy
> operation in order to function.

If we don't do reflinks all day, every day, our disks fill up in a matter
of hours...

> 2) A user can make any number of snapshots and subvolumes, but he can at any
> time select one subvolume as a focus of the defrag operation, and that
> subvolume can be perfectly defragmented without any unsharing (except that
> the internal-reflinked files won't be perfectly defragmented).
> Therefore, the snapshotting operation can never jeopardize a perfect defrag.
> The user can make many snapshots without any fears (I'd say a total of 100
> snapshots at any point in time is a good and reasonable limit).
> 
> > > Such situations will occur only in some specific circumstances:
> > > a) when the user is reflinking manually
> > > b) when a file is copied from one subvolume into a different file in
> > > a different subvolume.
> > > 
> > > The situation a) is unusual in normal use of the filesystem. Even
> > > when it occurs, it is the explicit command given by the user, so he
> > > should be willing to accept all the consequences, even the bad ones
> > > like imperfect defrag.
> > > 
> > > The situation b) is possible, but as far as I know copies are
> > > currently not done that way in btrfs. There should probably be the
> > > option to reflink-copy files from another subvolume, that would be
> > > good.
> > > 
> > > But anyway, it doesn't matter. Because most of the sharing will be
> > > between subvolumes, not within subvolume. So, if there is some
> > > in-subvolume sharing, the defrag won't be 100% perfect, that's a minor
> > > point. Unimportant.
> 
> > You're focusing too much on your own use case here.
> 
> It's so easy to say that. But you really don't know. You might be wrong. I
> might be the objective one, and you might be giving me some
> groupthink-induced, badly thought-out conclusions from years ago, which were
> never rechecked because that's so hard to do. And then everybody just
> repeats it and it becomes the truth. As Goebbels said, if you repeat anything
> enough times, it becomes the truth.
> 
> > Not everybody uses snapshots, and there are many people who are using
> > reflinks very actively within subvolumes, either for deduplication or
> > because it saves time and space when dealing with multiple copies of
> > mostly identical trees of files.
> 
> Yes, I guess there are many such users. Doesn't matter. What you are
> proposing is that the defrag should break all their reflinks and
> deduplicated data they painstakingly created. Come on!
> 
> Or, maybe the defrag should unshare to gain performance? Yes, but only WHEN
> USER REQUESTS IT. So the defrag can unshare,
> but only by request. Since this means that user is reversing his previous
> command to not unshare, this has to be explicitly requested by the user, not
> part of the default defrag operation.
> 
> 
> > As mentioned in the previous email, we actually did have a (mostly)
> > working reflink-aware defrag a few years back.  It got removed because
> > it had serious performance issues.  Note that we're not talking a few
> > seconds of extra time to defrag a full tree here, we're talking
> > double-digit _minutes_ of extra time to defrag a moderate sized (low
> > triple digit GB) subvolume with dozens of snapshots, _if you were lucky_
> > (if you weren't, you would be looking at potentially multiple _hours_ of
> > runtime for the defrag).  The performance scaled inversely proportionate
> > to the number of reflinks involved and the total amount of data in the
> > subvolume being defragmented, and was pretty bad even in the case of
> > only a couple of snapshots.
> > 
> > Ultimately, there are a couple of issues at play here:
> 
> I'll reply to this in another post. This one is getting a bit too long.
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 21:37                   ` Zygo Blaxell
@ 2019-09-11 23:21                     ` webmaster
  2019-09-12  0:10                       ` Remi Gauvin
  2019-09-12  5:19                       ` Zygo Blaxell
  0 siblings, 2 replies; 111+ messages in thread
From: webmaster @ 2019-09-11 23:21 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Austin S. Hemmelgarn, linux-btrfs


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Wed, Sep 11, 2019 at 01:20:53PM -0400, webmaster@zedlx.com wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>> > On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>> > >
>> > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>> > >
>>
>> > >
>> > > === I CHALLENGE you and anyone else on this mailing list: ===
>> > >
>> > >  - Show me an example where splitting an extent requires unsharing,
>> > > and this split is needed to defrag.
>> > >
>> > > Make it clear, write it yourself, I don't want any machine-made outputs.
>> > >
>> > Start with the above comment about all writes unsharing the region being
>> > written to.
>> >
>> > Now, extrapolating from there:
>> >
>> > Assume you have two files, A and B, each consisting of 64 filesystem
>> > blocks in single shared extent.  Now assume somebody writes a few bytes
>> > to the middle of file B, right around the boundary between blocks 31 and
>> > 32, and that you get similar writes to file A straddling blocks 14-15
>> > and 47-48.
>> >
>> > After all of that, file A will be 5 extents:
>> >
>> > * A reflink to blocks 0-13 of the original extent.
>> > * A single isolated extent consisting of the new blocks 14-15
>> > * A reflink to blocks 16-46 of the original extent.
>> > * A single isolated extent consisting of the new blocks 47-48
>> > * A reflink to blocks 49-63 of the original extent.
>> >
>> > And file B will be 3 extents:
>> >
>> > * A reflink to blocks 0-30 of the original extent.
>> > * A single isolated extent consisting of the new blocks 31-32.
>> > * A reflink to blocks 33-63 of the original extent.
>> >
>> > Note that there are a total of four contiguous sequences of blocks that
>> > are common between both files:
>> >
>> > * 0-13
>> > * 16-30
>> > * 33-46
>> > * 49-63
>> >
>> > There is no way to completely defragment either file without splitting
>> > the original extent (which is still there, just not fully referenced by
>> > either file) unless you rewrite the whole file to a new single extent
>> > (which would, of course, completely unshare the whole file).  In fact,
>> > if you want to ensure that those shared regions stay reflinked, there's
>> > no way to defragment either file without _increasing_ the number of
>> > extents in that file (either file would need 7 extents to properly share
>> > only those 4 regions), and even then only one of the files could be
>> > fully defragmented.
>> >
>> > Such a situation generally won't happen if you're just dealing with
>> > read-only snapshots, but is not unusual when dealing with regular files
>> > that are reflinked (which is not an uncommon situation on some systems,
>> > as a lot of people have `cp` aliased to reflink things whenever
>> > possible).
>>
>> Well, thank you very much for writing this example. Your example is
>> certainly not minimal, as it seems to me that one write to the file A and
>> one write to file B would be sufficient to prove your point, so there we
>> have one extra write in the example, but that's OK.
>>
>> Your example proves that I was wrong. I admit: it is impossible to perfectly
>> defrag one subvolume (in the way I imagined it should be done).
>> Why? Because, as in your example, there can be files within a SINGLE
>> subvolume which share their extents with each other. I didn't consider such
>> a case.
>>
>> On the other hand, I judge this issue to be mostly irrelevant. Why? Because
>> most of the file sharing will be between subvolumes, not within a subvolume.
>> When a user creates a reflink to a file in the same subvolume, he is
>> willingly denying himself the assurance of a perfect defrag. Because, as
>> your example proves, if there are a few writes to BOTH files, it gets
>> impossible to defrag perfectly. So, if the user creates such reflinks, it's
>> his own wish and his own fault.
>>
>> Such situations will occur only in some specific circumstances:
>> a) when the user is reflinking manually
>> b) when a file is copied from one subvolume into a different file in a
>> different subvolume.
>>
>> The situation a) is unusual in normal use of the filesystem. Even when it
>> occurs, it is the explicit command given by the user, so he should be
>> willing to accept all the consequences, even the bad ones like imperfect
>> defrag.
>>
>> The situation b) is possible, but as far as I know copies are currently not
>> done that way in btrfs. There should probably be the option to reflink-copy
>> files from another subvolume, that would be good.
>
> Reflink copies across subvolumes have been working for years.  They are
> an important component that makes dedupe work when snapshots are present.

I take it that what you say is true, but what I said is that when a user 
(or application) makes a normal copy from one subvolume to another, then 
it won't be a reflink-copy. To make such a reflink-copy, you need a 
btrfs-aware cp or btrfs-aware applications.

So, the reflink-copy is a special case, usually explicitly requested by 
the user.
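
For concreteness, the explicit reflink-copy that such a btrfs-aware cp
performs boils down to the FICLONE ioctl (the ioctl behind cp --reflink);
a minimal sketch with made-up paths:

/* Rough sketch of an explicit reflink-copy, the kind of thing a
 * btrfs-aware cp does (cp --reflink uses the FICLONE ioctl).
 * Paths are placeholders.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("/subvol-a/data.img", O_RDONLY);
    int dst = open("/subvol-b/data.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* Make dst reference the same extents as src: no data is copied,
     * and the two files share everything until one of them is written. */
    if (ioctl(dst, FICLONE, src) < 0)
        perror("FICLONE");

    close(src);
    close(dst);
    return 0;
}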

>> But anyway, it doesn't matter. Because most of the sharing will be between
>> subvolumes, not within subvolume.
>
> Heh.  I'd like you to meet one of my medium-sized filesystems:
>
> 	Physical size:  8TB
> 	Logical size:  16TB
> 	Average references per extent:  2.03 (not counting snapshots)
> 	Workload:  CI build server, VM host
>
> That's a filesystem where over half of the logical data is reflinks to the
> other physical data, and 94% of that data is in a single subvol.  7.5TB of
> data is unique, the remaining 500GB is referenced an average of 17 times.
>
> We use ordinary applications to make ordinary copies of files, and do
> tarball unpacks and source checkouts with reckless abandon, all day long.
> Dedupe turns the copies into reflinks as we go, so every copy becomes
> a reflink no matter how it was created.
>
> For the VM filesystem image files, it's not uncommon to see a high
> reflink rate within a single file as well as reflinks to other files
> (like the binary files in the build directories that the VM images are
> constructed from).  Those reference counts can go into the millions.

OK, but that cannot be helped: either you retain the sharing structure  
with imperfect defrag, or you unshare and produce a perfect defrag  
which should have somewhat better performance (and pray that the disk  
doesn't fill up).

>> So, if there is some in-subvolume sharing,
>> the defrag won't be 100% perfect, that's a minor point. Unimportant.
>
> It's not unimportant; however, the implementation does have to take this
> into account, and make sure that defrag can efficiently skip extents that
> are too expensive to relocate.  If we plan to read an extent fewer than
> 100 times, it makes no sense to update 20000 references to it--we spend
> less total time just doing the 100 slower reads.

Not necessarily. Because you can defrag at a time of day when there 
is low pressure on the disk IO, so updating 20000 references is 
essentially free.

You are just making those later 100 reads faster.

OK, you are right, there is some limit, but this is such a rare case 
that such heavily-referenced extents are best left untouched.
I suggest something along these lines: if there are more than XX 
(where XX defaults to 1000) reflinks to an extent, then one or more 
copies of the extent should be made such that each has fewer than XX 
reflinks to it. The number XX should be user-configurable.
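
As a trivial illustration of that policy (the numbers are examples only):

/* Trivial illustration of the proposed reference cap "XX": how many
 * physical copies of an extent would be needed so that no copy carries
 * more than the configured limit of reflinks.  Numbers are examples.
 */
#include <stdio.h>

static unsigned long copies_needed(unsigned long refs, unsigned long cap)
{
    return (refs + cap - 1) / cap;   /* ceil(refs / cap) */
}

int main(void)
{
    unsigned long cap = 1000;                   /* the default XX */
    printf("%lu\n", copies_needed(950, cap));   /* 1: leave as is    */
    printf("%lu\n", copies_needed(20000, cap)); /* 20 copies needed  */
    return 0;
}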

> If the numbers are
> reversed then it's better to defrag the extent--100 reference updates
> are easily outweighed by 20000 faster reads.  The kernel doesn't have
> enough information to make good decisions about this.

So, just make the number XX user-provided.

> Dedupe has a similar problem--it's rarely worth doing a GB of IO to
> save 4K of space, so in practical implementations, a lot of duplicate
> blocks have to remain duplicate.
>
> There are some ways to make the kernel dedupe and defrag API process
> each reference a little more efficiently, but none will get around this
> basic physical problem:  some extents are just better off where they are.

OK. If you don't touch those extents, they are still shared. That's  
what I wanted.

> Userspace has access to some extra data from the user, e.g.  "which
> snapshots should have their references excluded from defrag because
> the entire snapshot will be deleted in a few minutes."  That will allow
> better defrag cost-benefit decisions than any in-kernel implementation
> can make by itself.

Yes, but I think that we are going into too much detail, which is 
diverting the attention from the overall picture and from the big problems.

And the big problem here is: what do we want defrag to do in the general, 
most common cases? Because we still haven't agreed on that one, since 
many of the people here are ardent followers of the 
defrag-by-unsharing ideology.

> 'btrfs fi defrag' is just one possible userspace implementation, which
> implements the "throw entire files at the legacy kernel defrag API one
> at a time" algorithm.  Unfortunately, nobody seems to have implemented
> any other algorithms yet, other than a few toy proof-of-concept demos.

I really don't have a clue what's happening, but if I were to start
working on it (which I won't), then the first things should be:
- creating a way for btrfs to split large extents into smaller ones
(for easier defrag, as a first phase).
- creating a way for btrfs to merge small adjacent extents shared by
the same files into larger extents (as the last phase of defragmenting
a file).
- creating a structure (associative array) for defrag that can track
backlinks. Keep the structure updated with each filesystem change, by
placing hooks in the filesystem-update routines.

You can't go wrong with this. Whatever details change about the
defrag operation, these three things will be needed by defrag.
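As an illustration only, here is roughly how I picture those three
primitives (a Python sketch with made-up names; none of this exists in
btrfs today, and the kernel may already provide parts of it in other
forms):

    from collections import defaultdict

    class DefragPrimitives:
        # Illustrative interface only; a real implementation would
        # live in the kernel, not in Python.

        def split_extent(self, extent_id, max_blocks):
            # First phase: split one large extent into pieces of at
            # most max_blocks, updating every reference to it.
            raise NotImplementedError

        def merge_extents(self, extent_ids):
            # Last phase: merge small adjacent extents of the same
            # file into one larger extent, again updating every
            # reference.
            raise NotImplementedError

    class BacklinkMap:
        # Associative array: extent id -> set of (inode, offset) refs,
        # kept in sync by hooks placed in filesystem-update routines.
        def __init__(self):
            self._refs = defaultdict(set)

        def on_ref_added(self, extent_id, inode, offset):
            self._refs[extent_id].add((inode, offset))

        def on_ref_removed(self, extent_id, inode, offset):
            self._refs[extent_id].discard((inode, offset))

        def refs_of(self, extent_id):
            return self._refs[extent_id]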

>> Now, to retain the original sharing structure, the defrag has to change the
>> reflink of extent E55 in file B to point to E70. You are telling me this is
>> not possible? Bullshit!
>
> This is already possible today and userspace tools can do it--not as
> efficiently as possible, but without requiring more than 128M of temporary
> space.  'btrfs fi defrag' is not one of those tools.
>
>> Please explain to me how this 'defrag has to unshare' story of yours isn't
>> an intentional attempt to mislead me.
>
> Austin is talking about the btrfs we have, not the btrfs we want.

OK, but then, you agree with me that the current defrag is a joke. I
mean, something is better than nothing, and the current defrag isn't
completely useless, but in most circumstances it is either unusable or
not good enough.

I mean, snapshots are a prime feature of btrfs. If not, then why
bother with b-trees? If you wanted subvolumes, checksums and RAID,
then you should have made ext5. B-trees are in btrfs so that there can
be snapshots. But the current defrag works badly with snapshots. It
doesn't defrag them well, and it also unshares data. Bad bad bad.

And if you wanted to be honest with your users, why don't you place
this info in the wiki? OK, the wiki says "defrag will unshare", but it
doesn't say that it also doesn't defragment well.

For example, let's examine the typical home user. If he is using
btrfs, it means he probably wants snapshots of his data. And, after a
few snapshots, his data is fragmented, and the current defrag can't
help because it does a terrible job in this particular case.

So why don't you write on the wiki "defrag is practically unusable if
you use snapshots"? Because that is the truth. Be honest.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 23:21                     ` webmaster
@ 2019-09-12  0:10                       ` Remi Gauvin
  2019-09-12  3:05                         ` webmaster
  2019-09-12  5:19                       ` Zygo Blaxell
  1 sibling, 1 reply; 111+ messages in thread
From: Remi Gauvin @ 2019-09-12  0:10 UTC (permalink / raw)
  To: webmaster, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1396 bytes --]

On 2019-09-11 7:21 p.m., webmaster@zedlx.com wrote:


> 
> For example, lets examine the typical home user. If he is using btrfs,
> it means he probably wants snapshots of his data. And, after a few
> snapshots, his data is fragmented, and the current defrag can't help
> because it does a terrible job in this particualr case.
> 

I shouldn't be replying to your provocative posts, but this is just
nonsense.

Not to say that defragmentation can't be better or smarter, but it
happens to work very well for typical use.

This sounds like you're implying that snapshots fragment data... can
you explain that? As far as I know, snapshotting has nothing to do
with fragmentation of data. All data is COW, and all files that are
subject to random reads and writes will be fragmented, with or without
snapshots.

And running defrag on your system regularly works just fine. There's
a little space overhead if you are taking regular snapshots (say,
hourly snapshots with snapper). If you have more control/liberty over
when you take your snapshots, ideally you would defrag before taking
the snapshot/reflink copy. Again, this only matters for files that are
subject to fragmentation in the first place.

I suspect if you actually tried using the btrfs defrag, you would find
you are making a mountain out of a molehill. There are lots of far
more important problems to solve.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12  0:10                       ` Remi Gauvin
@ 2019-09-12  3:05                         ` webmaster
  2019-09-12  3:30                           ` Remi Gauvin
  0 siblings, 1 reply; 111+ messages in thread
From: webmaster @ 2019-09-12  3:05 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: linux-btrfs


Quoting Remi Gauvin <remi@georgianit.com>:

> On 2019-09-11 7:21 p.m., webmaster@zedlx.com wrote:
>
>> For example, lets examine the typical home user. If he is using btrfs,
>> it means he probably wants snapshots of his data. And, after a few
>> snapshots, his data is fragmented, and the current defrag can't help
>> because it does a terrible job in this particualr case.
>>
>
> I shouldn't be replying to your provocative posts, but this is just
> nonsense.

I really hope that I'm not as bad a person as that sentence suggests.
About the provocative posts, I don't know of any other way to get my
thoughts across.
If I offended people, I apologize, but I cannot change the way I communicate.

Certainly, I like the btrfs filesystem and the features it offers,
and I'll continue using it. No matter what you think of me, I want to
say thanks to you guys who are making it all work.

>  Not to say that Defragmentation can't be better, smarter,, it happens
> to work very well for typical use.

My thought is that the only reason it appears to work is that a
typical home user rarely needs defragmentation. He runs "btrfs fi
defrag", virtually nothing happens (a few log files get defragged; if
they were shared, they are now unshared), it prints out "Done", and
the user is happy. Placebo effect.

> This sounds like you're implying that snapshots fragment data... can you
> explain that?  as far as I know, snapshotting has nothing to do with
> fragmentation of data.  All data is COW, and all files that are subject
> to random read write will be fragmented, with or without snapshots.

Close, but essentially: yes. I'm implying that snapshots induce
future fragmentation. The mere act of snapshotting won't create
fragments immediately, but if there are any future writes to
previously snapshotted files, those writes are likely to cause
fragmentation. I think that this is not hard to figure out, but if
you wish, I can elaborate further.

The real question is: does it really matter? Looking at the typical
home user, most of his files rarely change; they are rarely written
to. More likely, most new writes will go to new files. So maybe the
"home user" is not the best case study for defragmentation. He has to
be at least some kind of power user, or a content creator, to
experience any significant fragmentation.

> And running defrag on your system regularly works just fine.  There's a
> little overhead of space if you are taking regular snapshots, (say
> hourly snapshots with snapper.)  If you have more control/liberty when
> you take your snapshots, ideally, you would defrag before taking the
> snaptshop/reflink copy.  Again, this only matters to files that are
> subject to fragmentation in the first place.

Btrfs defrag works just fine until you get some serious
fragmentation. At that point, if you happen to have some snapshots,
you had better delete them before running defrag. Because, if you do
run defrag on a snapshotted and heavily fragmented filesystem, you are
going to run out of disk space really fast.

> I suspect if you actually tried using the btrfs defrag, you would find
> you are making a mountain of a molehill.. There are lots of far more
> important problems to solve.

About importance, well, maybe you are right there, maybe not. Somehow
I guess that after so many years in development and a stable feature
set, most remaining problems are bugs and trivialities. So you are
fixing them one by one; many of them are urgent. I see you are working
on deduplication; that's a hell of a lot of work, which actually won't
end up well if it is not supplemented by a good defrag.

Didn't someone say, earlier in this discussion, that defrag is
important for btrfs? I would guess that it is. On many OSes, defrag is
run automatically. All older filesystems have a pretty good defrag.

What I would say is that btrfs can have a much better defrag than it
has now. If defrag is deemed important, then why are improvements to
defrag unimportant?



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12  3:05                         ` webmaster
@ 2019-09-12  3:30                           ` Remi Gauvin
  2019-09-12  3:33                             ` Remi Gauvin
  0 siblings, 1 reply; 111+ messages in thread
From: Remi Gauvin @ 2019-09-12  3:30 UTC (permalink / raw)
  To: webmaster, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1979 bytes --]

On 2019-09-11 11:05 p.m., webmaster@zedlx.com wrote:

> 
> Close, but essentially: yes. I'm implying that snapshots induce future
> fragmentation. The mere act of snapshoting won't create fragments
> immediately, but if there are any future writes to previously snapshoted
> files, those writes are likely to cause fragmentation. I think that this
> is not hard to figure out, but if you wish, I can elaborate further.

You'll have to, because the only way snapshots contribute to future
fragmentation is if you use the NoCow attribute (an entirely different
kettle of fish there).

> 
> The real question is: does it really matter? Looking at the typical home
> user, most of his files rarely change, they are rarely written to. More
> likely, most new writes will go to new files. So, maybe the "home user"
> is not the best study-case for defragmentation. He has to be at least
> some kind of power-user, or content-creator to experience any
> significant fragmentation.

Torrent downloaders should make an *excellent* case study, and they are not uncommon.


> 
> Btrfs defrag works just fine until you get some serious fragmentation.
> At that point, if you happen to have some snapshots, you better delete
> them before running defrag. Because, if you do run defrag on snapshoted
> and heavily fragmented filesystem, you are going to run out of disk
> space really fast.
> 

Agreed that if you have large files subject to fragmentation (a
special use case for which BTRFS is arguably not the best fit, at
least in terms of performance), you need to take special care with
fragmentation, i.e., defrag before snapshotting when possible.


> 
> Didn't someone say, earlier in this discussion, that the defrag is
> important for btrfs. I would guess that it is. On many OSes defrag is
> run automatically. All older filesystems have a pretty good defrag.

This statement makes me wonder if you really belong on a Linux
Development list.



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12  3:30                           ` Remi Gauvin
@ 2019-09-12  3:33                             ` Remi Gauvin
  0 siblings, 0 replies; 111+ messages in thread
From: Remi Gauvin @ 2019-09-12  3:33 UTC (permalink / raw)
  To: webmaster, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 247 bytes --]

On 2019-09-11 11:30 p.m., Remi Gauvin wrote:

> 
> This statement makes me wonder if you really belong on a Linux
> Development list.
> 
> 

This is why I should avoid getting into debates, ha.. ext4 does now
have defrag.. sorry :)



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 23:21                     ` webmaster
  2019-09-12  0:10                       ` Remi Gauvin
@ 2019-09-12  5:19                       ` Zygo Blaxell
  2019-09-12 21:23                         ` General Zed
  1 sibling, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-12  5:19 UTC (permalink / raw)
  To: webmaster; +Cc: Austin S. Hemmelgarn, linux-btrfs

On Wed, Sep 11, 2019 at 07:21:31PM -0400, webmaster@zedlx.com wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> 
> > On Wed, Sep 11, 2019 at 01:20:53PM -0400, webmaster@zedlx.com wrote:
> > > 
> > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > 
> > > > On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> > > > >
> > > > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > > >
> > > 
> > > > >
> > > > > === I CHALLENGE you and anyone else on this mailing list: ===
> > > > >
> > > > >  - Show me an exaple where splitting an extent requires unsharing,
> > > > > and this split is needed to defrag.
> > > > >
> > > > > Make it clear, write it yourself, I don't want any machine-made outputs.
> > > > >
> > > > Start with the above comment about all writes unsharing the region being
> > > > written to.
> > > >
> > > > Now, extrapolating from there:
> > > >
> > > > Assume you have two files, A and B, each consisting of 64 filesystem
> > > > blocks in single shared extent.  Now assume somebody writes a few bytes
> > > > to the middle of file B, right around the boundary between blocks 31 and
> > > > 32, and that you get similar writes to file A straddling blocks 14-15
> > > > and 47-48.
> > > >
> > > > After all of that, file A will be 5 extents:
> > > >
> > > > * A reflink to blocks 0-13 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 14-15
> > > > * A reflink to blocks 16-46 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 47-48
> > > > * A reflink to blocks 49-63 of the original extent.
> > > >
> > > > And file B will be 3 extents:
> > > >
> > > > * A reflink to blocks 0-30 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 31-32.
> > > > * A reflink to blocks 32-63 of the original extent.
> > > >
> > > > Note that there are a total of four contiguous sequences of blocks that
> > > > are common between both files:
> > > >
> > > > * 0-13
> > > > * 16-30
> > > > * 32-46
> > > > * 49-63
> > > >
> > > > There is no way to completely defragment either file without splitting
> > > > the original extent (which is still there, just not fully referenced by
> > > > either file) unless you rewrite the whole file to a new single extent
> > > > (which would, of course, completely unshare the whole file).  In fact,
> > > > if you want to ensure that those shared regions stay reflinked, there's
> > > > no way to defragment either file without _increasing_ the number of
> > > > extents in that file (either file would need 7 extents to properly share
> > > > only those 4 regions), and even then only one of the files could be
> > > > fully defragmented.
> > > >
> > > > Such a situation generally won't happen if you're just dealing with
> > > > read-only snapshots, but is not unusual when dealing with regular files
> > > > that are reflinked (which is not an uncommon situation on some systems,
> > > > as a lot of people have `cp` aliased to reflink things whenever
> > > > possible).
> > > 
> > > Well, thank you very much for writing this example. Your example is
> > > certainly not minimal, as it seems to me that one write to the file A and
> > > one write to file B would be sufficient to prove your point, so there we
> > > have one extra write in the example, but that's OK.
> > > 
> > > Your example proves that I was wrong. I admit: it is impossible to perfectly
> > > defrag one subvolume (in the way I imagined it should be done).
> > > Why? Because, as in your example, there can be files within a SINGLE
> > > subvolume which share their extents with each other. I didn't consider such
> > > a case.
> > > 
> > > On the other hand, I judge this issue to be mostly irrelevant. Why? Because
> > > most of the file sharing will be between subvolumes, not within a subvolume.
> > > When a user creates a reflink to a file in the same subvolume, he is
> > > willingly denying himself the assurance of a perfect defrag. Because, as
> > > your example proves, if there are a few writes to BOTH files, it gets
> > > impossible to defrag perfectly. So, if the user creates such reflinks, it's
> > > his own whish and his own fault.
> > > 
> > > Such situations will occur only in some specific circumstances:
> > > a) when the user is reflinking manually
> > > b) when a file is copied from one subvolume into a different file in a
> > > different subvolume.
> > > 
> > > The situation a) is unusual in normal use of the filesystem. Even when it
> > > occurs, it is the explicit command given by the user, so he should be
> > > willing to accept all the consequences, even the bad ones like imperfect
> > > defrag.
> > > 
> > > The situation b) is possible, but as far as I know copies are currently not
> > > done that way in btrfs. There should probably be the option to reflink-copy
> > > files fron another subvolume, that would be good.
> > 
> > Reflink copies across subvolumes have been working for years.  They are
> > an important component that makes dedupe work when snapshots are present.
> 
> I take that what you say is true, but what I said is that when a user (or
> application) makes a
> normal copy from one subvolume to another, then it won't be a reflink-copy.
> To make such a reflink-copy, you need btrfs-aware cp or btrfs-aware
> applications.

It's the default for GNU coreutils, and for 'mv' across subvols there
is currently no option to turn reflink copies off.  Maybe for 'cp'
you still have to explicitly request reflink, but that will presumably
change at some point as more filesystems get the CLONE_RANGE ioctl and
more users expect it to just work by default.

> So, the reflik-copy is a special case, usually explicitly requested by the
> user.
> 
> > > But anyway, it doesn't matter. Because most of the sharing will be between
> > > subvolumes, not within subvolume.
> > 
> > Heh.  I'd like you to meet one of my medium-sized filesystems:
> > 
> > 	Physical size:  8TB
> > 	Logical size:  16TB
> > 	Average references per extent:  2.03 (not counting snapshots)
> > 	Workload:  CI build server, VM host
> > 
> > That's a filesystem where over half of the logical data is reflinks to the
> > other physical data, and 94% of that data is in a single subvol.  7.5TB of
> > data is unique, the remaining 500GB is referenced an average of 17 times.
> > 
> > We use ordinary applications to make ordinary copies of files, and do
> > tarball unpacks and source checkouts with reckless abandon, all day long.
> > Dedupe turns the copies into reflinks as we go, so every copy becomes
> > a reflink no matter how it was created.
> > 
> > For the VM filesystem image files, it's not uncommon to see a high
> > reflink rate within a single file as well as reflinks to other files
> > (like the binary files in the build directories that the VM images are
> > constructed from).  Those reference counts can go into the millions.
> 
> OK, but that cannot be helped: either you retain the sharing structure with
> imperfect defrag, or you unshare and produce a perfect defrag which should
> have somewhat better performance (and pray that the disk doesn't fill up).

One detail that might not have been covered in the rest of the discussion
is that btrfs extents are immutable.  e.g. if you create a 128MB file,
then truncate it to 4K, it still takes 128MB on disk because that 4K
block at the beginning holds a reference to the whole immutable 128MB
extent.  This holds even if the extent is not shared.  The 127.996MB of
unreferenced extent blocks are unreachable until the last 4K reference
is removed from the filesystem (deleted or overwritten).
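To put numbers on that (a trivial Python sketch; the 128MB figure is
just the example above, not a property of any particular filesystem):

    MIB = 1024 * 1024

    def allocated_bytes(extent_size, referenced_bytes):
        # A btrfs data extent is immutable: it stays allocated in full
        # as long as at least one block of it is still referenced.
        return extent_size if referenced_bytes > 0 else 0

    extent = 128 * MIB            # original 128MB write, one extent
    still_referenced = 4096       # file then truncated down to 4K
    waste = allocated_bytes(extent, still_referenced) - still_referenced
    print(waste / MIB)            # ~127.996 MiB of unreachable blocks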

There is no "split extent in place" operation on btrfs because the
assumption that extents are immutable is baked into the code.  Changing
this has been discussed a few times.  The gotcha is that some fairly
ordinary write cases can trigger updates of all extent references if
extents were split automatically as soon as blocks become unreachable.
There's non-trivial CPU overhead to determine whether blocks are
unreachable, making it unsuitable to run the garbage collector during
ordinary writes.  So for now, the way to split an extent is to copy the
data you want to keep from the extent into some new extents, then remove
all references to the old extent.

To maintain btrfs storage efficiency you have to do a *garbage collection*
operation that there is currently no tool for.  This tool would find
unreachable blocks and split extents around them, releasing the entire
extent (including unreachable blocks) back to the filesystem.  Defrag is
sometimes used for its accidental garbage collection side-effects (it
will copy all the remaining reachable blocks to new extents and reduce
total space usage if the extents were not shared), but the current defrag
is not designed for GC, and doesn't do a good job.

Dedupe on btrfs also requires the ability to split and merge extents;
otherwise, we can't dedupe an extent that contains a combination of
unique and duplicate data.  If we try to just move references around
without splitting extents into all-duplicate and all-unique extents,
the duplicate blocks become unreachable, but are not deallocated.  If we
only split extents, fragmentation overhead gets bad.  Before creating
thousands of references to an extent, it is worthwhile to merge it with
as many of its neighbors as possible, ideally by picking the biggest
existing garbage-free extents available so we don't have to do defrag.
As we examine each extent in the filesystem, it may be best to send
to defrag, dedupe, or garbage collection--sometimes more than one of
those.

As extents get bigger, the seeking and overhead to read them gets smaller.
I'd want to defrag many consecutive 4K extents, but I wouldn't bother
touching 256K extents unless they were in high-traffic files, nor would I
bother combining only 2 or 3 4K extents together (there would be around
400K of metadata IO overhead to do so--likely more than what is saved
unless the file is very frequently read sequentially).  The incremental
gains are inversely proportional to size, while the defrag cost is
directly proportional to size.
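As a sketch of that heuristic (Python, with the thresholds from this
paragraph hard-coded purely for illustration; real numbers would have
to be tuned):

    def should_defrag(extent_sizes, high_traffic=False):
        # Rough cost/benefit: merging pays off for runs of small
        # extents, and for mid-sized extents only in high-traffic files.
        MIN_USEFUL = 100 * 1024        # below ~100K: seek/metadata overhead
        MAX_USEFUL = 4 * 1024 * 1024   # beyond a few MB: little gain left
        small = [s for s in extent_sizes if s < MIN_USEFUL]
        if len(small) <= 3 and not high_traffic:
            return False      # combining only 2-3 small extents: not worth it
        return any(s < MAX_USEFUL for s in extent_sizes)

    print(should_defrag([4096] * 50))          # True: many tiny extents
    print(should_defrag([256 * 1024] * 10))    # False: leave them alone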

> > > So, if there is some in-subvolume sharing,
> > > the defrag wont be 100% perfect, that a minor point. Unimportant.
> > 
> > It's not unimportant; however, the implementation does have to take this
> > into account, and make sure that defrag can efficiently skip extents that
> > are too expensive to relocate.  If we plan to read an extent fewer than
> > 100 times, it makes no sense to update 20000 references to it--we spend
> > less total time just doing the 100 slower reads.
> 
> Not necesarily. Because you can defrag in the time-of-day when there is a
> low pressure on the disk IO, so updating 20000 references is esentially
> free.
> 
> You are just making those later 100 reads faster.
> 
> OK, you are right, there is some limit, but this is such a rare case, that
> such a heavily-referenced extents are best left untouched.

For you it's rare.  For other users, it's expected behavior--dedupe is
just a thing many different filesystems do now.

Also, quite a few heavily-fragmented files only ever get read once,
or are read only during low-cost IO times (e.g. log files during
maintenance windows).  For those, a defrag is pure wasted iops.

> I suggest something along these lines: if there are more than XX (where XX
> defaults to 1000) reflinks to an extent, then one or more copies of the
> extent should be made such that each has less than XX reflinks to it. The
> number XX should be user-configurable.

That would depend on where those extents are, how big the references
are, etc.  In some cases a million references are fine, in other cases
even 20 is too many.  After doing the analysis for each extent in a
filesystem, you'll probably get an average around some number, but
setting the number first is putting the cart before the horse.

> > If the numbers are
> > reversed then it's better to defrag the extent--100 reference updates
> > are easily outweighed by 20000 faster reads.  The kernel doesn't have
> > enough information to make good decisions about this.
> 
> So, just make the number XX user-provided.
> 
> > Dedupe has a similar problem--it's rarely worth doing a GB of IO to
> > save 4K of space, so in practical implementations, a lot of duplicate
> > blocks have to remain duplicate.
> > 
> > There are some ways to make the kernel dedupe and defrag API process
> > each reference a little more efficiently, but none will get around this
> > basic physical problem:  some extents are just better off where they are.
> 
> OK. If you don't touch those extents, they are still shared. That's what I
> wanted.
> 
> > Userspace has access to some extra data from the user, e.g.  "which
> > snapshots should have their references excluded from defrag because
> > the entire snapshot will be deleted in a few minutes."  That will allow
> > better defrag cost-benefit decisions than any in-kernel implementation
> > can make by itself.
> 
> Yes, but I think that we are going into too much details which are diverting
> the attention from the overall picture and from big problems.
> 
> And the big problem here is: what do we want defrag to do in general, most
> common cases. Because we haven't still agreed on that one since many of the
> people here are ardent followers of the defrag-by-unsharing ideology.

Well, that's the current implementation, so most people are familiar with it.

I have a userspace library that lets applications work in terms of extents
and their sequential connections to logical neighbors, which it extracts
from the reference data in the filesystem trees.  The caller says "move
extent A50..A70, B10..B20, and C90..C110 to a new physically contiguous
location" and the library translates that into calls to the current kernel
API to update all the references to those extents, at least one of which
is a single contiguous reference to the entire new extent.  To defrag,
it walks over extent adjacency DAGs and connects short extents together
into longer ones, and if it notices opportunities to get rid of extents
entirely by dedupe then it does that instead of defrag.  Extents with
unreachable blocks are garbage collected.  If I ever get it done, I'll
propose the kernel provide the library's top-level interface directly,
then drop the userspace emulation layer when that kernel API appears.

> > 'btrfs fi defrag' is just one possible userspace implementation, which
> > implements the "throw entire files at the legacy kernel defrag API one
> > at a time" algorithm.  Unfortunately, nobody seems to have implemented
> > any other algorithms yet, other than a few toy proof-of-concept demos.
> 
> I really don't have a clue what's happening, but if I were to start working
> on it (which I won't), then the first things should be:

> - creating a way for btrfs to split large extents into smaller ones (for
> easier defrag, as first phase).

> - creating a way for btrfs to merge small adjanced extents shared by the
> same files into larger extents (as the last phase of defragmenting a file).

Those were the primitives I found useful as well.  Well, primitive,
singular--in the absence of the ability to split an extent in place,
the same function can do both split and merge operations.  Input is a
list of block ranges, output is one extent containing data from those
block ranges, side effect is to replace all the references to the old
data with references to the new data.

Add an operation to replace references to data at extent A with
references to data at extent B, and an operation to query extent reference
structure efficiently, and you have all the ingredients of an integrated
dedupe/defrag/garbage collection tool for btrfs (analogous to the XFS
"fsr" process).

> - create a structure (associative array) for defrag that can track
> backlinks. Keep the structure updated with each filesystem change, by
> placing hooks in filesystem-update routines.

No need.  btrfs already maintains backrefs--they power the online device shrink
feature and the LOGICAL_TO_INO ioctl.

> You can't go wrong with this. Whatever details change about defrag
> operation, the given three things will be needed by defrag.

I agree about 70% with what you said.  ;)

> > > Now, to retain the original sharing structure, the defrag has to change the
> > > reflink of extent E55 in file B to point to E70. You are telling me this is
> > > not possible? Bullshit!
> > 
> > This is already possible today and userspace tools can do it--not as
> > efficiently as possible, but without requiring more than 128M of temporary
> > space.  'btrfs fi defrag' is not one of those tools.
> > 
> > > Please explain to me how this 'defrag has to unshare' story of yours isn't
> > > an intentional attempt to mislead me.
> > 
> > Austin is talking about the btrfs we have, not the btrfs we want.
> 
> OK, but then, you agree with me that current defrag is a joke. I mean,
> something is better than nothing, and the current defrag isn't completely
> useless, but it is in most circumstances either unusable or not good enough.

I agree that the current defrag is almost useless.  Where we might
disagree is that I don't think the current defrag can ever be useful,
even if it did update all the references simultaneously.

> I mean, the snapshots are a prime feature of btrfs. If not, then why bother
> with b-trees? If you wanted subvolumes, checksums and RAID, then you should
> have made ext5. B-trees are in btrfs so that there can be snapshots. But,
> the current defrag works bad with snaphots. It doesn't defrag them well, it
> also unshares data. Bad bad bad.
> 
> And if you wanted to be honest to your users, why don't you place this info
> in the wiki? Ok, the wiki says "defrag will unshare", but it doesn't say
> that it also doesn't defragment well.

"Well" is the first thing users have trouble with.  They generally
overvalue defragmentation because they don't consider the costs.

There's a lower bound _and_ an upper bound on how big fragments can
usefully be.  The lower bound is somewhere around 100K (below that size
there is egregious seek and metadata overhead), and the upper bound is
a few MB at most, with a few special exceptions like torrent downloads
(where a large file is written in random order, then rearranged into
contiguous extents and never written again).  If you strictly try to
minimize the number of fragments, you can make a lot of disk space
unreachable without garbage collection.  Up to 32768 times the logical
file size can theoretically be wasted, though 25-200% is more common
in practice.  The bigger your target extent size, the more frequently
you have to defrag to maintain disk space usage at reasonable levels,
and the less benefit you get from each defrag run.
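Where the 32768x worst case comes from, assuming the usual 128MiB
maximum data extent and 4KiB blocks:

    MAX_EXTENT = 128 * 1024 * 1024   # largest btrfs data extent (assumed)
    BLOCK = 4096                     # smallest unit a reference keeps reachable

    # Worst case: a one-block file whose single reference pins an
    # entire full-sized extent of otherwise unreachable data.
    print(MAX_EXTENT // BLOCK)       # 32768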

> For example, lets examine the typical home user. If he is using btrfs, it
> means he probably wants snapshots of his data. And, after a few snapshots,
> his data is fragmented, and the current defrag can't help because it does a
> terrible job in this particualr case.

I'm not sure I follow.  Snapshots don't cause data fragmentation.
Discontiguous writes do (unless the filesystem is very full and the
only available free areas are small--see below).  The data will be
fragmented or not depending on the write pattern, not the presence or
absence of snapshots.  On the other hand, if snapshots are present, then
(overly aggressive) defrag will consume a large amount of disk space.
Maybe that's what you're getting at.

If the filesystem is very full and all free space areas are small,
you want to use balance to move all the data closer together so that
free space areas get bigger.  Defrag is the wrong tool for this.
Balance handles shared extent references just fine.

> So why don't you write on the wiki "the defrag is practically unusable in
> case you use snapshots". Because that is the truth. Be honest.

It depends on the size of the defragmented data.  If the only thing
you need to defrag is systemd journal files and the $HOME/.thunderbird
directory, then you can probably afford to store a few extra copies of
those on the disk.  It won't be possible to usefully defrag files like
those while they have shared extents anyway--too many small overlaps.

If you have a filesystem full of big VM image files then there's no
good solution yet.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 21:37                     ` webmaster
@ 2019-09-12 11:31                       ` Austin S. Hemmelgarn
  2019-09-12 19:18                         ` webmaster
  0 siblings, 1 reply; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-12 11:31 UTC (permalink / raw)
  To: webmaster; +Cc: linux-btrfs

On 2019-09-11 17:37, webmaster@zedlx.com wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>
>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>
>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>
>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>
> 
>>>> Given this, defrag isn't willfully unsharing anything, it's just a 
>>>> side-effect of how it works (since it's rewriting the block layout 
>>>> of the file in-place).
>>>
>>> The current defrag has to unshare because, as you said, because it is 
>>> unaware of the full reflink structure. If it doesn't know about all 
>>> reflinks, it has to unshare, there is no way around that.
>>>
>>>> Now factor in that _any_ write will result in unsharing the region 
>>>> being written to, rounded to the nearest full filesystem block in 
>>>> both directions (this is mandatory, it's a side effect of the 
>>>> copy-on-write nature of BTRFS, and is why files that experience 
>>>> heavy internal rewrites get fragmented very heavily and very quickly 
>>>> on BTRFS).
>>>
>>> You mean: when defrag performs a write, the new data is unshared 
>>> because every write is unshared? Really?
>>>
>>> Consider there is an extent E55 shared by two files A and B. The 
>>> defrag has to move E55 to another location. In order to do that, 
>>> defrag creates a new extent E70. It makes it belong to file A by 
>>> changing the reflink of extent E55 in file A to point to E70.
>>>
>>> Now, to retain the original sharing structure, the defrag has to 
>>> change the reflink of extent E55 in file B to point to E70. You are 
>>> telling me this is not possible? Bullshit!
>>>
>>> Please explain to me how this 'defrag has to unshare' story of yours 
>>> isn't an intentional attempt to mislead me.
> 
>> As mentioned in the previous email, we actually did have a (mostly) 
>> working reflink-aware defrag a few years back.  It got removed because 
>> it had serious performance issues.  Note that we're not talking a few 
>> seconds of extra time to defrag a full tree here, we're talking 
>> double-digit _minutes_ of extra time to defrag a moderate sized (low 
>> triple digit GB) subvolume with dozens of snapshots, _if you were 
>> lucky_ (if you weren't, you would be looking at potentially multiple 
>> _hours_ of runtime for the defrag).  The performance scaled inversely 
>> proportionate to the number of reflinks involved and the total amount 
>> of data in the subvolume being defragmented, and was pretty bad even 
>> in the case of only a couple of snapshots.
> 
> You cannot ever make the worst program, because an even worse program 
> can be made by slowing down the original by a factor of 2.
> So, you had a badly implemented defrag. At least you got some 
> experience. Let's see what went wrong.
> 
>> Ultimately, there are a couple of issues at play here:
>>
>> * Online defrag has to maintain consistency during operation.  The 
>> current implementation does this by rewriting the regions being 
>> defragmented (which causes them to become a single new extent (most of 
>> the time)), which avoids a whole lot of otherwise complicated logic 
>> required to make sure things happen correctly, and also means that 
>> only the file being operated on is impacted and only the parts being 
>> modified need to be protected against concurrent writes.  Properly 
>> handling reflinks means that _every_ file that shares some part of an 
>> extent with the file being operated on needs to have the reflinked 
>> regions locked for the defrag operation, which has a huge impact on 
>> performance. Using your example, the update to E55 in both files A and 
>> B has to happen as part of the same commit, which can contain no other 
>> writes in that region of the file, otherwise you run the risk of 
>> losing writes to file B that occur while file A is being defragmented.
> 
> Nah. I think there is a workaround. You can first (atomically) update A, 
> then whatever, then you can update B later. I know, your yelling "what 
> if E55 gets updated in B". Doesn't matter. The defrag continues later by 
> searching for reflink to E55 in B. Then it checks the data contained in 
> E55. If the data matches the E70, then it can safely update the reflink 
> in B. Or the defrag can just verify that neither E55 nor E70 have been 
> written to in the meantime. That means they still have the same data.
So, IOW, you don't care if the total space used by the data is 
instantaneously larger than what you started with?  That seems to be at 
odds with your previous statements, but OK, if we allow for that then 
this is indeed a non-issue.

> 
>> It's not horrible when it's just a small region in two files, but it 
>> becomes a big issue when dealing with lots of files and/or 
>> particularly large extents (extents in BTRFS can get into the GB range 
>> in terms of size when dealing with really big files).
> 
> You must just split large extents in a smart way. So, in the beginning, 
> the defrag can split large extents (2GB) into smaller ones (32MB) to 
> facilitate more responsive and easier defrag.
> 
> If you have lots of files, update them one-by one. It is possible. Or 
> you can update in big batches. Whatever is faster.
Neither will solve this though.  Large numbers of files are an issue 
because the operation is expensive and has to be done on each file, not 
because the number of files somehow makes the operation more expensive. 
It's O(n) relative to files, not higher time complexity.
> 
> The point is that the defrag can keep a buffer of a "pending 
> operations". Pending operations are those that should be performed in 
> order to keep the original sharing structure. If the defrag gets 
> interrupted, then files in "pending operations" will be unshared. But 
> this should really be some important and urgent interrupt, as the 
> "pending operations" buffer needs at most a second or two to complete 
> its operations.
Depending on the exact situation, it can take well more than a few 
seconds to complete stuff. Especially if there are lots of reflinks.
> 
>> * Reflinks can reference partial extents.  This means, ultimately, 
>> that you may end up having to split extents in odd ways during defrag 
>> if you want to preserve reflinks, and might have to split extents 
>> _elsewhere_ that are only tangentially related to the region being 
>> defragmented. See the example in my previous email for a case like 
>> this, maintaining the shared regions as being shared when you 
>> defragment either file to a single extent will require splitting 
>> extents in the other file (in either case, whichever file you don't 
>> defragment to a single extent will end up having 7 extents if you try 
>> to force the one that's been defragmented to be the canonical 
>> version).  Once you consider that a given extent can have multiple 
>> ranges reflinked from multiple other locations, it gets even more 
>> complicated.
> 
> I think that this problem can be solved, and that it can be solved 
> perfectly (the result is a perfectly-defragmented file). But, if it is 
> so hard to do, just skip those problematic extents in initial version of 
> defrag.
> 
> Ultimately, in the super-duper defrag, those partially-referenced 
> extents should be split up by defrag.
> 
>> * If you choose to just not handle the above point by not letting 
>> defrag split extents, you put a hard lower limit on the amount of 
>> fragmentation present in a file if you want to preserve reflinks.  
>> IOW, you can't defragment files past a certain point.  If we go this 
>> way, neither of the two files in the example from my previous email 
>> could be defragmented any further than they already are, because doing 
>> so would require splitting extents.
> 
> Oh, you're reading my thoughts. That's good.
> 
> Initial implementation of defrag might be not-so-perfect. It would still 
> be better than the current defrag.
> 
> This is not a one-way street. Handling of partially-used extents can be 
> improved in later versions.
> 
>> * Determining all the reflinks to a given region of a given extent is 
>> not a cheap operation, and the information may immediately be stale 
>> (because an operation right after you fetch the info might change 
>> things).  We could work around this by locking the extent somehow, but 
>> doing so would be expensive because you would have to hold the lock 
>> for the entire defrag operation.
> 
> No. DO NOT LOCK TO RETRIEVE REFLINKS.
> 
> Instead, you have to create a hook in every function that updates the 
> reflink structure or extents (for exaple, write-to-file operation). So, 
> when a reflink gets changed, the defrag is immediately notified about 
> this. That way the defrag can keep its data about reflinks in-sync with 
> the filesystem.
This doesn't get around the fact that it's still an expensive operation 
to enumerate all the reflinks for a given region of a file or extent.

It also allows a very real possibility of a user functionally delaying 
the defrag operation indefinitely (by triggering a continuous stream of 
operations that would cause reflink changes for a file being operated on 
by defrag) if not implemented very carefully.
> 
> Also note, this defrag should run as a part of the kernel, not in 
> userspace. Defrag-from-userspace is a nightmare. Defrag has to serialize 
> its operations properly, and it must have knowledge of all other 
> operations in progress. So, it can only operate efficiently as part of 
> the kernel.
Agreed on this point.


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 11:31                       ` Austin S. Hemmelgarn
@ 2019-09-12 19:18                         ` webmaster
  2019-09-12 19:44                           ` Chris Murphy
  2019-09-12 19:54                           ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 111+ messages in thread
From: webmaster @ 2019-09-12 19:18 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-11 17:37, webmaster@zedlx.com wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>>
>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>
>>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>>
>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>
>>
>>>>> Given this, defrag isn't willfully unsharing anything, it's just  
>>>>> a side-effect of how it works (since it's rewriting the block  
>>>>> layout of the file in-place).
>>>>
>>>> The current defrag has to unshare because, as you said, because  
>>>> it is unaware of the full reflink structure. If it doesn't know  
>>>> about all reflinks, it has to unshare, there is no way around that.
>>>>
>>>>> Now factor in that _any_ write will result in unsharing the  
>>>>> region being written to, rounded to the nearest full filesystem  
>>>>> block in both directions (this is mandatory, it's a side effect  
>>>>> of the copy-on-write nature of BTRFS, and is why files that  
>>>>> experience heavy internal rewrites get fragmented very heavily  
>>>>> and very quickly on BTRFS).
>>>>
>>>> You mean: when defrag performs a write, the new data is unshared  
>>>> because every write is unshared? Really?
>>>>
>>>> Consider there is an extent E55 shared by two files A and B. The  
>>>> defrag has to move E55 to another location. In order to do that,  
>>>> defrag creates a new extent E70. It makes it belong to file A by  
>>>> changing the reflink of extent E55 in file A to point to E70.
>>>>
>>>> Now, to retain the original sharing structure, the defrag has to  
>>>> change the reflink of extent E55 in file B to point to E70. You  
>>>> are telling me this is not possible? Bullshit!
>>>>
>>>> Please explain to me how this 'defrag has to unshare' story of  
>>>> yours isn't an intentional attempt to mislead me.
>>
>>> As mentioned in the previous email, we actually did have a  
>>> (mostly) working reflink-aware defrag a few years back.  It got  
>>> removed because it had serious performance issues.  Note that  
>>> we're not talking a few seconds of extra time to defrag a full  
>>> tree here, we're talking double-digit _minutes_ of extra time to  
>>> defrag a moderate sized (low triple digit GB) subvolume with  
>>> dozens of snapshots, _if you were lucky_ (if you weren't, you  
>>> would be looking at potentially multiple _hours_ of runtime for  
>>> the defrag).  The performance scaled inversely proportionate to  
>>> the number of reflinks involved and the total amount of data in  
>>> the subvolume being defragmented, and was pretty bad even in the  
>>> case of only a couple of snapshots.
>>
>> You cannot ever make the worst program, because an even worse  
>> program can be made by slowing down the original by a factor of 2.
>> So, you had a badly implemented defrag. At least you got some  
>> experience. Let's see what went wrong.
>>
>>> Ultimately, there are a couple of issues at play here:
>>>
>>> * Online defrag has to maintain consistency during operation.  The  
>>> current implementation does this by rewriting the regions being  
>>> defragmented (which causes them to become a single new extent  
>>> (most of the time)), which avoids a whole lot of otherwise  
>>> complicated logic required to make sure things happen correctly,  
>>> and also means that only the file being operated on is impacted  
>>> and only the parts being modified need to be protected against  
>>> concurrent writes.  Properly handling reflinks means that _every_  
>>> file that shares some part of an extent with the file being  
>>> operated on needs to have the reflinked regions locked for the  
>>> defrag operation, which has a huge impact on performance. Using  
>>> your example, the update to E55 in both files A and B has to  
>>> happen as part of the same commit, which can contain no other  
>>> writes in that region of the file, otherwise you run the risk of  
>>> losing writes to file B that occur while file A is being  
>>> defragmented.
>>
>> Nah. I think there is a workaround. You can first (atomically)  
>> update A, then whatever, then you can update B later. I know, your  
>> yelling "what if E55 gets updated in B". Doesn't matter. The defrag  
>> continues later by searching for reflink to E55 in B. Then it  
>> checks the data contained in E55. If the data matches the E70, then  
>> it can safely update the reflink in B. Or the defrag can just  
>> verify that neither E55 nor E70 have been written to in the  
>> meantime. That means they still have the same data.

> So, IOW, you don't care if the total space used by the data is  
> instantaneously larger than what you started with?  That seems to be  
> at odds with your previous statements, but OK, if we allow for that  
> then this is indeed a non-issue.

It is normal and common for a defrag operation to use some disk space
while it is running. I estimate that a reasonable limit would be to
use up to 1% of the total partition size. So, if the partition size is
100 GB, the defrag can use 1 GB. Let's call this the "defrag operation
space".

The defrag should, when started, verify that there is "sufficient
free space" on the partition. If there is not sufficient free space,
the defrag should output a message to the user and abort. The
"sufficient free space" must be larger than the "defrag operation
space"; I would estimate that a good limit is 2% of the partition
size. The "defrag operation space" is part of the "sufficient free
space" while the defrag operation is in progress.

If, during the defrag operation, free space drops below 2%, the
defrag should output a message and abort. Another possibility is for
defrag to pause until the user frees some disk space, but this is not
common in other defrag implementations AFAIK.
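A minimal sketch of that preflight check (Python; the 1% and 2%
defaults are just the numbers proposed above, and the helper is
hypothetical, not an existing btrfs tool):

    def defrag_preflight(partition_size, free_space,
                         op_space_pct=0.01, min_free_pct=0.02):
        # Abort unless there is "sufficient free space"; reserve the
        # "defrag operation space" out of it while defrag runs.
        if free_space < partition_size * min_free_pct:
            raise RuntimeError("not enough free space to defrag safely")
        return partition_size * op_space_pct   # working budget for defrag

    # 100 GB partition with 5 GB free -> roughly a 1 GB working budget.
    print(defrag_preflight(100e9, 5e9))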

>>> It's not horrible when it's just a small region in two files, but  
>>> it becomes a big issue when dealing with lots of files and/or  
>>> particularly large extents (extents in BTRFS can get into the GB  
>>> range in terms of size when dealing with really big files).
>>
>> You must just split large extents in a smart way. So, in the  
>> beginning, the defrag can split large extents (2GB) into smaller  
>> ones (32MB) to facilitate more responsive and easier defrag.
>>
>> If you have lots of files, update them one-by one. It is possible.  
>> Or you can update in big batches. Whatever is faster.

> Neither will solve this though.  Large numbers of files are an issue  
> because the operation is expensive and has to be done on each file,  
> not because the number of files somehow makes the operation more  
> espensive. It's O(n) relative to files, not higher time complexity.

I would say that updating in big batches helps a lot, to the point
that it gets almost as fast as defragging any other filesystem. What
defrag needs to do is write a big bunch of defragged file (data)
extents to the disk, and then update the b-trees. What happens is that
many of the updates to the b-trees fall into the same disk
sector/extent, so instead of many writes there will be just one write.

Here is the general outline for the implementation (sketched in code below):
     - write a big bunch of defragged file extents to disk
         - perform the minimal set of b-tree updates that cannot be
delayed (this is nothing or almost nothing in most circumstances)
         - put the rest of the required b-tree updates into a "pending
operations buffer"
     - analyze the "pending operations buffer", and find
(approximately) the biggest part of it that can be flushed out with a
minimal number of disk writes
         - flush out that part of the "pending operations buffer"
     - repeat
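Just to sketch that loop (Python pseudocode; every object and method
here is made up, the point is only the batching of metadata updates):

    def defrag_pass(files, fs):
        # One pass of the outline above; "fs" is a hypothetical handle.
        pending = []                          # deferred b-tree updates
        for f in files:
            new_extents = fs.write_defragged_data(f)   # bulk data writes
            urgent, deferrable = fs.plan_tree_updates(f, new_extents)
            fs.apply(urgent)      # the few updates that cannot be delayed
            pending.extend(deferrable)

        while pending:
            # Flush the largest subset that costs the fewest disk writes;
            # updates landing in the same tree block collapse into one write.
            batch = fs.cheapest_flushable_subset(pending)
            fs.apply(batch)
            pending = [op for op in pending if op not in batch]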

>> The point is that the defrag can keep a buffer of a "pending  
>> operations". Pending operations are those that should be performed  
>> in order to keep the original sharing structure. If the defrag gets  
>> interrupted, then files in "pending operations" will be unshared.  
>> But this should really be some important and urgent interrupt, as  
>> the "pending operations" buffer needs at most a second or two to  
>> complete its operations.

> Depending on the exact situation, it can take well more than a few  
> seconds to complete stuff. Especially if there are lots of reflinks.

Nope. You are quite wrong there.
In the worst case, the "pending operations buffer" will update (write  
to disk) all the b-trees. So, the upper limit on time to flush the  
"pending operations buffer" equals the time to write the entire b-tree  
structure to the disk (into new extents). I estimate that takes at  
most a few seconds.

>>> * Reflinks can reference partial extents.  This means, ultimately,  
>>> that you may end up having to split extents in odd ways during  
>>> defrag if you want to preserve reflinks, and might have to split  
>>> extents _elsewhere_ that are only tangentially related to the  
>>> region being defragmented. See the example in my previous email  
>>> for a case like this, maintaining the shared regions as being  
>>> shared when you defragment either file to a single extent will  
>>> require splitting extents in the other file (in either case,  
>>> whichever file you don't defragment to a single extent will end up  
>>> having 7 extents if you try to force the one that's been  
>>> defragmented to be the canonical version).  Once you consider that  
>>> a given extent can have multiple ranges reflinked from multiple  
>>> other locations, it gets even more complicated.
>>
>> I think that this problem can be solved, and that it can be solved  
>> perfectly (the result is a perfectly-defragmented file). But, if it  
>> is so hard to do, just skip those problematic extents in initial  
>> version of defrag.
>>
>> Ultimately, in the super-duper defrag, those partially-referenced  
>> extents should be split up by defrag.
>>
>>> * If you choose to just not handle the above point by not letting  
>>> defrag split extents, you put a hard lower limit on the amount of  
>>> fragmentation present in a file if you want to preserve reflinks.   
>>> IOW, you can't defragment files past a certain point.  If we go  
>>> this way, neither of the two files in the example from my previous  
>>> email could be defragmented any further than they already are,  
>>> because doing so would require splitting extents.
>>
>> Oh, you're reading my thoughts. That's good.
>>
>> Initial implementation of defrag might be not-so-perfect. It would  
>> still be better than the current defrag.
>>
>> This is not a one-way street. Handling of partially-used extents  
>> can be improved in later versions.
>>
>>> * Determining all the reflinks to a given region of a given extent  
>>> is not a cheap operation, and the information may immediately be  
>>> stale (because an operation right after you fetch the info might  
>>> change things).  We could work around this by locking the extent  
>>> somehow, but doing so would be expensive because you would have to  
>>> hold the lock for the entire defrag operation.
>>
>> No. DO NOT LOCK TO RETRIEVE REFLINKS.
>>
>> Instead, you have to create a hook in every function that updates  
>> the reflink structure or extents (for exaple, write-to-file  
>> operation). So, when a reflink gets changed, the defrag is  
>> immediately notified about this. That way the defrag can keep its  
>> data about reflinks in-sync with the filesystem.

> This doesn't get around the fact that it's still an expensive  
> operation to enumerate all the reflinks for a given region of a file  
> or extent.

No, you are wrong.

In order to enumerate all the reflinks in a region, the defrag needs
to have another array, which is also kept in memory and in sync with
the filesystem. It is easiest to divide the disk into regions of equal
size, where each region is a few MB large. Let's call this array the
"regions-to-extents" array. This array doesn't need to be associative;
it is a plain array.
This in-memory array links each region of the disk to the extents
that are in that region. The array is initialized when defrag starts.

This array makes the operation of finding all extents in a region
extremely fast.
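A toy version of that array (plain Python; the region size and the
hook names are assumptions for illustration, not anything btrfs
exposes today):

    REGION_BYTES = 4 * 1024 * 1024    # each region covers a few MB of disk

    class RegionsToExtents:
        # Plain (non-associative) array indexed by region number; each
        # slot holds the ids of extents that start in that disk region.
        def __init__(self, device_bytes):
            slots = device_bytes // REGION_BYTES + 1
            self.regions = [set() for _ in range(slots)]

        def _idx(self, physical_offset):
            return physical_offset // REGION_BYTES

        def on_extent_created(self, extent_id, physical_offset):
            self.regions[self._idx(physical_offset)].add(extent_id)

        def on_extent_removed(self, extent_id, physical_offset):
            self.regions[self._idx(physical_offset)].discard(extent_id)

        def extents_in_region(self, physical_offset):
            return self.regions[self._idx(physical_offset)]

    r = RegionsToExtents(1024 ** 3)                    # 1 GiB toy device
    r.on_extent_created(7, physical_offset=40 * 1024 * 1024)
    print(r.extents_in_region(40 * 1024 * 1024))       # {7}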

> It also allows a very real possibility of a user functionally  
> delaying the defrag operation indefinitely (by triggering a  
> continuous stream of operations that would cause reflink changes for  
> a file being operated on by defrag) if not implemented very carefully.

Yes, if a user does something like that, the defrag can be paused or  
even aborted. That is normal.

There are many ways around this problem, but it really doesn't
matter; those are just details. The initial version of defrag can just
abort. More mature versions of defrag can handle this problem better.





^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 19:18                         ` webmaster
@ 2019-09-12 19:44                           ` Chris Murphy
  2019-09-12 21:34                             ` General Zed
  2019-09-12 19:54                           ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 111+ messages in thread
From: Chris Murphy @ 2019-09-12 19:44 UTC (permalink / raw)
  To: webmaster; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>
> It is normal and common for defrag operation to use some disk space
> while it is running. I estimate that a reasonable limit would be to
> use up to 1% of total partition size. So, if a partition size is 100
> GB, the defrag can use 1 GB. Lets call this "defrag operation space".

The simplest case of a file with no shared extents, the minimum free
space should be set to the potential maximum rewrite of the file, i.e.
100% of the file size. Since Btrfs is COW, the entire operation must
succeed or fail, no possibility of an ambiguous in between state, and
this does apply to defragment.

So if you're defragging a 10GiB file, you need 10GiB minimum free
space to COW those extents to a new, mostly contiguous, set of extents,
and then some extra free space to COW the metadata to point to these
new extents. Once that change is committed to stable media, then the
stale data and metadata extents can be released.

And this process is subject to the ENOSPC condition. That's really what
will tell you whether you have enough space; otherwise, your setup time
for a complete recursive volume defragment is going to be really long,
and it has some chance of reporting back that defragmentation isn't
possible even though most of it could be done.

Arguably the defragmenting strategy should differ depending on whether
nossd or ssd mount option is enabled. Massive fragmentation on SSD
does impact latency, but there are no locality concerns, so as long as
the file is defragmented into ~32MiB extents, I think it's fine.
Perhaps ideal would be erase block sized extents? Whereas on a hard
drive, locality matters as well.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 19:18                         ` webmaster
  2019-09-12 19:44                           ` Chris Murphy
@ 2019-09-12 19:54                           ` Austin S. Hemmelgarn
  2019-09-12 22:21                             ` General Zed
  2019-09-12 22:47                             ` General Zed
  1 sibling, 2 replies; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-12 19:54 UTC (permalink / raw)
  To: webmaster; +Cc: linux-btrfs

On 2019-09-12 15:18, webmaster@zedlx.com wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
>> On 2019-09-11 17:37, webmaster@zedlx.com wrote:
>>>
>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>
>>>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>>>
>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>
>>>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>>>
>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>
>>>
>>>>>> Given this, defrag isn't willfully unsharing anything, it's just a 
>>>>>> side-effect of how it works (since it's rewriting the block layout 
>>>>>> of the file in-place).
>>>>>
>>>>> The current defrag has to unshare because, as you said, it is
>>>>> unaware of the full reflink structure. If it doesn't know about
>>>>> all reflinks, it has to unshare; there is no way around that.
>>>>>
>>>>>> Now factor in that _any_ write will result in unsharing the region 
>>>>>> being written to, rounded to the nearest full filesystem block in 
>>>>>> both directions (this is mandatory, it's a side effect of the 
>>>>>> copy-on-write nature of BTRFS, and is why files that experience 
>>>>>> heavy internal rewrites get fragmented very heavily and very 
>>>>>> quickly on BTRFS).
>>>>>
>>>>> You mean: when defrag performs a write, the new data is unshared 
>>>>> because every write is unshared? Really?
>>>>>
>>>>> Consider there is an extent E55 shared by two files A and B. The 
>>>>> defrag has to move E55 to another location. In order to do that, 
>>>>> defrag creates a new extent E70. It makes it belong to file A by 
>>>>> changing the reflink of extent E55 in file A to point to E70.
>>>>>
>>>>> Now, to retain the original sharing structure, the defrag has to 
>>>>> change the reflink of extent E55 in file B to point to E70. You are 
>>>>> telling me this is not possible? Bullshit!
>>>>>
>>>>> Please explain to me how this 'defrag has to unshare' story of 
>>>>> yours isn't an intentional attempt to mislead me.
>>>
>>>> As mentioned in the previous email, we actually did have a (mostly) 
>>>> working reflink-aware defrag a few years back.  It got removed 
>>>> because it had serious performance issues.  Note that we're not 
>>>> talking a few seconds of extra time to defrag a full tree here, 
>>>> we're talking double-digit _minutes_ of extra time to defrag a 
>>>> moderate sized (low triple digit GB) subvolume with dozens of 
>>>> snapshots, _if you were lucky_ (if you weren't, you would be looking 
>>>> at potentially multiple _hours_ of runtime for the defrag).  The 
>>>> performance scaled inversely proportionate to the number of reflinks 
>>>> involved and the total amount of data in the subvolume being 
>>>> defragmented, and was pretty bad even in the case of only a couple 
>>>> of snapshots.
>>>
>>> You cannot ever make the worst program, because an even worse program 
>>> can be made by slowing down the original by a factor of 2.
>>> So, you had a badly implemented defrag. At least you got some 
>>> experience. Let's see what went wrong.
>>>
>>>> Ultimately, there are a couple of issues at play here:
>>>>
>>>> * Online defrag has to maintain consistency during operation.  The 
>>>> current implementation does this by rewriting the regions being 
>>>> defragmented (which causes them to become a single new extent (most 
>>>> of the time)), which avoids a whole lot of otherwise complicated 
>>>> logic required to make sure things happen correctly, and also means 
>>>> that only the file being operated on is impacted and only the parts 
>>>> being modified need to be protected against concurrent writes.  
>>>> Properly handling reflinks means that _every_ file that shares some 
>>>> part of an extent with the file being operated on needs to have the 
>>>> reflinked regions locked for the defrag operation, which has a huge 
>>>> impact on performance. Using your example, the update to E55 in both 
>>>> files A and B has to happen as part of the same commit, which can 
>>>> contain no other writes in that region of the file, otherwise you 
>>>> run the risk of losing writes to file B that occur while file A is 
>>>> being defragmented.
>>>
>>> Nah. I think there is a workaround. You can first (atomically) update 
>>> A, then whatever, then you can update B later. I know, your yelling 
>>> "what if E55 gets updated in B". Doesn't matter. The defrag continues 
>>> later by searching for reflink to E55 in B. Then it checks the data 
>>> contained in E55. If the data matches the E70, then it can safely 
>>> update the reflink in B. Or the defrag can just verify that neither 
>>> E55 nor E70 have been written to in the meantime. That means they 
>>> still have the same data.
> 
>> So, IOW, you don't care if the total space used by the data is 
>> instantaneously larger than what you started with?  That seems to be 
>> at odds with your previous statements, but OK, if we allow for that 
>> then this is indeed a non-issue.
> 
> It is normal and common for defrag operation to use some disk space 
> while it is running. I estimate that a reasonable limit would be to use 
> up to 1% of total partition size. So, if a partition size is 100 GB, the 
> defrag can use 1 GB. Lets call this "defrag operation space".
> 
> The defrag should, when started, verify that there is "sufficient free 
> space" on the partition. In the case that there is no sufficient free 
> space, the defrag should output the message to the user and abort. The 
> size of "sufficient free space" must be larger than the "defrag 
> operation space". I would estimate that a good limit would be 2% of the 
> partition size. "defrag operation space" is a part of "sufficient free 
> space" while defrag operation is in progress.
> 
> If, during defrag operation, sufficient free space drops below 2%, the 
> defrag should output a message and abort. Another possibility is for 
> defrag to pause until the user frees some disk space, but this is not 
> common in other defrag implementations AFAIK.
> 
>>>> It's not horrible when it's just a small region in two files, but it 
>>>> becomes a big issue when dealing with lots of files and/or 
>>>> particularly large extents (extents in BTRFS can get into the GB 
>>>> range in terms of size when dealing with really big files).
>>>
>>> You must just split large extents in a smart way. So, in the 
>>> beginning, the defrag can split large extents (2GB) into smaller ones 
>>> (32MB) to facilitate more responsive and easier defrag.
>>>
>>> If you have lots of files, update them one-by one. It is possible. Or 
>>> you can update in big batches. Whatever is faster.
> 
>> Neither will solve this though.  Large numbers of files are an issue 
>> because the operation is expensive and has to be done on each file, 
>> not because the number of files somehow makes the operation more 
>> expensive. It's O(n) relative to files, not higher time complexity.
> 
> I would say that updating in big batches helps a lot, to the point that 
> it gets almost as fast as defragging any other file system. What defrag 
> needs to do is to write a big bunch of defragged file (data) extents to 
> the disk, and then update the b-trees. What happens is that many of the 
> updates to the b-trees would fall into the same disk sector/extent, so 
> instead of many writes there will be just one write.
> 
> Here is the general outline for implementation:
>      - write a big bunch of defragged file extents to disk
>          - a minimal set of updates of the b-trees that cannot be 
> delayed is performed (this is nothing or almost nothing in most 
> circumstances)
>          - put the rest of required updates of b-trees into "pending 
> operations buffer"
>      - analyze the "pending operations buffer", and find out 
> (approximately) the biggest part of it that can be flushed out by doing 
> minimal number of disk writes
>          - flush out that part of "pending operations buffer"
>      - repeat
It helps, but you still can't get around having to recompute the new 
tree state, and that is going to take time proportionate to the number 
of nodes that need to change, which in turn is proportionate to the 
number of files.
> 
>>> The point is that the defrag can keep a buffer of a "pending 
>>> operations". Pending operations are those that should be performed in 
>>> order to keep the original sharing structure. If the defrag gets 
>>> interrupted, then files in "pending operations" will be unshared. But 
>>> this should really be some important and urgent interrupt, as the 
>>> "pending operations" buffer needs at most a second or two to complete 
>>> its operations.
> 
>> Depending on the exact situation, it can take well more than a few 
>> seconds to complete stuff. Especially if there are lots of reflinks.
> 
> Nope. You are quite wrong there.
> In the worst case, the "pending operations buffer" will update (write to 
> disk) all the b-trees. So, the upper limit on time to flush the "pending 
> operations buffer" equals the time to write the entire b-tree structure 
> to the disk (into new extents). I estimate that takes at most a few 
> seconds.
So what you're talking about is journaling the computed state of defrag 
operations.  That shouldn't be too bad (as long as it's done in memory 
instead of on-disk) if you batch the computations properly.  I thought 
you meant having a buffer of what operations to do, and then computing 
them on-the-fly (which would have significant overhead).
> 
>>>> * Reflinks can reference partial extents.  This means, ultimately, 
>>>> that you may end up having to split extents in odd ways during 
>>>> defrag if you want to preserve reflinks, and might have to split 
>>>> extents _elsewhere_ that are only tangentially related to the region 
>>>> being defragmented. See the example in my previous email for a case 
>>>> like this, maintaining the shared regions as being shared when you 
>>>> defragment either file to a single extent will require splitting 
>>>> extents in the other file (in either case, whichever file you don't 
>>>> defragment to a single extent will end up having 7 extents if you 
>>>> try to force the one that's been defragmented to be the canonical 
>>>> version).  Once you consider that a given extent can have multiple 
>>>> ranges reflinked from multiple other locations, it gets even more 
>>>> complicated.
>>>
>>> I think that this problem can be solved, and that it can be solved 
>>> perfectly (the result is a perfectly-defragmented file). But, if it 
>>> is so hard to do, just skip those problematic extents in initial 
>>> version of defrag.
>>>
>>> Ultimately, in the super-duper defrag, those partially-referenced 
>>> extents should be split up by defrag.
>>>
>>>> * If you choose to just not handle the above point by not letting 
>>>> defrag split extents, you put a hard lower limit on the amount of 
>>>> fragmentation present in a file if you want to preserve reflinks.  
>>>> IOW, you can't defragment files past a certain point.  If we go this 
>>>> way, neither of the two files in the example from my previous email 
>>>> could be defragmented any further than they already are, because 
>>>> doing so would require splitting extents.
>>>
>>> Oh, you're reading my thoughts. That's good.
>>>
>>> Initial implementation of defrag might be not-so-perfect. It would 
>>> still be better than the current defrag.
>>>
>>> This is not a one-way street. Handling of partially-used extents can 
>>> be improved in later versions.
>>>
>>>> * Determining all the reflinks to a given region of a given extent 
>>>> is not a cheap operation, and the information may immediately be 
>>>> stale (because an operation right after you fetch the info might 
>>>> change things).  We could work around this by locking the extent 
>>>> somehow, but doing so would be expensive because you would have to 
>>>> hold the lock for the entire defrag operation.
>>>
>>> No. DO NOT LOCK TO RETRIEVE REFLINKS.
>>>
>>> Instead, you have to create a hook in every function that updates the 
>>> reflink structure or extents (for example, write-to-file operation).
>>> So, when a reflink gets changed, the defrag is immediately notified 
>>> about this. That way the defrag can keep its data about reflinks 
>>> in-sync with the filesystem.
> 
>> This doesn't get around the fact that it's still an expensive 
>> operation to enumerate all the reflinks for a given region of a file 
>> or extent.
> 
> No, you are wrong.
> 
> In order to enumerate all the reflinks in a region, the defrag needs to 
> have another array, which is also kept in memory and in sync with the 
> filesystem. The easiest approach is to divide the disk into regions of 
> equal size, where each region is a few MB large. Let's call this the 
> "regions-to-extents" array. This array doesn't need to be associative; 
> it is a plain array.
> This in-memory array links regions of the disk to the extents that lie 
> in each region. The array is initialized when defrag starts.
> 
> This array makes the operation of finding all extents of a region 
> extremely fast.
That has two issues:

* That's going to be a _lot_ of memory.  You still need to be able to 
defragment big (dozens plus TB) arrays without needing multiple GB of 
RAM just for the defrag operation, otherwise it's not realistically 
useful (remember, it was big arrays that had issues with the old 
reflink-aware defrag too).
* You still have to populate the array in the first place.  A sane 
implementation wouldn't be keeping it in memory even when defrag is not 
running (no way is anybody going to tolerate even dozens of MB of memory 
overhead for this), so you're not going to get around the need to 
enumerate all the reflinks for a file at least once (during startup, or 
when starting to process that file), so you're just moving the overhead 
around instead of eliminating it.
> 
>> It also allows a very real possibility of a user functionally delaying 
>> the defrag operation indefinitely (by triggering a continuous stream 
>> of operations that would cause reflink changes for a file being 
>> operated on by defrag) if not implemented very carefully.
> 
> Yes, if a user does something like that, the defrag can be paused or 
> even aborted. That is normal.
Not really.  Most defrag implementations either avoid files that could 
reasonably be written to, or freeze writes to the file they're operating 
on, or in some other way just sidestep the issue without delaying the 
defragmentation process.
> 
> There are many ways around this problem, but it really doesn't matter, 
> those are just details. The initial version of defrag can just abort. 
> The more mature versions of defrag can have a better handling of this 
> problem.
Details like this are the deciding factor for whether something is 
sanely usable in certain use cases, as you have yourself found out (for 
a lot of users, the fact that defrag can unshare extents is 'just a 
detail' that's not worth worrying about).

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12  5:19                       ` Zygo Blaxell
@ 2019-09-12 21:23                         ` General Zed
  2019-09-14  4:12                           ` Zygo Blaxell
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-12 21:23 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Austin S. Hemmelgarn, linux-btrfs


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Wed, Sep 11, 2019 at 07:21:31PM -0400, webmaster@zedlx.com wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Wed, Sep 11, 2019 at 01:20:53PM -0400, webmaster@zedlx.com wrote:
>> > >
>> > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>> > >
>> > > > On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>> > > > >
>> > > > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>> > > > >
>> > >
>> > > > >
>> > > > > === I CHALLENGE you and anyone else on this mailing list: ===
>> > > > >
>> > > > >  - Show me an exaple where splitting an extent requires unsharing,
>> > > > > and this split is needed to defrag.
>> > > > >
>> > > > > Make it clear, write it yourself, I don't want any  
>> machine-made outputs.
>> > > > >
>> > > > Start with the above comment about all writes unsharing the  
>> region being
>> > > > written to.
>> > > >
>> > > > Now, extrapolating from there:
>> > > >
>> > > > Assume you have two files, A and B, each consisting of 64 filesystem
>> > > > blocks in single shared extent.  Now assume somebody writes a  
>> few bytes
>> > > > to the middle of file B, right around the boundary between  
>> blocks 31 and
>> > > > 32, and that you get similar writes to file A straddling blocks 14-15
>> > > > and 47-48.
>> > > >
>> > > > After all of that, file A will be 5 extents:
>> > > >
>> > > > * A reflink to blocks 0-13 of the original extent.
>> > > > * A single isolated extent consisting of the new blocks 14-15
>> > > > * A reflink to blocks 16-46 of the original extent.
>> > > > * A single isolated extent consisting of the new blocks 47-48
>> > > > * A reflink to blocks 49-63 of the original extent.
>> > > >
>> > > > And file B will be 3 extents:
>> > > >
>> > > > * A reflink to blocks 0-30 of the original extent.
>> > > > * A single isolated extent consisting of the new blocks 31-32.
>> > > > * A reflink to blocks 32-63 of the original extent.
>> > > >
>> > > > Note that there are a total of four contiguous sequences of  
>> blocks that
>> > > > are common between both files:
>> > > >
>> > > > * 0-13
>> > > > * 16-30
>> > > > * 32-46
>> > > > * 49-63
>> > > >
>> > > > There is no way to completely defragment either file without splitting
>> > > > the original extent (which is still there, just not fully  
>> referenced by
>> > > > either file) unless you rewrite the whole file to a new single extent
>> > > > (which would, of course, completely unshare the whole file).  In fact,
>> > > > if you want to ensure that those shared regions stay  
>> reflinked, there's
>> > > > no way to defragment either file without _increasing_ the number of
>> > > > extents in that file (either file would need 7 extents to  
>> properly share
>> > > > only those 4 regions), and even then only one of the files could be
>> > > > fully defragmented.
>> > > >
>> > > > Such a situation generally won't happen if you're just dealing with
>> > > > read-only snapshots, but is not unusual when dealing with  
>> regular files
>> > > > that are reflinked (which is not an uncommon situation on  
>> some systems,
>> > > > as a lot of people have `cp` aliased to reflink things whenever
>> > > > possible).
>> > >
>> > > Well, thank you very much for writing this example. Your example is
>> > > certainly not minimal, as it seems to me that one write to the  
>> file A and
>> > > one write to file B would be sufficient to prove your point, so there we
>> > > have one extra write in the example, but that's OK.
>> > >
>> > > Your example proves that I was wrong. I admit: it is impossible  
>> to perfectly
>> > > defrag one subvolume (in the way I imagined it should be done).
>> > > Why? Because, as in your example, there can be files within a SINGLE
>> > > subvolume which share their extents with each other. I didn't  
>> consider such
>> > > a case.
>> > >
>> > > On the other hand, I judge this issue to be mostly irrelevant.  
>> Why? Because
>> > > most of the file sharing will be between subvolumes, not within  
>> a subvolume.
>> > > When a user creates a reflink to a file in the same subvolume, he is
>> > > willingly denying himself the assurance of a perfect defrag. Because, as
>> > > your example proves, if there are a few writes to BOTH files, it gets
>> > > impossible to defrag perfectly. So, if the user creates such  
>> reflinks, it's
>> > > his own wish and his own fault.
>> > >
>> > > Such situations will occur only in some specific circumstances:
>> > > a) when the user is reflinking manually
>> > > b) when a file is copied from one subvolume into a different file in a
>> > > different subvolume.
>> > >
>> > > The situation a) is unusual in normal use of the filesystem.  
>> Even when it
>> > > occurs, it is the explicit command given by the user, so he should be
>> > > willing to accept all the consequences, even the bad ones like imperfect
>> > > defrag.
>> > >
>> > > The situation b) is possible, but as far as I know copies are  
>> currently not
>> > > done that way in btrfs. There should probably be the option to  
>> reflink-copy
>> > > files from another subvolume, that would be good.
>> >
>> > Reflink copies across subvolumes have been working for years.  They are
>> > an important component that makes dedupe work when snapshots are present.
>>
>> I take that what you say is true, but what I said is that when a user (or
>> application) makes a
>> normal copy from one subvolume to another, then it won't be a reflink-copy.
>> To make such a reflink-copy, you need btrfs-aware cp or btrfs-aware
>> applications.
>
> It's the default for GNU coreutils, and for 'mv' across subvols there
> is currently no option to turn reflink copies off.  Maybe for 'cp'
> you still have to explicitly request reflink, but that will presumably
> change at some point as more filesystems get the CLONE_RANGE ioctl and
> more users expect it to just work by default.

Yes, thank you for posting another batch of arguments that support the  
use of my vision of defrag instead of the current one.

The defrag that I'm proposing will preserve all those reflinks that  
were painstakingly created by the user. Therefore, I take it that you
agree with me on the utmost importance of implementing this new defrag  
that I'm proposing.

>> So, the reflink-copy is a special case, usually explicitly requested by the
>> user.
>>
>> > > But anyway, it doesn't matter. Because most of the sharing will  
>> be between
>> > > subvolumes, not within subvolume.
>> >
>> > Heh.  I'd like you to meet one of my medium-sized filesystems:
>> >
>> > 	Physical size:  8TB
>> > 	Logical size:  16TB
>> > 	Average references per extent:  2.03 (not counting snapshots)
>> > 	Workload:  CI build server, VM host
>> >
>> > That's a filesystem where over half of the logical data is reflinks to the
>> > other physical data, and 94% of that data is in a single subvol.  7.5TB of
>> > data is unique, the remaining 500GB is referenced an average of 17 times.
>> >
>> > We use ordinary applications to make ordinary copies of files, and do
>> > tarball unpacks and source checkouts with reckless abandon, all day long.
>> > Dedupe turns the copies into reflinks as we go, so every copy becomes
>> > a reflink no matter how it was created.
>> >
>> > For the VM filesystem image files, it's not uncommon to see a high
>> > reflink rate within a single file as well as reflinks to other files
>> > (like the binary files in the build directories that the VM images are
>> > constructed from).  Those reference counts can go into the millions.
>>
>> OK, but that cannot be helped: either you retain the sharing structure with
>> imperfect defrag, or you unshare and produce a perfect defrag which should
>> have somewhat better performance (and pray that the disk doesn't fill up).
>
> One detail that might not have been covered in the rest of the discussion
> is that btrfs extents are immutable.  e.g. if you create a 128MB file,
> then truncate it to 4K, it still takes 128MB on disk because that 4K
> block at the beginning holds a reference to the whole immutable 128MB
> extent.  This holds even if the extent is not shared.  The 127.996MB of
> unreferenced extent blocks are unreachable until the last 4K reference
> is removed from the filesystem (deleted or overwritten).
>
> There is no "split extent in place" operation on btrfs because the
> assumption that extents are immutable is baked into the code.  Changing
> this has been discussed a few times.  The gotcha is that some fairly
> ordinary write cases can trigger updates of all extent references if
> extents were split automatically as soon as blocks become unreachable.
> There's non-trivial CPU overhead to determine whether blocks are
> unreachable, making it unsuitable to run the garbage collector during
> ordinary writes.  So for now, the way to split an extent is to copy the
> data you want to keep from the extent into some new extents, then remove
> all references to the old extent.

There, I don't even need to write a solution. You are getting better.

I suggest that btrfs should first try to determine whether it can
split an extent in place or not. If it can't do that, then it should
create new extents to split the old one.

> To maintain btrfs storage efficiency you have to do a *garbage collection*
> operation that there is currently no tool for.  This tool would find
> unreachable blocks and split extents around them, releasing the entire
> extent (including unreachable blocks) back to the filesystem.  Defrag is
> sometimes used for its accidental garbage collection side-effects (it
> will copy all the remaining reachable blocks to new extents and reduce
> total space usage if the extents were not shared), but the current defrag
> is not designed for GC, and doesn't do a good job.

Yes, you have to determine which blocks are unused, and to do that you  
have to analyze all the b-trees. Only the defrag can do this. And, it  
should do this.

Notice that doing this garbage collection gets much easier when the
defrag has created, in memory, the two arrays I described in another
part of this discussion, namely:
  - the "extent-backref" associative array
  - the "region-extents" plain array

Since defrag is supposed to have these two arrays always in memory,
always valid, and in sync, doing this "garbage collection" becomes
quite easy, even trivial.

Therefore, the defrag can free the unused parts of any extent, and then
the extent can be split if necessary. In fact, both of these operations
can be done simultaneously.
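
To illustrate why this becomes almost trivial once the arrays exist, here
is a small sketch of the core computation: given the still-referenced
sub-ranges of one extent (as the "extent-backref" array would provide
them), find the unreferenced gaps that garbage collection could free.
This is illustrative only; btrfs has no such in-memory structure today.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct range { uint64_t off, len; };      /* offsets within one extent */

static int cmp_off(const void *a, const void *b)
{
        const struct range *x = a, *y = b;
        return (x->off > y->off) - (x->off < y->off);
}

/* refs[] = sub-ranges of the extent that are still referenced by some
 * file (as the backref array would report them).  Prints the gaps, i.e.
 * the garbage that splitting the extent could free. */
static void find_garbage(uint64_t extent_len, struct range *refs, size_t n)
{
        uint64_t cursor = 0;

        qsort(refs, n, sizeof(*refs), cmp_off);
        for (size_t i = 0; i < n; i++) {
                if (refs[i].off > cursor)
                        printf("garbage: [%llu, %llu)\n",
                               (unsigned long long)cursor,
                               (unsigned long long)refs[i].off);
                if (refs[i].off + refs[i].len > cursor)
                        cursor = refs[i].off + refs[i].len;
        }
        if (cursor < extent_len)
                printf("garbage: [%llu, %llu)\n",
                       (unsigned long long)cursor,
                       (unsigned long long)extent_len);
}

int main(void)
{
        /* A 128 KiB extent of which only the first 16 KiB and 4 KiB at
         * offset 64 KiB are still referenced. */
        struct range refs[] = { { 0, 16384 }, { 65536, 4096 } };

        find_garbage(131072, refs, 2);
        return 0;
}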

> Dedupe on btrfs also requires the ability to split and merge extents;
> otherwise, we can't dedupe an extent that contains a combination of
> unique and duplicate data.  If we try to just move references around
> without splitting extents into all-duplicate and all-unique extents,
> the duplicate blocks become unreachable, but are not deallocated.  If we
> only split extents, fragmentation overhead gets bad.  Before creating
> thousands of references to an extent, it is worthwhile to merge it with
> as many of its neighbors as possible, ideally by picking the biggest
> existing garbage-free extents available so we don't have to do defrag.
> As we examine each extent in the filesystem, it may be best to send
> to defrag, dedupe, or garbage collection--sometimes more than one of
> those.

This is solved simply by always running defrag before dedupe.

> As extents get bigger, the seeking and overhead to read them gets smaller.
> I'd want to defrag many consecutive 4K extents, but I wouldn't bother
> touching 256K extents unless they were in high-traffic files, nor would I
> bother combining only 2 or 3 4K extents together (there would be around
> 400K of metadata IO overhead to do so--likely more than what is saved
> unless the file is very frequently read sequentially).  The incremental
> gains are inversely proportional to size, while the defrag cost is
> directly proportional to size.

"the defrag cost is directly proportional to size" - this is wrong.  
The defrag cost is proportional to file size, not to extent size.

Before a file is defragmented, the defrag should split its extents so  
that each one is sufficiently small, let's say 32 MB at most. That  
fixes the issue. This was mentioned in my answer to Austin S.  
Hemmelgarn.

Then, as the final stage of the defrag, the extents should be merged
into bigger ones of the desired size.
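
A trivial sketch of that first pass, under the assumption of a fixed
32 MiB target piece size (real code would of course have to respect block
boundaries and the extent tree; this only shows the arithmetic):

#include <stdint.h>
#include <stdio.h>

#define MAX_PIECE (32ULL * 1024 * 1024)   /* assumed target piece size: 32 MiB */

/* Print the pieces a single large extent would be split into before the
 * file is defragmented; the final pass would merge neighbours back into
 * large contiguous extents. */
static void plan_splits(uint64_t extent_start, uint64_t extent_len)
{
        for (uint64_t off = 0; off < extent_len; off += MAX_PIECE) {
                uint64_t piece = extent_len - off < MAX_PIECE
                               ? extent_len - off : MAX_PIECE;

                printf("piece at %llu, len %llu\n",
                       (unsigned long long)(extent_start + off),
                       (unsigned long long)piece);
        }
}

int main(void)
{
        plan_splits(0, 2ULL * 1024 * 1024 * 1024);   /* one 2 GiB extent */
        return 0;
}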

>> > > So, if there is some in-subvolume sharing,
>> > > the defrag wont be 100% perfect, that a minor point. Unimportant.
>> >
>> > It's not unimportant; however, the implementation does have to take this
>> > into account, and make sure that defrag can efficiently skip extents that
>> > are too expensive to relocate.  If we plan to read an extent fewer than
>> > 100 times, it makes no sense to update 20000 references to it--we spend
>> > less total time just doing the 100 slower reads.
>>
>> Not necesarily. Because you can defrag in the time-of-day when there is a
>> low pressure on the disk IO, so updating 20000 references is esentially
>> free.
>>
>> You are just making those later 100 reads faster.
>>
>> OK, you are right, there is some limit, but this is such a rare case, that
>> such a heavily-referenced extents are best left untouched.
>
> For you it's rare.  For other users, it's expected behavior--dedupe is
> just a thing many different filesystems do now.

This is a tiny detail not worthy of consideration at this stage of  
planning. It can be solved.

> Also, quite a few heavily-fragmented files only ever get read once,
> or are read only during low-cost IO times (e.g. log files during
> maintenance windows).  For those, a defrag is pure wasted iops.

You don't know that because you can't predict the future. Therefore,  
defrag is never a waste because the future is unknown.

>> I suggest something along these lines: if there are more than XX (where XX
>> defaults to 1000) reflinks to an extent, then one or more copies of the
>> extent should be made such that each has less than XX reflinks to it. The
>> number XX should be user-configurable.
>
> That would depend on where those extents are, how big the references
> are, etc.  In some cases a million references are fine, in other cases
> even 20 is too many.  After doing the analysis for each extent in a
> filesystem, you'll probably get an average around some number, but
> setting the number first is putting the cart before the horse.

This is a tiny detail not worthy of consideration at this stage of  
planning. It can be solved.

>> > If the numbers are
>> > reversed then it's better to defrag the extent--100 reference updates
>> > are easily outweighed by 20000 faster reads.  The kernel doesn't have
>> > enough information to make good decisions about this.
>>
>> So, just make the number XX user-provided.
>>
>> > Dedupe has a similar problem--it's rarely worth doing a GB of IO to
>> > save 4K of space, so in practical implementations, a lot of duplicate
>> > blocks have to remain duplicate.
>> >
>> > There are some ways to make the kernel dedupe and defrag API process
>> > each reference a little more efficiently, but none will get around this
>> > basic physical problem:  some extents are just better off where they are.
>>
>> OK. If you don't touch those extents, they are still shared. That's what I
>> wanted.
>>
>> > Userspace has access to some extra data from the user, e.g.  "which
>> > snapshots should have their references excluded from defrag because
>> > the entire snapshot will be deleted in a few minutes."  That will allow
>> > better defrag cost-benefit decisions than any in-kernel implementation
>> > can make by itself.
>>
>> Yes, but I think that we are going into too much details which are diverting
>> the attention from the overall picture and from big problems.
>>
>> And the big problem here is: what do we want defrag to do in general, most
>> common cases. Because we haven't still agreed on that one since many of the
>> people here are ardent followers of the defrag-by-unsharing ideology.
>
> Well, that's the current implementation, so most people are familiar with it.
>
> I have a userspace library that lets applications work in terms of extents
> and their sequential connections to logical neighbors, which it extracts
> from the reference data in the filesystem trees.  The caller says "move
> extent A50..A70, B10..B20, and C90..C110 to a new physically contiguous
> location" and the library translates that into calls to the current kernel
> API to update all the references to those extents, at least one of which
> is a single contiguous reference to the entire new extent.  To defrag,
> it walks over extent adjacency DAGs and connects short extents together
> into longer ones, and if it notices opportunities to get rid of extents
> entirely by dedupe then it does that instead of defrag.  Extents with
> unreachable blocks are garbage collected.  If I ever get it done, I'll
> propose the kernel provide the library's top-level interface directly,
> then drop the userspace emulation layer when that kernel API appears.

Actually, many of the problems that you wrote about so far in this
thread are not problems in my imagined implementation of defrag, which
solves them all. The problems you wrote about are mostly problems
of this implementation/library of yours.

So yes, you can do things the way your library does, but that is
inferior to a real defrag.

>> > 'btrfs fi defrag' is just one possible userspace implementation, which
>> > implements the "throw entire files at the legacy kernel defrag API one
>> > at a time" algorithm.  Unfortunately, nobody seems to have implemented
>> > any other algorithms yet, other than a few toy proof-of-concept demos.
>>
>> I really don't have a clue what's happening, but if I were to start working
>> on it (which I won't), then the first things should be:
>
>> - creating a way for btrfs to split large extents into smaller ones (for
>> easier defrag, as first phase).
>
>> - creating a way for btrfs to merge small adjanced extents shared by the
>> same files into larger extents (as the last phase of defragmenting a file).
>
> Those were the primitives I found useful as well.  Well, primitive,
> singular--in the absence of the ability to split an extent in place,
> the same function can do both split and merge operations.  Input is a
> list of block ranges, output is one extent containing data from those
> block ranges, side effect is to replace all the references to the old
> data with references to the new data.

Great, so there already exists an implementation, or at least a similar one.

Now, that split and merge just have to be moved into the kernel.
  - I would keep merge and split as separate operations.
  - If a split cannot be performed due to the problems you mention, then
it should just return and do nothing. Same with merge.

Eventually, when a real defrag starts to be written, those two (split
and merge) can be updated to make use of the "extent-backref" associative
array and the "region-extents" plain array, so that they can be performed
more efficiently and so that they always succeed.
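
Sketched as kernel-style prototypes, the two primitives I have in mind
would look roughly like the following. These declarations are hypothetical;
neither function exists in btrfs, and the types are opaque placeholders.

#include <stdint.h>

struct btrfs_fs;          /* opaque placeholders for this sketch */
struct btrfs_extent;

/*
 * Split @extent at the given offsets (relative to the extent start).
 * Best effort: if the split cannot be performed safely, it changes
 * nothing and returns a negative error code.
 */
int extent_split(struct btrfs_fs *fs, struct btrfs_extent *extent,
                 const uint64_t *offsets, unsigned int nr_offsets);

/*
 * Merge physically adjacent extents that are referenced consecutively by
 * the same file into one extent.  Best effort: if the merge cannot be
 * done without breaking sharing, it changes nothing and returns a
 * negative error code.
 */
int extent_merge(struct btrfs_fs *fs, struct btrfs_extent **extents,
                 unsigned int nr_extents);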

> Add an operation to replace references to data at extent A with
> references to data at extent B, and an operation to query extent reference
> structure efficiently, and you have all the ingredients of an integrated
> dedupe/defrag/garbage collection tool for btrfs (analogous to the XFS
> "fsr" process).

Obviously, some very useful code. That is good, but perhaps it would
be better for that code to serve as an example of how it can be done.
In my imagined defrag, this updating-of-references happens as part of  
flushing the "pending operations buffer", so it will have to be  
rewritten such that it fits into that framework.

The problem with your defrag is that it is not holistic enough. It has a
view of only small parts of the filesystem, so it can never be as good  
as a real defrag, which also doesn't unshare extents.

>> - create a structure (associative array) for defrag that can track
>> backlinks. Keep the structure updated with each filesystem change, by
>> placing hooks in filesystem-update routines.
>
> No need.  btrfs already maintains backrefs--they power the online  
> device shrink
> feature and the LOGICAL_TO_INO ioctl.

Another person said that it is complicated to trace back-references;
now you are saying that it is not.
Anyway, such a structure must be available to defrag.
So, just to avoid misunderstandings: this "extent-backrefs"
associative array would be in memory, it would cover all extents of the
entire filesystem, and it would be kept in sync with the
filesystem during the defrag operation.
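
To avoid any further misunderstanding, here is roughly what I mean by
that array, sketched as a simple hash table in userspace C. Everything
here (names, key choice, the hook that calls backref_add) is my own
illustration, not existing btrfs code.

#include <stdint.h>
#include <stdlib.h>

#define NR_BUCKETS 4096

/* One back-reference: which file, in which tree, references the extent. */
struct backref {
        uint64_t root, inode, file_off;
        struct backref *next;
};

/* Hash entry for one extent, keyed by its physical start address. */
struct backref_entry {
        uint64_t extent_start;
        struct backref *refs;
        struct backref_entry *next;       /* hash chain */
};

static struct backref_entry *buckets[NR_BUCKETS];

static struct backref_entry *lookup(uint64_t extent_start, int create)
{
        size_t b = (extent_start / 4096) % NR_BUCKETS;

        for (struct backref_entry *e = buckets[b]; e; e = e->next)
                if (e->extent_start == extent_start)
                        return e;
        if (!create)
                return NULL;

        struct backref_entry *e = calloc(1, sizeof(*e));
        e->extent_start = extent_start;
        e->next = buckets[b];
        buckets[b] = e;
        return e;
}

/* Called from the (hypothetical) hook that fires whenever a reflink to
 * an extent is created, so the array stays in sync with the filesystem. */
static void backref_add(uint64_t extent_start, uint64_t root,
                        uint64_t inode, uint64_t file_off)
{
        struct backref_entry *e = lookup(extent_start, 1);
        struct backref *r = calloc(1, sizeof(*r));

        r->root = root;
        r->inode = inode;
        r->file_off = file_off;
        r->next = e->refs;
        e->refs = r;
}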

>> You can't go wrong with this. Whatever details change about defrag
>> operation, the given three things will be needed by defrag.
>
> I agree about 70% with what you said.  ;)

Ok, thanks, finally someone agrees with me, at least 70%. I feel like  
I'm on the shooting range here carrying a target on my back and  
running around.

>> > > Now, to retain the original sharing structure, the defrag has  
>> to change the
>> > > reflink of extent E55 in file B to point to E70. You are  
>> telling me this is
>> > > not possible? Bullshit!
>> >
>> > This is already possible today and userspace tools can do it--not as
>> > efficiently as possible, but without requiring more than 128M of temporary
>> > space.  'btrfs fi defrag' is not one of those tools.
>> >
>> > > Please explain to me how this 'defrag has to unshare' story of  
>> yours isn't
>> > > an intentional attempt to mislead me.
>> >
>> > Austin is talking about the btrfs we have, not the btrfs we want.
>>
>> OK, but then, you agree with me that current defrag is a joke. I mean,
>> something is better than nothing, and the current defrag isn't completely
>> useless, but it is in most circumstances either unusable or not good enough.
>
> I agree that the current defrag is almost useless.  Where we might
> disagree is that I don't think the current defrag can ever be useful,
> even if it did update all the references simultaneously.
>
>> I mean, the snapshots are a prime feature of btrfs. If not, then why bother
>> with b-trees? If you wanted subvolumes, checksums and RAID, then you should
>> have made ext5. B-trees are in btrfs so that there can be snapshots. But,
>> the current defrag works bad with snaphots. It doesn't defrag them well, it
>> also unshares data. Bad bad bad.
>>
>> And if you wanted to be honest to your users, why don't you place this info
>> in the wiki? Ok, the wiki says "defrag will unshare", but it doesn't say
>> that it also doesn't defragment well.
>
> "Well" is the first thing users have trouble with.  They generally
> overvalue defragmentation because they don't consider the costs.
>
> There's a lower bound _and_ an upper bound on how big fragments can
> usefully be.  The lower bound is somewhere around 100K (below that size
> there is egregious seek and metadata overhead), and the upper bound is
> a few MB at most, with a few special exceptions like torrent downloads
> (where a large file is written in random order, then rearranged into
> contiguous extents and never written again).  If you strictly try to
> minimize the number of fragments, you can make a lot of disk space
> unreachable without garbage collection.  Up to 32768 times the logical
> file size can theoretically be wasted, though 25-200% is more common
> in practice.  The bigger your target extent size, the more frequently
> you have to defrag to maintain disk space usage at reasonable levels,
> and the less benefit you get from each defrag run.
>
>> For example, lets examine the typical home user. If he is using btrfs, it
>> means he probably wants snapshots of his data. And, after a few snapshots,
>> his data is fragmented, and the current defrag can't help because it does a
>> terrible job in this particualr case.
>
> I'm not sure I follow.  Snapshots don't cause data fragmentation.

I disagree. I might explain my thoughts in another place (this post is
getting too long).

> Discontiguous writes do (unless the filesystem is very full and the
> only available free areas are small--see below).  The data will be
> fragmented or not depending on the write pattern, not the presence or
> absence of snapshots.

> On the other hand, if snapshots are present, then
> (overly aggressive) defrag will consume a large amount of disk space.
> Maybe that's what you're getting at.

Yes. But any defrag which is based on unsharing will fail in such a
situation, in one way or another. The arguments for the current defrag
always mention the ways in which it won't fail, forgetting to mention
that it will ALWAYS fail in at least one, sometimes disastrous, way.
But always a different one. Really, it is a dishonest debate.

> If the filesystem is very full and all free space areas are small,
> you want to use balance to move all the data closer together so that
> free space areas get bigger.  Defrag is the wrong tool for this.
> Balance handles shared extent references just fine.
>
>> So why don't you write on the wiki "the defrag is practically unusable in
>> case you use snapshots". Because that is the truth. Be honest.
>
> It depends on the size of the defragmented data.  If the only thing
> you need to defrag is systemd journal files and the $HOME/.thunderbird
> directory, then you can probably afford to store a few extra copies of
> those on the disk.  It won't be possible to usefully defrag files like
> those while they have shared extents anyway--too many small overlaps.
>
> If you have a filesystem full of big VM image files then there's no
> good solution yet.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 19:44                           ` Chris Murphy
@ 2019-09-12 21:34                             ` General Zed
  2019-09-12 22:28                               ` Chris Murphy
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-12 21:34 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Chris Murphy <lists@colorremedies.com>:

> On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>>
>> It is normal and common for defrag operation to use some disk space
>> while it is running. I estimate that a reasonable limit would be to
>> use up to 1% of total partition size. So, if a partition size is 100
>> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
>
> The simplest case of a file with no shared extents, the minimum free
> space should be set to the potential maximum rewrite of the file, i.e.
> 100% of the file size. Since Btrfs is COW, the entire operation must
> succeed or fail, no possibility of an ambiguous in between state, and
> this does apply to defragment.
>
> So if you're defragging a 10GiB file, you need 10GiB minimum free
>> space to COW those extents to a new, mostly contiguous, set of extents,

False.

You can defragment just 1 GB of that file at a time, and then write out
to disk (in new extents) an entire new version of the b-trees.
Of course, you don't really need to do all that, as usually only a
small part of the b-trees needs to be updated.

The only problem that can arise is if the original file is
entirely one 10 GB extent. In that case, that extent should be split
into smaller parts. If btrfs cannot do that splitting, mostly (but not
entirely) in place, then it is a fundamentally flawed design, but I
reckon that is not the case. You would have to be really stupid to
design a filesystem that way, if it is even possible to design a
filesystem with such a property.
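
Here is the idea in a few lines, with relocate_range() and flush_metadata()
as stand-ins for the work the kernel would do (they are not real btrfs
functions); the point is only that the outstanding new space is bounded
by the batch size.

#include <stdint.h>
#include <stdio.h>

#define BATCH (1ULL << 30)                       /* ~1 GB of working space */

/* Stand-ins for the real work; not btrfs functions. */
static void relocate_range(uint64_t off, uint64_t len)
{
        printf("relocating [%llu, %llu)\n",
               (unsigned long long)off, (unsigned long long)(off + len));
}

static void flush_metadata(void)
{
        printf("flushing b-tree updates, old extents now freeable\n");
}

static void defrag_file_in_batches(uint64_t file_size)
{
        for (uint64_t off = 0; off < file_size; off += BATCH) {
                uint64_t len = file_size - off < BATCH ? file_size - off : BATCH;

                relocate_range(off, len);   /* at most BATCH bytes of new data */
                flush_metadata();           /* commit references, release old  */
        }
}

int main(void)
{
        defrag_file_in_batches(10ULL << 30);     /* the 10 GiB file above */
        return 0;
}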



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 19:54                           ` Austin S. Hemmelgarn
@ 2019-09-12 22:21                             ` General Zed
  2019-09-13 11:53                               ` Austin S. Hemmelgarn
  2019-09-12 22:47                             ` General Zed
  1 sibling, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-12 22:21 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-12 15:18, webmaster@zedlx.com wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>>> On 2019-09-11 17:37, webmaster@zedlx.com wrote:
>>>>
>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>
>>>>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>>>>
>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>
>>>>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>>>>
>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>
>>>>
>>>>>>> Given this, defrag isn't willfully unsharing anything, it's  
>>>>>>> just a side-effect of how it works (since it's rewriting the  
>>>>>>> block layout of the file in-place).
>>>>>>
>>>>>> The current defrag has to unshare because, as you said, it is
>>>>>> unaware of the full reflink structure. If it doesn't know
>>>>>> about all reflinks, it has to unshare; there is no way around
>>>>>> that.
>>>>>>
>>>>>>> Now factor in that _any_ write will result in unsharing the  
>>>>>>> region being written to, rounded to the nearest full  
>>>>>>> filesystem block in both directions (this is mandatory, it's a  
>>>>>>> side effect of the copy-on-write nature of BTRFS, and is why  
>>>>>>> files that experience heavy internal rewrites get fragmented  
>>>>>>> very heavily and very quickly on BTRFS).
>>>>>>
>>>>>> You mean: when defrag performs a write, the new data is  
>>>>>> unshared because every write is unshared? Really?
>>>>>>
>>>>>> Consider there is an extent E55 shared by two files A and B.  
>>>>>> The defrag has to move E55 to another location. In order to do  
>>>>>> that, defrag creates a new extent E70. It makes it belong to  
>>>>>> file A by changing the reflink of extent E55 in file A to point  
>>>>>> to E70.
>>>>>>
>>>>>> Now, to retain the original sharing structure, the defrag has  
>>>>>> to change the reflink of extent E55 in file B to point to E70.  
>>>>>> You are telling me this is not possible? Bullshit!
>>>>>>
>>>>>> Please explain to me how this 'defrag has to unshare' story of  
>>>>>> yours isn't an intentional attempt to mislead me.
>>>>
>>>>> As mentioned in the previous email, we actually did have a  
>>>>> (mostly) working reflink-aware defrag a few years back.  It got  
>>>>> removed because it had serious performance issues.  Note that  
>>>>> we're not talking a few seconds of extra time to defrag a full  
>>>>> tree here, we're talking double-digit _minutes_ of extra time to  
>>>>> defrag a moderate sized (low triple digit GB) subvolume with  
>>>>> dozens of snapshots, _if you were lucky_ (if you weren't, you  
>>>>> would be looking at potentially multiple _hours_ of runtime for  
>>>>> the defrag).  The performance scaled inversely proportionate to  
>>>>> the number of reflinks involved and the total amount of data in  
>>>>> the subvolume being defragmented, and was pretty bad even in the  
>>>>> case of only a couple of snapshots.
>>>>
>>>> You cannot ever make the worst program, because an even worse  
>>>> program can be made by slowing down the original by a factor of 2.
>>>> So, you had a badly implemented defrag. At least you got some  
>>>> experience. Let's see what went wrong.
>>>>
>>>>> Ultimately, there are a couple of issues at play here:
>>>>>
>>>>> * Online defrag has to maintain consistency during operation.   
>>>>> The current implementation does this by rewriting the regions  
>>>>> being defragmented (which causes them to become a single new  
>>>>> extent (most of the time)), which avoids a whole lot of  
>>>>> otherwise complicated logic required to make sure things happen  
>>>>> correctly, and also means that only the file being operated on  
>>>>> is impacted and only the parts being modified need to be  
>>>>> protected against concurrent writes.  Properly handling reflinks  
>>>>> means that _every_ file that shares some part of an extent with  
>>>>> the file being operated on needs to have the reflinked regions  
>>>>> locked for the defrag operation, which has a huge impact on  
>>>>> performance. Using your example, the update to E55 in both files  
>>>>> A and B has to happen as part of the same commit, which can  
>>>>> contain no other writes in that region of the file, otherwise  
>>>>> you run the risk of losing writes to file B that occur while  
>>>>> file A is being defragmented.
>>>>
>>>> Nah. I think there is a workaround. You can first (atomically)  
>>>> update A, then whatever, then you can update B later. I know,  
>>>> you're yelling "what if E55 gets updated in B". Doesn't matter. The
>>>> defrag continues later by searching for reflink to E55 in B. Then  
>>>> it checks the data contained in E55. If the data matches the E70,  
>>>> then it can safely update the reflink in B. Or the defrag can  
>>>> just verify that neither E55 nor E70 have been written to in the  
>>>> meantime. That means they still have the same data.
>>
>>> So, IOW, you don't care if the total space used by the data is  
>>> instantaneously larger than what you started with?  That seems to  
>>> be at odds with your previous statements, but OK, if we allow for  
>>> that then this is indeed a non-issue.
>>
>> It is normal and common for defrag operation to use some disk space  
>> while it is running. I estimate that a reasonable limit would be to  
>> use up to 1% of total partition size. So, if a partition size is  
>> 100 GB, the defrag can use 1 GB. Lets call this "defrag operation  
>> space".
>>
>> The defrag should, when started, verify that there is "sufficient  
>> free space" on the partition. In the case that there is no  
>> sufficient free space, the defrag should output the message to the  
>> user and abort. The size of "sufficient free space" must be larger  
>> than the "defrag operation space". I would estimate that a good  
>> limit would be 2% of the partition size. "defrag operation space"  
>> is a part of "sufficient free space" while defrag operation is in  
>> progress.
>>
>> If, during defrag operation, sufficient free space drops below 2%,  
>> the defrag should output a message and abort. Another possibility  
>> is for defrag to pause until the user frees some disk space, but  
>> this is not common in other defrag implementations AFAIK.
>>
>>>>> It's not horrible when it's just a small region in two files,  
>>>>> but it becomes a big issue when dealing with lots of files  
>>>>> and/or particularly large extents (extents in BTRFS can get into  
>>>>> the GB range in terms of size when dealing with really big files).
>>>>
>>>> You must just split large extents in a smart way. So, in the  
>>>> beginning, the defrag can split large extents (2GB) into smaller  
>>>> ones (32MB) to facilitate more responsive and easier defrag.
>>>>
>>>> If you have lots of files, update them one-by one. It is  
>>>> possible. Or you can update in big batches. Whatever is faster.
>>
>>> Neither will solve this though.  Large numbers of files are an  
>>> issue because the operation is expensive and has to be done on  
>>> each file, not because the number of files somehow makes the  
>>> operation more espensive. It's O(n) relative to files, not higher  
>>> time complexity.
>>
>> I would say that updating in big batches helps a lot, to the point  
>> that it gets almost as fast as defragging any other file system.  
>> What defrag needs to do is to write a big bunch of defragged file  
>> (data) extents to the disk, and then update the b-trees. What  
>> happens is that many of the updates to the b-trees would fall into  
>> the same disk sector/extent, so instead of many writes there will  
>> be just one write.
>>
>> Here is the general outline for implementation:
>>     - write a big bunch of defragged file extents to disk
>>         - a minimal set of updates of the b-trees that cannot be  
>> delayed is performed (this is nothing or almost nothing in most  
>> circumstances)
>>         - put the rest of required updates of b-trees into "pending  
>> operations buffer"
>>     - analyze the "pending operations buffer", and find out  
>> (approximately) the biggest part of it that can be flushed out by  
>> doing minimal number of disk writes
>>         - flush out that part of "pending operations buffer"
>>     - repeat

> It helps, but you still can't get around having to recompute the new  
> tree state, and that is going to take time proportionate to the  
> number of nodes that need to change, which in turn is proportionate  
> to the number of files.

Yes, but that is just a computation. The defrag performance mostly  
depends on minimizing disk I/O operations, not on computations.

In the past, many good and fast defrag algorithms have been
produced, and I don't see any reason why this project wouldn't also be
able to create such a good algorithm.

>>>> The point is that the defrag can keep a buffer of a "pending  
>>>> operations". Pending operations are those that should be  
>>>> performed in order to keep the original sharing structure. If the  
>>>> defrag gets interrupted, then files in "pending operations" will  
>>>> be unshared. But this should really be some important and urgent  
>>>> interrupt, as the "pending operations" buffer needs at most a  
>>>> second or two to complete its operations.
>>
>>> Depending on the exact situation, it can take well more than a few  
>>> seconds to complete stuff. Especially if there are lots of reflinks.
>>
>> Nope. You are quite wrong there.
>> In the worst case, the "pending operations buffer" will update  
>> (write to disk) all the b-trees. So, the upper limit on time to  
>> flush the "pending operations buffer" equals the time to write the  
>> entire b-tree structure to the disk (into new extents). I estimate  
>> that takes at most a few seconds.

> So what you're talking about is journaling the computed state of  
> defrag operations.  That shouldn't be too bad (as long as it's done  
> in memory instead of on-disk) if you batch the computations  
> properly.  I thought you meant having a buffer of what operations to  
> do, and then computing them on-the-fly (which would have significant  
> overhead).

Looks close to what I was thinking. Soon we might be able to  
communicate. I'm not sure what you mean by "journaling the computed  
state of defrag operations". Maybe it doesn't matter.

What happens is that the file (extent) data is first written to disk
(defragmented), but the b-trees are not immediately updated. They don't
have to be. Even if there is a power loss, nothing is lost: the old
trees are still the committed ones.

So, the changes that should be made to the b-trees are put into the
pending-operations buffer. When enough file (extent) data has been
written to disk that the defrag-operation space (1 GB) is close to
being exhausted, the pending-operations buffer is examined in order to
free as much of the defrag-operation space as possible. The simplest
algorithm is to flush the entire pending-operations buffer at once.
This reduces the number of writes that update the b-trees, because
many changes to the b-trees fall into the same or neighbouring disk
sectors.
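
To make the batching idea concrete, here is a minimal userspace sketch
of such a buffer (all names, such as pending_op and flush_pending, are
hypothetical; this is a model, not btrfs code). It accumulates b-tree
updates and flushes them in one pass, sorted by the metadata block they
touch, so that many buffered updates collapse into a single block write:

/* Sketch only: a userspace model of the "pending operations buffer".
 * All names here are hypothetical; none of this is btrfs kernel API. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct pending_op {
    uint64_t tree_block;   /* metadata block this b-tree update touches  */
    uint64_t new_extent;   /* new on-disk location of the defragged data */
};

static int by_block(const void *a, const void *b)
{
    const struct pending_op *x = a, *y = b;
    if (x->tree_block < y->tree_block) return -1;
    if (x->tree_block > y->tree_block) return 1;
    return 0;
}

/* Flush the whole buffer at once: sort by metadata block so that all
 * updates hitting the same block are applied together and written out
 * with one disk write instead of one write per defragged extent. */
static void flush_pending(struct pending_op *ops, size_t n)
{
    size_t writes = 0;

    qsort(ops, n, sizeof(*ops), by_block);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || ops[i].tree_block != ops[i - 1].tree_block)
            writes++;       /* first update to this block: one write */
        /* apply ops[i] to the in-memory copy of the block here */
    }
    printf("%zu buffered updates -> %zu metadata block writes\n",
           n, writes);
}

int main(void)
{
    /* e.g. 1000 small updates that land in only a handful of blocks */
    static struct pending_op ops[1000];

    for (size_t i = 0; i < 1000; i++) {
        ops[i].tree_block = 16384 * (i / 200); /* ~200 items per block */
        ops[i].new_extent = (uint64_t)i * 4096;
    }
    flush_pending(ops, 1000);
    return 0;
}

Run as-is, it reports 1000 buffered updates collapsing into 5 metadata
block writes, which is the effect the batching is after.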

>>>>> * Reflinks can reference partial extents.  This means,  
>>>>> ultimately, that you may end up having to split extents in odd  
>>>>> ways during defrag if you want to preserve reflinks, and might  
>>>>> have to split extents _elsewhere_ that are only tangentially  
>>>>> related to the region being defragmented. See the example in my  
>>>>> previous email for a case like this, maintaining the shared  
>>>>> regions as being shared when you defragment either file to a  
>>>>> single extent will require splitting extents in the other file  
>>>>> (in either case, whichever file you don't defragment to a single  
>>>>> extent will end up having 7 extents if you try to force the one  
>>>>> that's been defragmented to be the canonical version).  Once you  
>>>>> consider that a given extent can have multiple ranges reflinked  
>>>>> from multiple other locations, it gets even more complicated.
>>>>
>>>> I think that this problem can be solved, and that it can be  
>>>> solved perfectly (the result is a perfectly-defragmented file).  
>>>> But, if it is so hard to do, just skip those problematic extents  
>>>> in initial version of defrag.
>>>>
>>>> Ultimately, in the super-duper defrag, those partially-referenced  
>>>> extents should be split up by defrag.
>>>>
>>>>> * If you choose to just not handle the above point by not  
>>>>> letting defrag split extents, you put a hard lower limit on the  
>>>>> amount of fragmentation present in a file if you want to  
>>>>> preserve reflinks.  IOW, you can't defragment files past a  
>>>>> certain point.  If we go this way, neither of the two files in  
>>>>> the example from my previous email could be defragmented any  
>>>>> further than they already are, because doing so would require  
>>>>> splitting extents.
>>>>
>>>> Oh, you're reading my thoughts. That's good.
>>>>
>>>> Initial implementation of defrag might be not-so-perfect. It  
>>>> would still be better than the current defrag.
>>>>
>>>> This is not a one-way street. Handling of partially-used extents  
>>>> can be improved in later versions.
>>>>
>>>>> * Determining all the reflinks to a given region of a given  
>>>>> extent is not a cheap operation, and the information may  
>>>>> immediately be stale (because an operation right after you fetch  
>>>>> the info might change things).  We could work around this by  
>>>>> locking the extent somehow, but doing so would be expensive  
>>>>> because you would have to hold the lock for the entire defrag  
>>>>> operation.
>>>>
>>>> No. DO NOT LOCK TO RETRIEVE REFLINKS.
>>>>
>>>> Instead, you have to create a hook in every function that updates  
>>>> the reflink structure or extents (for exaple, write-to-file  
>>>> operation). So, when a reflink gets changed, the defrag is  
>>>> immediately notified about this. That way the defrag can keep its  
>>>> data about reflinks in-sync with the filesystem.
>>
>>> This doesn't get around the fact that it's still an expensive  
>>> operation to enumerate all the reflinks for a given region of a  
>>> file or extent.
>>
>> No, you are wrong.
>>
>> In order to enumerate all the reflinks in a region, the defrag  
>> needs to have another array, which is also kept in memory and in  
>> sync with the filesystem. It is the easiest to divide the disk into  
>> regions of equal size, where each region is a few MB large. Lets  
>> call this array "regions-to-extents" array. This array doesn't need  
>> to be associative, it is a plain array.
>> This in-memory array links regions of disk to extents that are in  
>> the region. The array in initialized when defrag starts.
>>
>> This array makes the operation of finding all extents of a region  
>> extremely fast.
> That has two issues:
>
> * That's going to be a _lot_ of memory.  You still need to be able  
> to defragment big (dozens plus TB) arrays without needing multiple  
> GB of RAM just for the defrag operation, otherwise it's not  
> realistically useful (remember, it was big arrays that had issues  
> with the old reflink-aware defrag too).

Ok, but let's do some calculations here. If regions are 4 MB in
size, the region-extents array for an 8 TB partition would have 2
million entries. If entries average 64 bytes, that comes to:

  - a total of 128 MB of memory for an 8 TB partition.

Of course, I'm guessing a lot of the numbers here, but it should be doable.
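
As a sanity check on that estimate, here is the same arithmetic as a
tiny C program (the 4 MB region size and the 64-byte average entry are
the same guesses as above; nothing about the entry layout is real):

/* Back-of-envelope sizing for the hypothetical regions-to-extents
 * array; the per-entry size is the same guess as in the text above. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t disk_size   = 8ULL << 40;  /* 8 TB partition        */
    const uint64_t region_size = 4ULL << 20;  /* 4 MB per region       */
    const uint64_t entry_bytes = 64;          /* guessed average entry */

    uint64_t regions = disk_size / region_size;
    uint64_t total   = regions * entry_bytes;

    printf("regions:    %llu\n", (unsigned long long)regions);
    printf("array size: %llu MiB\n", (unsigned long long)(total >> 20));
    return 0;
}

It prints 2097152 regions and 128 MiB, matching the figure above.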

> * You still have to populate the array in the first place.  A sane  
> implementation wouldn't be keeping it in memory even when defrag is  
> not running (no way is anybody going to tolerate even dozens of MB  
> of memory overhead for this), so you're not going to get around the  
> need to enumerate all the reflinks for a file at least once (during  
> startup, or when starting to process that file), so you're just  
> moving the overhead around instead of eliminating it.

Yes, when the defrag starts, the entire b-tree structure is examined
in order to populate the region-extents array and the extents-backref
associative array.

Of course, those two arrays exist only for the duration of the defrag
operation. When the defrag completes, they are deallocated.
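
For illustration, a sketch of what that one-time population pass could
look like, with invented types (extent_info, region and populate are
all hypothetical; a real implementation would walk the extent tree
rather than a flat list, and error handling is omitted):

/* Sketch of the start-up population pass: walk every extent once and
 * record which 4 MB region(s) of the disk it overlaps.  All types and
 * names are hypothetical; this is not the btrfs on-disk format. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define REGION_SHIFT 22                      /* 4 MB regions */

struct extent_info { uint64_t physical, length; };

struct region {
    struct extent_info **extents;            /* extents overlapping this region */
    size_t count, cap;
};

static void region_add(struct region *r, struct extent_info *e)
{
    if (r->count == r->cap) {                /* grow the list (no error checks) */
        r->cap = r->cap ? r->cap * 2 : 4;
        r->extents = realloc(r->extents, r->cap * sizeof(*r->extents));
    }
    r->extents[r->count++] = e;
}

/* Populate regions[] from a flat list of extents; in the real thing the
 * list would come from examining the b-trees when the defrag starts. */
static void populate(struct region *regions, struct extent_info *ext, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t first = ext[i].physical >> REGION_SHIFT;
        uint64_t last  = (ext[i].physical + ext[i].length - 1) >> REGION_SHIFT;
        for (uint64_t r = first; r <= last; r++)
            region_add(&regions[r], &ext[i]);
    }
}

int main(void)
{
    struct region regions[16] = { 0 };       /* 16 regions = 64 MB of disk  */
    struct extent_info ext[] = {
        { 1 << 20, 128 << 10 },              /* extent inside region 0      */
        { (4 << 20) - 4096, 64 << 10 },      /* extent spanning regions 0-1 */
    };

    populate(regions, ext, 2);
    for (int r = 0; r < 16; r++)
        if (regions[r].count)
            printf("region %d: %zu extent(s)\n", r, regions[r].count);
    return 0;
}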

>>> It also allows a very real possibility of a user functionally  
>>> delaying the defrag operation indefinitely (by triggering a  
>>> continuous stream of operations that would cause reflink changes  
>>> for a file being operated on by defrag) if not implemented very  
>>> carefully.
>>
>> Yes, if a user does something like that, the defrag can be paused  
>> or even aborted. That is normal.
> Not really.  Most defrag implementations either avoid files that  
> could reasonably be written to, or freeze writes to the file they're  
> operating on, or in some other way just sidestep the issue without  
> delaying the defragmentation process.
>>
>> There are many ways around this problem, but it really doesn't  
>> matter, those are just details. The initial version of defrag can  
>> just abort. The more mature versions of defrag can have a better  
>> handling of this problem.

> Details like this are the deciding factor for whether something is  
> sanely usable in certain use cases, as you have yourself found out  
> (for a lot of users, the fact that defrag can unshare extents is  
> 'just a detail' that's not worth worrying about).

I wouldn't agree there.

Not every issue is equal. Some issues are more important, some are  
trivial, some are tolerable etc...

The defrag is usually allowed to abort; it can easily be restarted
later. Workaround: you can make a defrag-supervisor program which
starts a defrag and, if the defrag aborts, restarts it after some
(configurable) amount of time.
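
For illustration, such a supervisor can be almost trivial. The sketch
below assumes only that the defrag is an ordinary command that exits
non-zero when it aborts; the command line and the retry delay are
placeholders, not a recommendation:

/* Minimal "defrag-supervisor": run a defrag command, and if it aborts
 * (non-zero exit), wait a while and run it again. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    const char *cmd[] = { "btrfs", "filesystem", "defragment", "-r", "/mnt", NULL };
    const unsigned retry_delay = 600;        /* seconds between attempts */

    for (;;) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            execvp(cmd[0], (char * const *)cmd);
            _exit(127);                      /* exec failed */
        }
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            puts("defrag completed");
            return 0;
        }
        fprintf(stderr, "defrag aborted, retrying in %u s\n", retry_delay);
        sleep(retry_delay);
    }
}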

On the other hand, unsharing is not easy to undo.

So, those issues are not equal.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 21:34                             ` General Zed
@ 2019-09-12 22:28                               ` Chris Murphy
  2019-09-12 22:57                                 ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Chris Murphy @ 2019-09-12 22:28 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
>
>
> Quoting Chris Murphy <lists@colorremedies.com>:
>
> > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
> >>
> >> It is normal and common for defrag operation to use some disk space
> >> while it is running. I estimate that a reasonable limit would be to
> >> use up to 1% of total partition size. So, if a partition size is 100
> >> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
> >
> > The simplest case of a file with no shared extents, the minimum free
> > space should be set to the potential maximum rewrite of the file, i.e.
> > 100% of the file size. Since Btrfs is COW, the entire operation must
> > succeed or fail, no possibility of an ambiguous in between state, and
> > this does apply to defragment.
> >
> > So if you're defragging a 10GiB file, you need 10GiB minimum free
> > space to COW those extents to a new, mostly contiguous, set of exents,
>
> False.
>
> You can defragment just 1 GB of that file, and then just write out to
> disk (in new extents) an entire new version of b-trees.
> Of course, you don't really need to do all that, as usually only a
> small part of the b-trees need to be updated.

The `-l` option allows the user to choose a maximum amount to
defragment. Setting up a default defragment behavior that has a
variable outcome is not idempotent and probably not a good idea.

As for kernel behavior, it presumably could defragment in portions,
but it would have to completely update all affected metadata after
each e.g. 1GiB section, translating into 10 separate rewrites of file
metadata, all affected nodes, all the way up the tree to the super.
There is no such thing as metadata overwrites in Btrfs. You're
familiar with the wandering trees problem?


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 19:54                           ` Austin S. Hemmelgarn
  2019-09-12 22:21                             ` General Zed
@ 2019-09-12 22:47                             ` General Zed
  1 sibling, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-12 22:47 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-12 15:18, webmaster@zedlx.com wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>>> On 2019-09-11 17:37, webmaster@zedlx.com wrote:
>>>>
>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>
>>>>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>>>>
>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>
>>>>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>>>>
>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>
>>>>

>>>>> * Reflinks can reference partial extents.  This means,  
>>>>> ultimately, that you may end up having to split extents in odd  
>>>>> ways during defrag if you want to preserve reflinks, and might  
>>>>> have to split extents _elsewhere_ that are only tangentially  
>>>>> related to the region being defragmented. See the example in my  
>>>>> previous email for a case like this, maintaining the shared  
>>>>> regions as being shared when you defragment either file to a  
>>>>> single extent will require splitting extents in the other file  
>>>>> (in either case, whichever file you don't defragment to a single  
>>>>> extent will end up having 7 extents if you try to force the one  
>>>>> that's been defragmented to be the canonical version).  Once you  
>>>>> consider that a given extent can have multiple ranges reflinked  
>>>>> from multiple other locations, it gets even more complicated.
>>>>
>>>> I think that this problem can be solved, and that it can be  
>>>> solved perfectly (the result is a perfectly-defragmented file).  
>>>> But, if it is so hard to do, just skip those problematic extents  
>>>> in initial version of defrag.
>>>>
>>>> Ultimately, in the super-duper defrag, those partially-referenced  
>>>> extents should be split up by defrag.
>>>>
>>>>> * If you choose to just not handle the above point by not  
>>>>> letting defrag split extents, you put a hard lower limit on the  
>>>>> amount of fragmentation present in a file if you want to  
>>>>> preserve reflinks.  IOW, you can't defragment files past a  
>>>>> certain point.  If we go this way, neither of the two files in  
>>>>> the example from my previous email could be defragmented any  
>>>>> further than they already are, because doing so would require  
>>>>> splitting extents.
>>>>
>>>> Oh, you're reading my thoughts. That's good.
>>>>
>>>> Initial implementation of defrag might be not-so-perfect. It  
>>>> would still be better than the current defrag.
>>>>
>>>> This is not a one-way street. Handling of partially-used extents  
>>>> can be improved in later versions.
>>>>
>>>>> * Determining all the reflinks to a given region of a given  
>>>>> extent is not a cheap operation, and the information may  
>>>>> immediately be stale (because an operation right after you fetch  
>>>>> the info might change things).  We could work around this by  
>>>>> locking the extent somehow, but doing so would be expensive  
>>>>> because you would have to hold the lock for the entire defrag  
>>>>> operation.
>>>>
>>>> No. DO NOT LOCK TO RETRIEVE REFLINKS.
>>>>
>>>> Instead, you have to create a hook in every function that updates  
>>>> the reflink structure or extents (for exaple, write-to-file  
>>>> operation). So, when a reflink gets changed, the defrag is  
>>>> immediately notified about this. That way the defrag can keep its  
>>>> data about reflinks in-sync with the filesystem.
>>
>>> This doesn't get around the fact that it's still an expensive  
>>> operation to enumerate all the reflinks for a given region of a  
>>> file or extent.
>>
>> No, you are wrong.
>>
>> In order to enumerate all the reflinks in a region, the defrag  
>> needs to have another array, which is also kept in memory and in  
>> sync with the filesystem. It is the easiest to divide the disk into  
>> regions of equal size, where each region is a few MB large. Lets  
>> call this array "regions-to-extents" array. This array doesn't need  
>> to be associative, it is a plain array.
>> This in-memory array links regions of disk to extents that are in  
>> the region. The array in initialized when defrag starts.
>>
>> This array makes the operation of finding all extents of a region  
>> extremely fast.
> That has two issues:
>
> * That's going to be a _lot_ of memory.  You still need to be able  
> to defragment big (dozens plus TB) arrays without needing multiple  
> GB of RAM just for the defrag operation, otherwise it's not  
> realistically useful (remember, it was big arrays that had issues  
> with the old reflink-aware defrag too).

> * You still have to populate the array in the first place.  A sane  
> implementation wouldn't be keeping it in memory even when defrag is  
> not running (no way is anybody going to tolerate even dozens of MB  
> of memory overhead for this), so you're not going to get around the  
> need to enumerate all the reflinks for a file at least once (during  
> startup, or when starting to process that file), so you're just  
> moving the overhead around instead of eliminating it.

Nope, I'm not just "moving the overhead around instead of eliminating  
it", I am eliminating it.

The only overhead is at defrag startup, when the entire b-tree  
structure has to be loaded and examined. That happens in a few seconds.

After this point, there is no more "overhead", because the running  
defrag is always notified of any changes to the b-trees (by hooks in  
the b-tree update routines). Whenever there is such a change, the  
region-extents array gets updated. Since this region-extents array is  
in memory, the update is so fast that it can be considered zero  
overhead.
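
A sketch of that notification scheme (the hook registration, the
callback and the per-region counters are all invented names; nothing
here is existing btrfs kernel API, it only illustrates the idea):

/* The (hypothetical) extent-update path calls a registered callback, and
 * the running defrag uses it to keep its in-memory region bookkeeping
 * current without re-reading any metadata from disk. */
#include <stdio.h>
#include <stdint.h>

#define REGION_SHIFT 22                      /* 4 MB regions */
#define NR_REGIONS   16

/* callback invoked whenever an extent is added or removed */
typedef void (*extent_hook_t)(uint64_t physical, uint64_t length, int added);

static extent_hook_t extent_hook;            /* NULL while no defrag runs */

static void register_extent_hook(extent_hook_t fn) { extent_hook = fn; }

/* stand-in for the filesystem's own extent bookkeeping */
static void fs_record_extent(uint64_t physical, uint64_t length, int added)
{
    /* ... the real b-tree update would happen here ... */
    if (extent_hook)
        extent_hook(physical, length, added); /* notify the running defrag */
}

/* defrag side: per-region extent counters kept in memory */
static long region_extents[NR_REGIONS];

static void defrag_on_extent_change(uint64_t physical, uint64_t length, int added)
{
    uint64_t first = physical >> REGION_SHIFT;
    uint64_t last  = (physical + length - 1) >> REGION_SHIFT;
    for (uint64_t r = first; r <= last && r < NR_REGIONS; r++)
        region_extents[r] += added ? 1 : -1;
}

int main(void)
{
    register_extent_hook(defrag_on_extent_change);  /* defrag starts    */
    fs_record_extent(1 << 20, 128 << 10, 1);        /* a write lands... */
    fs_record_extent(5 << 20, 64 << 10, 1);         /* ...and another   */
    for (int r = 0; r < NR_REGIONS; r++)
        if (region_extents[r])
            printf("region %d: %ld tracked extent(s)\n", r, region_extents[r]);
    return 0;
}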



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 22:28                               ` Chris Murphy
@ 2019-09-12 22:57                                 ` General Zed
  2019-09-12 23:54                                   ` Zygo Blaxell
  2019-09-13 11:09                                   ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 111+ messages in thread
From: General Zed @ 2019-09-12 22:57 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Chris Murphy <lists@colorremedies.com>:

> On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
>>
>>
>> Quoting Chris Murphy <lists@colorremedies.com>:
>>
>> > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>> >>
>> >> It is normal and common for defrag operation to use some disk space
>> >> while it is running. I estimate that a reasonable limit would be to
>> >> use up to 1% of total partition size. So, if a partition size is 100
>> >> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
>> >
>> > The simplest case of a file with no shared extents, the minimum free
>> > space should be set to the potential maximum rewrite of the file, i.e.
>> > 100% of the file size. Since Btrfs is COW, the entire operation must
>> > succeed or fail, no possibility of an ambiguous in between state, and
>> > this does apply to defragment.
>> >
>> > So if you're defragging a 10GiB file, you need 10GiB minimum free
>> > space to COW those extents to a new, mostly contiguous, set of exents,
>>
>> False.
>>
>> You can defragment just 1 GB of that file, and then just write out to
>> disk (in new extents) an entire new version of b-trees.
>> Of course, you don't really need to do all that, as usually only a
>> small part of the b-trees need to be updated.
>
> The `-l` option allows the user to choose a maximum amount to
> defragment. Setting up a default defragment behavior that has a
> variable outcome is not idempotent and probably not a good idea.

We are talking about a future, imagined defrag. It has no -l option,  
as we haven't discussed one yet.

> As for kernel behavior, it presumably could defragment in portions,
> but it would have to completely update all affected metadata after
> each e.g. 1GiB section, translating into 10 separate rewrites of file
> metadata, all affected nodes, all the way up the tree to the super.
> There is no such thing as metadata overwrites in Btrfs. You're
> familiar with the wandering trees problem?

No, but it doesn't matter.

At worst, it just has to completely write-out "all metadata", all the  
way up to the super. It needs to be done just once, because what's the  
point of writing it 10 times over? Then, the super is updated as the  
final commit.

On my computer the ENTIRE METADATA is 1 GB. That would be very  
tolerable and doable.

But that is a very bad case, because usually not much metadata has to  
be updated or written out to disk.

So, there is no problem.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 22:57                                 ` General Zed
@ 2019-09-12 23:54                                   ` Zygo Blaxell
  2019-09-13  0:26                                     ` General Zed
  2019-09-13 11:04                                     ` Austin S. Hemmelgarn
  2019-09-13 11:09                                   ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-12 23:54 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
> 
> Quoting Chris Murphy <lists@colorremedies.com>:
> 
> > On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
> > > 
> > > 
> > > Quoting Chris Murphy <lists@colorremedies.com>:
> > > 
> > > > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
> > > >>
> > > >> It is normal and common for defrag operation to use some disk space
> > > >> while it is running. I estimate that a reasonable limit would be to
> > > >> use up to 1% of total partition size. So, if a partition size is 100
> > > >> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
> > > >
> > > > The simplest case of a file with no shared extents, the minimum free
> > > > space should be set to the potential maximum rewrite of the file, i.e.
> > > > 100% of the file size. Since Btrfs is COW, the entire operation must
> > > > succeed or fail, no possibility of an ambiguous in between state, and
> > > > this does apply to defragment.
> > > >
> > > > So if you're defragging a 10GiB file, you need 10GiB minimum free
> > > > space to COW those extents to a new, mostly contiguous, set of exents,
> > > 
> > > False.
> > > 
> > > You can defragment just 1 GB of that file, and then just write out to
> > > disk (in new extents) an entire new version of b-trees.
> > > Of course, you don't really need to do all that, as usually only a
> > > small part of the b-trees need to be updated.
> > 
> > The `-l` option allows the user to choose a maximum amount to
> > defragment. Setting up a default defragment behavior that has a
> > variable outcome is not idempotent and probably not a good idea.
> 
> We are talking about a future, imagined defrag. It has no -l option yet, as
> we haven't discussed it yet.
> 
> > As for kernel behavior, it presumably could defragment in portions,
> > but it would have to completely update all affected metadata after
> > each e.g. 1GiB section, translating into 10 separate rewrites of file
> > metadata, all affected nodes, all the way up the tree to the super.
> > There is no such thing as metadata overwrites in Btrfs. You're
> > familiar with the wandering trees problem?
> 
> No, but it doesn't matter.
> 
> At worst, it just has to completely write-out "all metadata", all the way up
> to the super. It needs to be done just once, because what's the point of
> writing it 10 times over? Then, the super is updated as the final commit.

This is kind of a silly discussion.  The biggest extent possible on
btrfs is 128MB, and the incremental gains of forcing 128MB extents to
be consecutive are negligible.  If you're defragging a 10GB file, you're
just going to end up doing 80 separate defrag operations.

128MB is big enough you're going to be seeking in the middle of reading
an extent anyway.  Once you have the file arranged in 128MB contiguous
fragments (or even a tenth of that on medium-fast spinning drives),
the job is done.

> On my comouter the ENTIRE METADATA is 1 GB. That would be very tolerable and
> doable.

You must have a small filesystem...mine range from 16 to 156GB, a bit too
big to fit in RAM comfortably.

Don't forget you have to write new checksum and free space tree pages.
In the worst case, you'll need about 1GB of new metadata pages for each
128MB you defrag (though you get to delete 99.5% of them immediately
after).

> But that is a very bad case, because usually not much metadata has to be
> updated or written out to disk.
> 
> So, there is no problem.
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 23:54                                   ` Zygo Blaxell
@ 2019-09-13  0:26                                     ` General Zed
  2019-09-13  3:12                                       ` Zygo Blaxell
  2019-09-13 11:04                                     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-13  0:26 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>>
>> Quoting Chris Murphy <lists@colorremedies.com>:
>>
>> > On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
>> > >
>> > >
>> > > Quoting Chris Murphy <lists@colorremedies.com>:
>> > >
>> > > > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>> > > >>
>> > > >> It is normal and common for defrag operation to use some disk space
>> > > >> while it is running. I estimate that a reasonable limit would be to
>> > > >> use up to 1% of total partition size. So, if a partition size is 100
>> > > >> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
>> > > >
>> > > > The simplest case of a file with no shared extents, the minimum free
>> > > > space should be set to the potential maximum rewrite of the file, i.e.
>> > > > 100% of the file size. Since Btrfs is COW, the entire operation must
>> > > > succeed or fail, no possibility of an ambiguous in between state, and
>> > > > this does apply to defragment.
>> > > >
>> > > > So if you're defragging a 10GiB file, you need 10GiB minimum free
>> > > > space to COW those extents to a new, mostly contiguous, set of exents,
>> > >
>> > > False.
>> > >
>> > > You can defragment just 1 GB of that file, and then just write out to
>> > > disk (in new extents) an entire new version of b-trees.
>> > > Of course, you don't really need to do all that, as usually only a
>> > > small part of the b-trees need to be updated.
>> >
>> > The `-l` option allows the user to choose a maximum amount to
>> > defragment. Setting up a default defragment behavior that has a
>> > variable outcome is not idempotent and probably not a good idea.
>>
>> We are talking about a future, imagined defrag. It has no -l option yet, as
>> we haven't discussed it yet.
>>
>> > As for kernel behavior, it presumably could defragment in portions,
>> > but it would have to completely update all affected metadata after
>> > each e.g. 1GiB section, translating into 10 separate rewrites of file
>> > metadata, all affected nodes, all the way up the tree to the super.
>> > There is no such thing as metadata overwrites in Btrfs. You're
>> > familiar with the wandering trees problem?
>>
>> No, but it doesn't matter.
>>
>> At worst, it just has to completely write-out "all metadata", all the way up
>> to the super. It needs to be done just once, because what's the point of
>> writing it 10 times over? Then, the super is updated as the final commit.
>
> This is kind of a silly discussion.  The biggest extent possible on
> btrfs is 128MB, and the incremental gains of forcing 128MB extents to
> be consecutive are negligible.  If you're defragging a 10GB file, you're
> just going to end up doing 80 separate defrag operations.

Ok, then the max extent is 128 MB, that's fine. Someone here  
previously said that it is 2 GB, so he misinformed me (in order to  
further his false argument).

I never said that I would force extents larger than 128 MB.

If you are defragging a 10 GB file, you'll likely have to do it in 10  
steps, because the defrag is usually allowed to use only a limited  
amount of disk space while in operation. That has nothing to do with  
the extent size.

> 128MB is big enough you're going to be seeking in the middle of reading
> an extent anyway.  Once you have the file arranged in 128MB contiguous
> fragments (or even a tenth of that on medium-fast spinning drives),
> the job is done.

Ok. When did I say anything different?

>> On my comouter the ENTIRE METADATA is 1 GB. That would be very tolerable and
>> doable.
>
> You must have a small filesystem...mine range from 16 to 156GB, a bit too
> big to fit in RAM comfortably.

You mean: the total metadata size is 156 GB on one of your systems.  
However, you don't typically have to put ALL of the metadata in RAM.  
You need just the parts involved in the defrag operation. So, for  
defrag, what you really need is just some large metadata cache  
present in RAM. I would say that if such a metadata cache uses 128 MB  
(for a 2 TB disk) to 2 GB (for the 156 GB metadata case), then the  
defrag will run sufficiently fast.

> Don't forget you have to write new checksum and free space tree pages.
> In the worst case, you'll need about 1GB of new metadata pages for each
> 128MB you defrag (though you get to delete 99.5% of them immediately
> after).

Yes, here we are debating some worst-case scenario which is actually  
impossible in practice for various reasons.

So, it doesn't matter.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-11 21:42                       ` Zygo Blaxell
@ 2019-09-13  1:33                         ` General Zed
  0 siblings, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-13  1:33 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Austin S. Hemmelgarn, linux-btrfs


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Wed, Sep 11, 2019 at 04:01:01PM -0400, webmaster@zedlx.com wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>

>> > Not necessarily. Even ignoring the case of data deduplication (which
>> > needs to be considered if you care at all about enterprise usage, and is
>> > part of the whole point of using a CoW filesystem), there are existing
>> > applications that actively use reflinks, either directly or indirectly
>> > (via things like the `copy_file_range` system call), and the number of
>> > such applications is growing.
>>
>> The same argument goes here: If data-deduplication was performed, then the
>> user has specifically requested it.
>> Therefore, since it was user's will, the defrag has to honor it, and so the
>> defrag must not unshare deduplicated extents because the user wants them
>> shared. This might prevent a perfect defrag, but that is exactly what the
>> user has requested, either directly or indirectly, by some policy he has
>> choosen.

>> You can't both perfectly defrag and honor deduplication. Therefore, the
>> defrag has to do the best possible thing while still honoring user's will.
>> <<<!!! So, the fact that the deduplication was performed is actually the
>> reason FOR not unsharing, not against it, as you made it look in that
>> paragraph. !!!>>>
>
> IMHO the current kernel 'defrag' API shouldn't be used any more.  We need
> a tool that handles dedupe and defrag at the same time, for precisely
> this reason:  currently the two operations have no knowledge of each
> other and duplicate or reverse each others work.  You don't need to defrag
> an extent if you can find a duplicate, and you don't want to use fragmented
> extents as dedupe sources.

Yes! The current defrag that you have is a bad counterpart to deduplication.

To preserve deduplication, you need the defrag that I suggested: the  
defrag which never unshares file data.

>> If the system unshares automatically after deduplication, then the user will
>> need to run deduplication again. Ridiculous!
>>
>> > > When a user creates a reflink to a file in the same subvolume, he is
>> > > willingly denying himself the assurance of a perfect defrag.
>> > > Because, as your example proves, if there are a few writes to BOTH
>> > > files, it gets impossible to defrag perfectly. So, if the user
>> > > creates such reflinks, it's his own whish and his own fault.
>>
>> > The same argument can be made about snapshots.  It's an invalid argument
>> > in both cases though because it's not always the user who's creating the
>> > reflinks or snapshots.
>>
>> Um, I don't agree.
>>
>> 1) Actually, it is always the user who is creating reflinks, and snapshots,
>> too. Ultimately, it's always the user who does absolutely everything,
>> because a computer is supposed to be under his full control. But, in the
>> case of reflink-copies, this is even more true
>> because reflinks are not an essential feature for normal OS operation, at
>> least as far as today's OSes go. Every OS has to copy files around. Every OS
>> requires the copy operation. No current OS requires the reflinked-copy
>> operation in order to function.
>
> If we don't do reflinks all day, every day, our disks fill up in a matter
> of hours...

The defrag which I am proposing will honor all your reflinks and  
won't ever unshare them without the user's specific request.

At the same time, it can still defrag this reflinked data, not  
perfectly, but almost as well as a perfect defrag.
So you can keep your reflinks and still have reasonably  
defragmented, fast disk IO.

You can have both. It can be done!




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  0:26                                     ` General Zed
@ 2019-09-13  3:12                                       ` Zygo Blaxell
  2019-09-13  5:05                                         ` General Zed
                                                           ` (4 more replies)
  0 siblings, 5 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-13  3:12 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> 
> > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
> > > 
> > > Quoting Chris Murphy <lists@colorremedies.com>:
> > > 
> > > > On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
> > > > >
> > > > >
> > > > > Quoting Chris Murphy <lists@colorremedies.com>:
> > > > >
> > > > > > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
> > > > > >>
> > > > > >> It is normal and common for defrag operation to use some disk space
> > > > > >> while it is running. I estimate that a reasonable limit would be to
> > > > > >> use up to 1% of total partition size. So, if a partition size is 100
> > > > > >> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
> > > > > >
> > > > > > The simplest case of a file with no shared extents, the minimum free
> > > > > > space should be set to the potential maximum rewrite of the file, i.e.
> > > > > > 100% of the file size. Since Btrfs is COW, the entire operation must
> > > > > > succeed or fail, no possibility of an ambiguous in between state, and
> > > > > > this does apply to defragment.
> > > > > >
> > > > > > So if you're defragging a 10GiB file, you need 10GiB minimum free
> > > > > > space to COW those extents to a new, mostly contiguous, set of exents,
> > > > >
> > > > > False.
> > > > >
> > > > > You can defragment just 1 GB of that file, and then just write out to
> > > > > disk (in new extents) an entire new version of b-trees.
> > > > > Of course, you don't really need to do all that, as usually only a
> > > > > small part of the b-trees need to be updated.
> > > >
> > > > The `-l` option allows the user to choose a maximum amount to
> > > > defragment. Setting up a default defragment behavior that has a
> > > > variable outcome is not idempotent and probably not a good idea.
> > > 
> > > We are talking about a future, imagined defrag. It has no -l option yet, as
> > > we haven't discussed it yet.
> > > 
> > > > As for kernel behavior, it presumably could defragment in portions,
> > > > but it would have to completely update all affected metadata after
> > > > each e.g. 1GiB section, translating into 10 separate rewrites of file
> > > > metadata, all affected nodes, all the way up the tree to the super.
> > > > There is no such thing as metadata overwrites in Btrfs. You're
> > > > familiar with the wandering trees problem?
> > > 
> > > No, but it doesn't matter.
> > > 
> > > At worst, it just has to completely write-out "all metadata", all the way up
> > > to the super. It needs to be done just once, because what's the point of
> > > writing it 10 times over? Then, the super is updated as the final commit.
> > 
> > This is kind of a silly discussion.  The biggest extent possible on
> > btrfs is 128MB, and the incremental gains of forcing 128MB extents to
> > be consecutive are negligible.  If you're defragging a 10GB file, you're
> > just going to end up doing 80 separate defrag operations.
> 
> Ok, then the max extent is 128 MB, that's fine. Someone here previously said
> that it is 2 GB, so he has disinformed me (in order to further his false
> argument).

If the 128MB limit is removed, you then hit the block group size limit,
which is some number of GB from 1 to 10 depending on number of disks
available and raid profile selection (the striping raid profiles cap
block group sizes at 10 disks, and single/raid1 profiles always use 1GB
block groups regardless of disk count).  So 2GB is _also_ a valid extent
size limit, just not the first limit that is relevant for defrag.

A lot of people get confused by 'filefrag -v' output, which coalesces
physically adjacent but distinct extents.  So if you use that tool,
it can _seem_ like there is a 2.5GB extent in a file, but it is really
20 distinct 128MB extents that start and end at adjacent addresses.
You can see the true structure in 'btrfs ins dump-tree' output.

That also brings up another reason why 10GB defrags are absurd on btrfs:
extent addresses are virtual.  There's no guarantee that a pair of extents
that meet at a block group boundary are physically adjacent, and after
operations like RAID array reorganization or free space defragmentation,
they are typically quite far apart physically.

> I didn't ever said that I would force extents larger than 128 MB.
> 
> If you are defragging a 10 GB file, you'll likely have to do it in 10 steps,
> because the defrag is usually allowed to only use a limited amount of disk
> space while in operation. That has nothing to do with the extent size.

Defrag is literally manipulating the extent size.  Fragments and extents
are the same thing in btrfs.

Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
commit metadata updates after each step, so more than 128MB of temporary
space may be used (especially if your disks are fast and empty,
and you start just after the end of the previous commit interval).
There are some opportunities to coalesce metadata updates, occupying up
to a (arbitrary) limit of 512MB of RAM (or when memory pressure forces
a flush, whichever comes first), but exploiting those opportunities
requires more space for uncommitted data.

If the filesystem starts to get low on space during a defrag, it can
inject commits to force metadata updates to happen more often, which
reduces the amount of temporary space needed (we can't delete the original
fragmented extents until their replacement extent is committed); however,
if the filesystem is so low on space that you're worried about running
out during a defrag, then you probably don't have big enough contiguous
free areas to relocate data into anyway, i.e. the defrag is just going to
push data from one fragmented location to a different fragmented location,
or bail out with "sorry, can't defrag that."

> > 128MB is big enough you're going to be seeking in the middle of reading
> > an extent anyway.  Once you have the file arranged in 128MB contiguous
> > fragments (or even a tenth of that on medium-fast spinning drives),
> > the job is done.
> 
> Ok. When did I say anything different?

There are multiple parties in this thread.  I'm not addressing just you.

> > > On my comouter the ENTIRE METADATA is 1 GB. That would be very tolerable and
> > > doable.
> > 
> > You must have a small filesystem...mine range from 16 to 156GB, a bit too
> > big to fit in RAM comfortably.
> 
> You mean: all metadata size is 156 GB on one of your systems. However, you
> don't typically have to put ALL metadata in RAM.
> You need just some parts needed for defrag operation. So, for defrag, what
> you really need is just some large metadata cache present in RAM. I would
> say that if such a metadata cache is using 128 MB (for 2 TB disk) to 2 GB
> (for 156 GB disk), than the defrag will run sufficiently fast.

You're missing something (metadata requirement for delete?) in those
estimates.

Total metadata size does not affect how much metadata cache you need
to defragment one extent quickly.  That number is a product of factors
including input and output and extent size ratio, the ratio of various
metadata item sizes to the metadata page size, and the number of trees you
have to update (number of reflinks + 3 for extent, csum, and free space
trees).

It follows from the above that if you're joining just 2 unshared extents
together, the total metadata required is well under a MB.

If you're defragging a 128MB journal file with 32768 4K extents, it can
create several GB of new metadata and spill out of RAM cache (which is
currently capped at 512MB for assorted reasons).  Add reflinks and you
might need more cache, or take a performance hit.  Yes, a GB might be
the total size of all your metadata, but if you run defrag on a 128MB
log file you could rewrite all of your filesystem's metadata in a single
transaction (almost...you probably won't need to update the device or
uuid trees).

If you want to pipeline multiple extents per commit to avoid seeking,
you need to multiply the above numbers by the size of the pipeline.

You can also reduce the metadata cache requirement by reducing the output
extent size.  A 16MB target extent size requires only 64MB of cache for
the logfile case.

> > Don't forget you have to write new checksum and free space tree pages.
> > In the worst case, you'll need about 1GB of new metadata pages for each
> > 128MB you defrag (though you get to delete 99.5% of them immediately
> > after).
> 
> Yes, here we are debating some worst-case scenaraio which is actually
> imposible in practice due to various reasons.

No, it's quite possible.  A log file written slowly on an active
filesystem above a few TB will do that accidentally.  Every now and then
I hit that case.  It can take several hours to do a logrotate on spinning
arrays because of all the metadata fetches and updates associated with
worst-case file delete.  Long enough to watch the delete happen, and
even follow along in the source code.

I guess if I did a proactive defrag every few hours, it might take less
time to do the logrotate, but that would mean spreading out all the
seeky IO load during the day instead of getting it all done at night.
Logrotate does the same job as defrag in this case (replacing a file in
thousands of fragments spread across the disk with a few large fragments
close together), except logrotate gets better compression.

To be more accurate, the example I gave above is the worst case you
can expect from normal user workloads.  If I throw in some reflinks
and snapshots, I can make it arbitrarily worse, until the entire disk
is consumed by the metadata update of a single extent defrag.

> So, doesn't matter.
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  3:12                                       ` Zygo Blaxell
@ 2019-09-13  5:05                                         ` General Zed
  2019-09-14  0:56                                           ` Zygo Blaxell
  2019-09-13  5:22                                         ` General Zed
                                                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-13  5:05 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > Don't forget you have to write new checksum and free space tree pages.
>> > In the worst case, you'll need about 1GB of new metadata pages for each
>> > 128MB you defrag (though you get to delete 99.5% of them immediately
>> > after).
>>
>> Yes, here we are debating some worst-case scenaraio which is actually
>> imposible in practice due to various reasons.
>
> No, it's quite possible.  A log file written slowly on an active
> filesystem above a few TB will do that accidentally.  Every now and then
> I hit that case.  It can take several hours to do a logrotate on spinning
> arrays because of all the metadata fetches and updates associated with
> worst-case file delete.  Long enough to watch the delete happen, and
> even follow along in the source code.
>
> I guess if I did a proactive defrag every few hours, it might take less
> time to do the logrotate, but that would mean spreading out all the
> seeky IO load during the day instead of getting it all done at night.
> Logrotate does the same job as defrag in this case (replacing a file in
> thousands of fragments spread across the disk with a few large fragments
> close together), except logrotate gets better compression.
>
> To be more accurate, the example I gave above is the worst case you
> can expect from normal user workloads.  If I throw in some reflinks
> and snapshots, I can make it arbitrarily worse, until the entire disk
> is consumed by the metadata update of a single extent defrag.
>

I can't believe I am considering this case.

So, we have a 1 TB log file "ultralog" split into 256 million 4 KB  
extents scattered randomly over the entire disk. We have 512 GB of  
free RAM and 2% free disk space. The file needs to be defragmented.

In order to do that, the defrag needs to be able to copy-move  
multiple extents in one batch, and then update the metadata.

The metadata has a total of at least 256 million entries, each of  
some size, but each one should hold at least a pointer to the extent  
(8 bytes) and a checksum (8 bytes); in reality, there is probably a  
lot of other data per entry.

The metadata is organized as a b-tree. Therefore, nearby nodes should  
contain data of consecutive file extents.

The trick, in this case, is to select one part of "ultralog" which is  
localized in the metadata, and defragment it. Repeating this step will  
ultimately defragment the entire file.

So, the defrag selects some part of the metadata which is entirely a  
descendant of some b-tree node not far from the bottom of the b-tree.  
It selects it such that the required update to the metadata is less  
than, let's say, 64 MB, and simultaneously the affected "ultralog"  
file fragments total less than 512 MB (therefore, less than 128  
thousand metadata leaf entries, each pointing to a 4 KB fragment).  
Then it finds all the file extents pointed to by that part of the  
metadata. They are consecutive (as file fragments), because we have  
selected such a part of the metadata. Now the defrag can safely  
copy-move those fragments to a new area and update the metadata.

In order to quickly select that small part of the metadata, the  
defrag needs a metadata cache that can hold somewhat more than 128  
thousand localized metadata leaf entries. That definitely fits into  
128 MB of RAM.

Of course, there are many other small issues there, but this outlines  
the general procedure.
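
As a rough illustration of how those batches would be cut, here is a
toy calculation using the budgets from the text (only the 512 MB data
budget and the ~128 thousand leaf-entry budget are modelled; the 64 MB
metadata-update budget and the b-tree walk itself are not; this is a
model, not btrfs code):

/* Walk the file's extents in logical order and cut a batch whenever
 * either the data budget or the leaf-entry budget would be exceeded. */
#include <stdio.h>
#include <stdint.h>

#define DATA_BUDGET  (512ULL << 20)   /* move at most 512 MB of data per batch */
#define ENTRY_BUDGET (128 * 1024)     /* touch at most ~128K leaf entries      */

int main(void)
{
    const uint64_t extent_size = 4096;            /* worst case: 4 KB extents */
    const uint64_t file_size   = 1ULL << 40;      /* the 1 TB "ultralog"      */
    const uint64_t nr_extents  = file_size / extent_size;

    uint64_t batches = 0, batch_bytes = 0, batch_entries = 0;

    for (uint64_t i = 0; i < nr_extents; i++) {
        if (batch_bytes + extent_size > DATA_BUDGET ||
            batch_entries + 1 > ENTRY_BUDGET) {
            batches++;                            /* defrag + commit this batch */
            batch_bytes = batch_entries = 0;
        }
        batch_bytes   += extent_size;
        batch_entries += 1;
    }
    batches++;                                    /* final partial batch */
    printf("1 TB of 4 KB extents -> %llu defrag batches\n",
           (unsigned long long)batches);
    return 0;
}

With 4 KB extents the two budgets coincide, so the file would be
processed in 2048 batches of 512 MB each.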

Problem solved?



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  3:12                                       ` Zygo Blaxell
  2019-09-13  5:05                                         ` General Zed
@ 2019-09-13  5:22                                         ` General Zed
  2019-09-13  6:16                                         ` General Zed
                                                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-13  5:22 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> You mean: all metadata size is 156 GB on one of your systems. However, you
>> don't typically have to put ALL metadata in RAM.
>> You need just some parts needed for defrag operation. So, for defrag, what
>> you really need is just some large metadata cache present in RAM. I would
>> say that if such a metadata cache is using 128 MB (for 2 TB disk) to 2 GB
>> (for 156 GB disk), than the defrag will run sufficiently fast.
>
> You're missing something (metadata requirement for delete?) in those
> estimates.
>
> Total metadata size does not affect how much metadata cache you need
> to defragment one extent quickly.  That number is a product of factors
> including input and output and extent size ratio, the ratio of various
> metadata item sizes to the metadata page size, and the number of trees you
> have to update (number of reflinks + 3 for extent, csum, and free space
> trees).
>
> It follows from the above that if you're joining just 2 unshared extents
> together, the total metadata required is well under a MB.
>
> If you're defragging a 128MB journal file with 32768 4K extents, it can
> create several GB of new metadata and spill out of RAM cache (which is
> currently capped at 512MB for assorted reasons).  Add reflinks and you
> might need more cache, or take a performance hit.  Yes, a GB might be
> the total size of all your metadata, but if you run defrag on a 128MB
> log file you could rewrite all of your filesystem's metadata in a single
> transaction (almost...you probably won't need to update the device or
> uuid trees).
>
> If you want to pipeline multiple extents per commit to avoid seeking,
> you need to multiply the above numbers by the size of the pipeline.
>
> You can also reduce the metadata cache requirement by reducing the output
> extent size.  A 16MB target extent size requires only 64MB of cache for
> the logfile case.
>
>> > Don't forget you have to write new checksum and free space tree pages.
>> > In the worst case, you'll need about 1GB of new metadata pages for each
>> > 128MB you defrag (though you get to delete 99.5% of them immediately
>> > after).
>>
>> Yes, here we are debating some worst-case scenaraio which is actually
>> imposible in practice due to various reasons.
>
> No, it's quite possible.  A log file written slowly on an active
> filesystem above a few TB will do that accidentally.  Every now and then
> I hit that case.  It can take several hours to do a logrotate on spinning
> arrays because of all the metadata fetches and updates associated with
> worst-case file delete.  Long enough to watch the delete happen, and
> even follow along in the source code.
>
> I guess if I did a proactive defrag every few hours, it might take less
> time to do the logrotate, but that would mean spreading out all the
> seeky IO load during the day instead of getting it all done at night.
> Logrotate does the same job as defrag in this case (replacing a file in
> thousands of fragments spread across the disk with a few large fragments
> close together), except logrotate gets better compression.
>
> To be more accurate, the example I gave above is the worst case you
> can expect from normal user workloads.  If I throw in some reflinks
> and snapshots, I can make it arbitrarily worse, until the entire disk
> is consumed by the metadata update of a single extent defrag.
>

In fact, I overcomplicated it in my previous answer.

So, we have a 1 TB log file "ultralog" split into 256 million 4 KB  
extents scattered randomly over the entire disk. We have 512 GB of  
free RAM and 2% free disk space. The file needs to be defragmented.

We select some (any) consecutive 512 MB of file segments. They are  
certainly localized in the metadata, because we are talking about an  
ordered b-tree. We write those 512 MB of file extents to another  
place on the partition, defragmented (without updating the b-tree  
yet). Then the defrag calculates the fuse (merge) operation on those  
written extents. Then it calculates which metadata updates are  
necessary. Since we have selected (at the start) consecutive 512 MB  
of file segments, the updates to the metadata are certainly  
localized. The defrag writes out, in new extents, the required  
changes to the metadata, then updates the super to commit.
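
The reason an interruption is harmless here is the ordering: new data
and new metadata go into free space first, and only the final super
update makes them visible. A toy model of that copy-on-write commit
(purely illustrative, not btrfs code):

/* Until the "super" is flipped, readers still see the old tree, so a
 * crash at any earlier point loses nothing. */
#include <stdio.h>
#include <string.h>

#define NR_BLOCKS 8

static char blocks[NR_BLOCKS][16];   /* pretend disk: 8 metadata blocks  */
static int  super_root;              /* "super": which block is the root */

static void commit(int new_root) { super_root = new_root; } /* atomic flip */

int main(void)
{
    strcpy(blocks[0], "old tree");   /* existing committed state */

    /* step 1: write defragged data and new metadata into unused blocks */
    strcpy(blocks[1], "new tree");

    /* a crash here is harmless: the super still points at the old tree */
    printf("before commit, root reads: %s\n", blocks[super_root]);

    /* step 2: the final commit updates the super to the new tree */
    commit(1);
    printf("after commit, root reads:  %s\n", blocks[super_root]);
    return 0;
}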

Easy.


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  3:12                                       ` Zygo Blaxell
  2019-09-13  5:05                                         ` General Zed
  2019-09-13  5:22                                         ` General Zed
@ 2019-09-13  6:16                                         ` General Zed
  2019-09-13  6:58                                         ` General Zed
  2019-09-13  7:51                                         ` General Zed
  4 siblings, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-13  6:16 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> You mean: all metadata size is 156 GB on one of your systems. However, you
>> don't typically have to put ALL metadata in RAM.
>> You need just some parts needed for defrag operation. So, for defrag, what
>> you really need is just some large metadata cache present in RAM. I would
>> say that if such a metadata cache is using 128 MB (for 2 TB disk) to 2 GB
>> (for 156 GB disk), than the defrag will run sufficiently fast.
>
> You're missing something (metadata requirement for delete?) in those
> estimates.
>
> Total metadata size does not affect how much metadata cache you need
> to defragment one extent quickly.  That number is a product of factors
> including input and output and extent size ratio, the ratio of various
> metadata item sizes to the metadata page size, and the number of trees you
> have to update (number of reflinks + 3 for extent, csum, and free space
> trees).
>
> It follows from the above that if you're joining just 2 unshared extents
> together, the total metadata required is well under a MB.
>
> If you're defragging a 128MB journal file with 32768 4K extents, it can
> create several GB of new metadata and spill out of RAM cache (which is
> currently capped at 512MB for assorted reasons).  Add reflinks and you
> might need more cache, or take a performance hit.  Yes, a GB might be
> the total size of all your metadata, but if you run defrag on a 128MB
> log file you could rewrite all of your filesystem's metadata in a single
> transaction (almost...you probably won't need to update the device or
> uuid trees).

I can't see how that can happen. If you are defragmenting a single  
128 MB journal file, the metadata that points to it is certainly a  
small part of the entire b-tree (because the tree is ordered). Even  
if that part of the b-tree changes completely, all the way up to the  
super, the entire update of the b-tree (written into new extents)  
can't be more than a tenth of the file size (128 MB). So, there is no  
big overhead.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  3:12                                       ` Zygo Blaxell
                                                           ` (2 preceding siblings ...)
  2019-09-13  6:16                                         ` General Zed
@ 2019-09-13  6:58                                         ` General Zed
  2019-09-13  9:25                                           ` General Zed
  2019-09-13  7:51                                         ` General Zed
  4 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-13  6:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>> > >
>> > > At worst, it just has to completely write-out "all metadata",  
>> all the way up
>> > > to the super. It needs to be done just once, because what's the point of
>> > > writing it 10 times over? Then, the super is updated as the  
>> final commit.
>> >
>> > This is kind of a silly discussion.  The biggest extent possible on
>> > btrfs is 128MB, and the incremental gains of forcing 128MB extents to
>> > be consecutive are negligible.  If you're defragging a 10GB file, you're
>> > just going to end up doing 80 separate defrag operations.
>>
>> Ok, then the max extent is 128 MB, that's fine. Someone here previously said
>> that it is 2 GB, so he has disinformed me (in order to further his false
>> argument).
>
> If the 128MB limit is removed, you then hit the block group size limit,
> which is some number of GB from 1 to 10 depending on number of disks
> available and raid profile selection (the striping raid profiles cap
> block group sizes at 10 disks, and single/raid1 profiles always use 1GB
> block groups regardless of disk count).  So 2GB is _also_ a valid extent
> size limit, just not the first limit that is relevant for defrag.
>
> A lot of people get confused by 'filefrag -v' output, which coalesces
> physically adjacent but distinct extents.  So if you use that tool,
> it can _seem_ like there is a 2.5GB extent in a file, but it is really
> 20 distinct 128MB extents that start and end at adjacent addresses.
> You can see the true structure in 'btrfs ins dump-tree' output.
>
> That also brings up another reason why 10GB defrags are absurd on btrfs:
> extent addresses are virtual.  There's no guarantee that a pair of extents
> that meet at a block group boundary are physically adjacent, and after
> operations like RAID array reorganization or free space defragmentation,
> they are typically quite far apart physically.
>
>> I didn't ever said that I would force extents larger than 128 MB.
>>
>> If you are defragging a 10 GB file, you'll likely have to do it in 10 steps,
>> because the defrag is usually allowed to only use a limited amount of disk
>> space while in operation. That has nothing to do with the extent size.
>
> Defrag is literally manipulating the extent size.  Fragments and extents
> are the same thing in btrfs.
>
> Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
> commit metadata updates after each step, so more than 128MB of temporary
> space may be used (especially if your disks are fast and empty,
> and you start just after the end of the previous commit interval).
> There are some opportunities to coalsce metadata updates, occupying up
> to a (arbitrary) limit of 512MB of RAM (or when memory pressure forces
> a flush, whichever comes first), but exploiting those opportunities
> requires more space for uncommitted data.
>
> If the filesystem starts to get low on space during a defrag, it can
> inject commits to force metadata updates to happen more often, which
> reduces the amount of temporary space needed (we can't delete the original
> fragmented extents until their replacement extent is committed); however,
> if the filesystem is so low on space that you're worried about running
> out during a defrag, then you probably don't have big enough contiguous
> free areas to relocate data into anyway, i.e. the defrag is just going to
> push data from one fragmented location to a different fragmented location,
> or bail out with "sorry, can't defrag that."

Nope.

Each defrag "cycle" consists of two parts:
      1) move-out part
      2) move-in part

The move-out part selects one contiguous area of the disk. Almost any
area will do, but some smart choices are better. It then moves all data
out of that contiguous area into whatever holes are left empty elsewhere
on the disk. The biggest problem is actually updating the metadata, since
the updates are not localized.
Anyway, this part can even be skipped.

The move-in part now populates the completely free contiguous area  
with defragmented data.

In the case that the move-out part needs to be skipped because the
defrag estimates that the metadata update will be too big (like in the
pathological case of a disk with 156 GB of metadata), it can still
successfully defrag by performing only the move-in part. In that case,
the move-in area is not free of data, so the "defragmented" data won't
be fully defragmented. Also, there should be at least 20% free disk
space in this case in order to avoid the defrag turning pathological.

But, these are all some pathological cases. They should be considered  
in some other discussion.
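
To make the cycle above concrete, here is a minimal sketch in C of the
two phases. All helper names and numbers are made up for illustration;
none of this is an existing btrfs interface.

    #include <stdio.h>

    struct area { unsigned long long start; unsigned long long len; };

    /* Placeholder: choose a contiguous target area on the disk. */
    static struct area pick_target_area(void)
    {
        return (struct area){ .start = 0, .len = 128ULL << 20 };
    }

    /* Placeholder: estimated size of the metadata update the move-out needs. */
    static unsigned long long moveout_metadata_cost(struct area a)
    {
        (void)a;
        return 16ULL << 20;
    }

    static void relocate_extents_out(struct area a)
    {
        printf("move-out: emptying %llu bytes at %llu\n", a.len, a.start);
    }

    static void write_defragged_data_in(struct area a)
    {
        printf("move-in: filling %llu bytes at %llu\n", a.len, a.start);
    }

    /* One defrag cycle: optional move-out, then move-in. */
    static void defrag_cycle(unsigned long long max_metadata_update)
    {
        struct area target = pick_target_area();

        /* Phase 1 (optional): push the current contents of the target area
         * into holes elsewhere on the disk.  Skipped when the estimated
         * metadata update would be too large. */
        if (moveout_metadata_cost(target) <= max_metadata_update)
            relocate_extents_out(target);

        /* Phase 2: fill the (now mostly free) area with defragmented data. */
        write_defragged_data_in(target);
    }

    int main(void)
    {
        defrag_cycle(128ULL << 20);   /* cap each metadata update at ~128 MB */
        return 0;
    }

The point is only the control flow: move-out is optional and is skipped
when its estimated metadata update is too large, while move-in always runs.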



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  3:12                                       ` Zygo Blaxell
                                                           ` (3 preceding siblings ...)
  2019-09-13  6:58                                         ` General Zed
@ 2019-09-13  7:51                                         ` General Zed
  4 siblings, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-13  7:51 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>> > >
>> > > Quoting Chris Murphy <lists@colorremedies.com>:
>> > >
>> > > > On Thu, Sep 12, 2019 at 3:34 PM General Zed  
>> <general-zed@zedlx.com> wrote:
>> > > > >
>> > > > >
>> > > > > Quoting Chris Murphy <lists@colorremedies.com>:
>> > > > >
>> > > > > > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>> > > > > >>
>> > > > > >> It is normal and common for defrag operation to use some  
>> disk space
>> > > > > >> while it is running. I estimate that a reasonable limit  
>> would be to
>> > > > > >> use up to 1% of total partition size. So, if a partition  
>> size is 100
>> > > > > >> GB, the defrag can use 1 GB. Lets call this "defrag  
>> operation space".
>> > > > > >
>> > > > > > The simplest case of a file with no shared extents, the  
>> minimum free
>> > > > > > space should be set to the potential maximum rewrite of  
>> the file, i.e.
>> > > > > > 100% of the file size. Since Btrfs is COW, the entire  
>> operation must
>> > > > > > succeed or fail, no possibility of an ambiguous in  
>> between state, and
>> > > > > > this does apply to defragment.
>> > > > > >
>> > > > > > So if you're defragging a 10GiB file, you need 10GiB minimum free
>> > > > > > space to COW those extents to a new, mostly contiguous,  
>> set of exents,
>> > > > >
>> > > > > False.
>> > > > >
>> > > > > You can defragment just 1 GB of that file, and then just  
>> write out to
>> > > > > disk (in new extents) an entire new version of b-trees.
>> > > > > Of course, you don't really need to do all that, as usually only a
>> > > > > small part of the b-trees need to be updated.
>> > > >
>> > > > The `-l` option allows the user to choose a maximum amount to
>> > > > defragment. Setting up a default defragment behavior that has a
>> > > > variable outcome is not idempotent and probably not a good idea.
>> > >
>> > > We are talking about a future, imagined defrag. It has no -l  
>> option yet, as
>> > > we haven't discussed it yet.
>> > >
>> > > > As for kernel behavior, it presumably could defragment in portions,
>> > > > but it would have to completely update all affected metadata after
>> > > > each e.g. 1GiB section, translating into 10 separate rewrites of file
>> > > > metadata, all affected nodes, all the way up the tree to the super.
>> > > > There is no such thing as metadata overwrites in Btrfs. You're
>> > > > familiar with the wandering trees problem?
>> > >
>> > > No, but it doesn't matter.
>> > >
>> > > At worst, it just has to completely write-out "all metadata",  
>> all the way up
>> > > to the super. It needs to be done just once, because what's the point of
>> > > writing it 10 times over? Then, the super is updated as the  
>> final commit.
>> >
>> > This is kind of a silly discussion.  The biggest extent possible on
>> > btrfs is 128MB, and the incremental gains of forcing 128MB extents to
>> > be consecutive are negligible.  If you're defragging a 10GB file, you're
>> > just going to end up doing 80 separate defrag operations.
>>
>> Ok, then the max extent is 128 MB, that's fine. Someone here previously said
>> that it is 2 GB, so he has disinformed me (in order to further his false
>> argument).
>
> If the 128MB limit is removed, you then hit the block group size limit,
> which is some number of GB from 1 to 10 depending on number of disks
> available and raid profile selection (the striping raid profiles cap
> block group sizes at 10 disks, and single/raid1 profiles always use 1GB
> block groups regardless of disk count).  So 2GB is _also_ a valid extent
> size limit, just not the first limit that is relevant for defrag.
>
> A lot of people get confused by 'filefrag -v' output, which coalesces
> physically adjacent but distinct extents.  So if you use that tool,
> it can _seem_ like there is a 2.5GB extent in a file, but it is really
> 20 distinct 128MB extents that start and end at adjacent addresses.
> You can see the true structure in 'btrfs ins dump-tree' output.
>
> That also brings up another reason why 10GB defrags are absurd on btrfs:
> extent addresses are virtual.  There's no guarantee that a pair of extents
> that meet at a block group boundary are physically adjacent, and after
> operations like RAID array reorganization or free space defragmentation,
> they are typically quite far apart physically.
>
>> I didn't ever said that I would force extents larger than 128 MB.
>>
>> If you are defragging a 10 GB file, you'll likely have to do it in 10 steps,
>> because the defrag is usually allowed to only use a limited amount of disk
>> space while in operation. That has nothing to do with the extent size.
>
> Defrag is literally manipulating the extent size.  Fragments and extents
> are the same thing in btrfs.
>
> Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
> commit metadata updates after each step, so more than 128MB of temporary
> space may be used (especially if your disks are fast and empty,
> and you start just after the end of the previous commit interval).
> There are some opportunities to coalsce metadata updates, occupying up
> to a (arbitrary) limit of 512MB of RAM (or when memory pressure forces
> a flush, whichever comes first), but exploiting those opportunities
> requires more space for uncommitted data.
>
> If the filesystem starts to get low on space during a defrag, it can
> inject commits to force metadata updates to happen more often, which
> reduces the amount of temporary space needed (we can't delete the original
> fragmented extents until their replacement extent is committed); however,
> if the filesystem is so low on space that you're worried about running
> out during a defrag, then you probably don't have big enough contiguous
> free areas to relocate data into anyway, i.e. the defrag is just going to
> push data from one fragmented location to a different fragmented location,
> or bail out with "sorry, can't defrag that."

If the filesystem starts to get low on space during a defrag, it should
abort and notify the user. The only question is: how low an amount of
free space can be tolerated?

Forcing commits too often (and having a smaller operation area / move-in
area) increases the number of metadata updates.

Technically, you don't really need big enough contiguous free areas;
those areas can be quite 'dirty' and the defrag will still work, albeit
at a slower pace.

The question you are posing here is really about the minimal free space
required in order not to slow down the defrag significantly.
Unfortunately, there is no simple answer for how to calculate that
minimal free space. It will take some experimentation and experience.

Certainly, a good idea would be to give the user some options.

For example, if the defrag estimates that it is twice as slow as it
could be due to low free space, then it should proceed only if the user
has supplied the option --lowFreeSpace. If the defrag estimates that it
is six times as slow as it could be due to low free space, then it
should proceed only if the user has supplied the option
--veryLowFreeSpace.
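
As a rough sketch of that gating (the 2x and 6x thresholds and the
option names are the ones proposed above; estimate_slowdown_factor() is
a made-up placeholder, not an existing function):

    #include <stdbool.h>
    #include <stdio.h>

    /* Placeholder: how much slower the defrag is expected to run
     * because of low free space (1.0 = no slowdown). */
    static double estimate_slowdown_factor(void)
    {
        return 3.0;
    }

    static bool may_proceed(bool opt_low_free_space, bool opt_very_low_free_space)
    {
        double slowdown = estimate_slowdown_factor();

        if (slowdown >= 6.0)
            return opt_very_low_free_space;   /* requires --veryLowFreeSpace */
        if (slowdown >= 2.0)
            return opt_low_free_space;        /* requires --lowFreeSpace */
        return true;                          /* enough free space, just run */
    }

    int main(void)
    {
        printf("proceed: %s\n", may_proceed(true, false) ? "yes" : "no");
        return 0;
    }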




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  6:58                                         ` General Zed
@ 2019-09-13  9:25                                           ` General Zed
  2019-09-13 17:02                                             ` General Zed
  2019-09-14  0:59                                             ` Zygo Blaxell
  0 siblings, 2 replies; 111+ messages in thread
From: General Zed @ 2019-09-13  9:25 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting General Zed <general-zed@zedlx.com>:

> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>
>> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>>
>>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>
>>>> On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>>>> >
>>>> > At worst, it just has to completely write-out "all metadata",  
>>>> all the way up
>>>> > to the super. It needs to be done just once, because what's the point of
>>>> > writing it 10 times over? Then, the super is updated as the  
>>>> final commit.
>>>>
>>>> This is kind of a silly discussion.  The biggest extent possible on
>>>> btrfs is 128MB, and the incremental gains of forcing 128MB extents to
>>>> be consecutive are negligible.  If you're defragging a 10GB file, you're
>>>> just going to end up doing 80 separate defrag operations.
>>>
>>> Ok, then the max extent is 128 MB, that's fine. Someone here  
>>> previously said
>>> that it is 2 GB, so he has disinformed me (in order to further his false
>>> argument).
>>
>> If the 128MB limit is removed, you then hit the block group size limit,
>> which is some number of GB from 1 to 10 depending on number of disks
>> available and raid profile selection (the striping raid profiles cap
>> block group sizes at 10 disks, and single/raid1 profiles always use 1GB
>> block groups regardless of disk count).  So 2GB is _also_ a valid extent
>> size limit, just not the first limit that is relevant for defrag.
>>
>> A lot of people get confused by 'filefrag -v' output, which coalesces
>> physically adjacent but distinct extents.  So if you use that tool,
>> it can _seem_ like there is a 2.5GB extent in a file, but it is really
>> 20 distinct 128MB extents that start and end at adjacent addresses.
>> You can see the true structure in 'btrfs ins dump-tree' output.
>>
>> That also brings up another reason why 10GB defrags are absurd on btrfs:
>> extent addresses are virtual.  There's no guarantee that a pair of extents
>> that meet at a block group boundary are physically adjacent, and after
>> operations like RAID array reorganization or free space defragmentation,
>> they are typically quite far apart physically.
>>
>>> I didn't ever said that I would force extents larger than 128 MB.
>>>
>>> If you are defragging a 10 GB file, you'll likely have to do it in  
>>> 10 steps,
>>> because the defrag is usually allowed to only use a limited amount of disk
>>> space while in operation. That has nothing to do with the extent size.
>>
>> Defrag is literally manipulating the extent size.  Fragments and extents
>> are the same thing in btrfs.
>>
>> Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
>> commit metadata updates after each step, so more than 128MB of temporary
>> space may be used (especially if your disks are fast and empty,
>> and you start just after the end of the previous commit interval).
>> There are some opportunities to coalsce metadata updates, occupying up
>> to a (arbitrary) limit of 512MB of RAM (or when memory pressure forces
>> a flush, whichever comes first), but exploiting those opportunities
>> requires more space for uncommitted data.
>>
>> If the filesystem starts to get low on space during a defrag, it can
>> inject commits to force metadata updates to happen more often, which
>> reduces the amount of temporary space needed (we can't delete the original
>> fragmented extents until their replacement extent is committed); however,
>> if the filesystem is so low on space that you're worried about running
>> out during a defrag, then you probably don't have big enough contiguous
>> free areas to relocate data into anyway, i.e. the defrag is just going to
>> push data from one fragmented location to a different fragmented location,
>> or bail out with "sorry, can't defrag that."
>
> Nope.
>
> Each defrag "cycle" consists of two parts:
>      1) move-out part
>      2) move-in part
>
> The move-out part select one contiguous area of the disk. Almost any  
> area will do, but some smart choices are better. It then moves-out  
> all data from that contiguous area into whatever holes there are  
> left empty on the disk. The biggest problem is actually updating the  
> metadata, since the updates are not localized.
> Anyway, this part can even be skipped.
>
> The move-in part now populates the completely free contiguous area  
> with defragmented data.
>
> In the case that the move-out part needs to be skipped because the  
> defrag estimates that the update to metatada will be too big (like  
> in the pathological case of a disk with 156 GB of metadata), it can  
> sucessfully defrag by performing only the move-in part. In that  
> case, the move-in area is not free of data and "defragmented" data  
> won't be fully defragmented. Also, there should be at least 20% free  
> disk space in this case in order to avoid defrag turning pathological.
>
> But, these are all some pathological cases. They should be  
> considered in some other discussion.

I know how to do this pathological case. Figured it out!

Yeah, always ask General Zed, he knows best!!!

The move-in phase is not a problem, because this phase generally
affects only a small number of files.

So, let's consider the move-out phase. The main concern here is that  
the move-out area may contain so many different files and fragments  
that the move-out forces a practically undoable metadata update.

So, the way to do it is to select files for move-out one by one (or
even more granularly, by fragments of files), while keeping track of
the size of the necessary metadata update. When the metadata update
exceeds a certain amount (let's say 128 MB, an amount that can easily
fit into RAM), the move-out is performed with only the currently
selected files (file fragments). (The move-out often doesn't affect a
whole file, since only a part of each file lies within the move-out
area.) A sketch of this batching loop follows at the end of this message.

Now the defrag has to decide whether to continue with another round of
move-out to get a cleaner move-in area (by repeating the same procedure
above), or to continue with a move-in into a partially dirty area. I
can't tell you which is better right now, as this can be determined
only by experiments.

Lastly, the move-in phase is performed (it can be done whether the
move-in area is dirty or completely clean). Again, the same trick can
be used: files are selected one by one until the calculated metadata
update exceeds 128 MB. However, it is more likely that the space in the
move-in area will be exhausted before this happens.

This algorithm will work even if you have only 3% free disk space left.

This algorithm will also work if you have metadata of huge size, but  
in that case it is better to have much more free disk space (20%) to  
avoid significantly slowing down the defrag operation.
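
Here is a minimal C sketch of the batching loop described above. The
fragment list, the per-fragment metadata cost estimate and the 128 MB
cap are all illustrative assumptions, not existing btrfs code.

    #include <stdio.h>

    #define METADATA_BATCH_CAP (128ULL << 20)   /* ~128 MB, fits easily in RAM */

    struct fragment {
        int id;
        unsigned long long metadata_cost;   /* estimated metadata update size */
    };

    static void move_out_batch(const struct fragment *batch, int n)
    {
        printf("move-out batch: %d fragments starting at fragment %d\n",
               n, batch[0].id);
    }

    /* Select fragments one by one; whenever the estimated metadata update
     * for the current selection would exceed the cap, move out what has
     * been selected so far and start a new batch. */
    static void move_out_area(const struct fragment *frags, int nfrags)
    {
        unsigned long long batch_cost = 0;
        int batch_start = 0;

        for (int i = 0; i < nfrags; i++) {
            if (i > batch_start &&
                batch_cost + frags[i].metadata_cost > METADATA_BATCH_CAP) {
                move_out_batch(&frags[batch_start], i - batch_start);
                batch_start = i;
                batch_cost = 0;
            }
            batch_cost += frags[i].metadata_cost;
        }
        if (batch_start < nfrags)
            move_out_batch(&frags[batch_start], nfrags - batch_start);
    }

    int main(void)
    {
        struct fragment frags[] = {
            { 1, 60ULL << 20 }, { 2, 50ULL << 20 }, { 3, 40ULL << 20 },
        };
        move_out_area(frags, 3);
        return 0;
    }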



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 23:54                                   ` Zygo Blaxell
  2019-09-13  0:26                                     ` General Zed
@ 2019-09-13 11:04                                     ` Austin S. Hemmelgarn
  2019-09-13 20:43                                       ` Zygo Blaxell
  2019-09-14 18:29                                       ` Chris Murphy
  1 sibling, 2 replies; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-13 11:04 UTC (permalink / raw)
  To: Zygo Blaxell, General Zed; +Cc: Chris Murphy, Btrfs BTRFS

On 2019-09-12 19:54, Zygo Blaxell wrote:
> On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>>
>> Quoting Chris Murphy <lists@colorremedies.com>:
>>
>>> On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
>>>>
>>>>
>>>> Quoting Chris Murphy <lists@colorremedies.com>:
>>>>
>>>>> On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>>>>>>
>>>>>> It is normal and common for defrag operation to use some disk space
>>>>>> while it is running. I estimate that a reasonable limit would be to
>>>>>> use up to 1% of total partition size. So, if a partition size is 100
>>>>>> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
>>>>>
>>>>> The simplest case of a file with no shared extents, the minimum free
>>>>> space should be set to the potential maximum rewrite of the file, i.e.
>>>>> 100% of the file size. Since Btrfs is COW, the entire operation must
>>>>> succeed or fail, no possibility of an ambiguous in between state, and
>>>>> this does apply to defragment.
>>>>>
>>>>> So if you're defragging a 10GiB file, you need 10GiB minimum free
>>>>> space to COW those extents to a new, mostly contiguous, set of exents,
>>>>
>>>> False.
>>>>
>>>> You can defragment just 1 GB of that file, and then just write out to
>>>> disk (in new extents) an entire new version of b-trees.
>>>> Of course, you don't really need to do all that, as usually only a
>>>> small part of the b-trees need to be updated.
>>>
>>> The `-l` option allows the user to choose a maximum amount to
>>> defragment. Setting up a default defragment behavior that has a
>>> variable outcome is not idempotent and probably not a good idea.
>>
>> We are talking about a future, imagined defrag. It has no -l option yet, as
>> we haven't discussed it yet.
>>
>>> As for kernel behavior, it presumably could defragment in portions,
>>> but it would have to completely update all affected metadata after
>>> each e.g. 1GiB section, translating into 10 separate rewrites of file
>>> metadata, all affected nodes, all the way up the tree to the super.
>>> There is no such thing as metadata overwrites in Btrfs. You're
>>> familiar with the wandering trees problem?
>>
>> No, but it doesn't matter.
>>
>> At worst, it just has to completely write-out "all metadata", all the way up
>> to the super. It needs to be done just once, because what's the point of
>> writing it 10 times over? Then, the super is updated as the final commit.
> 
> This is kind of a silly discussion.  The biggest extent possible on
> btrfs is 128MB, and the incremental gains of forcing 128MB extents to
> be consecutive are negligible.  If you're defragging a 10GB file, you're
> just going to end up doing 80 separate defrag operations.
Do you have a source for this claim of a 128MB max extent size?  Because 
everything I've seen indicates the max extent size is a full data chunk 
(so 1GB for the common case, potentially up to about 5GB for really big 
filesystems).
> 
> 128MB is big enough you're going to be seeking in the middle of reading
> an extent anyway.  Once you have the file arranged in 128MB contiguous
> fragments (or even a tenth of that on medium-fast spinning drives),
> the job is done.
> 
>> On my comouter the ENTIRE METADATA is 1 GB. That would be very tolerable and
>> doable.
> 
> You must have a small filesystem...mine range from 16 to 156GB, a bit too
> big to fit in RAM comfortably.
> 
> Don't forget you have to write new checksum and free space tree pages.
> In the worst case, you'll need about 1GB of new metadata pages for each
> 128MB you defrag (though you get to delete 99.5% of them immediately
> after).
> 
>> But that is a very bad case, because usually not much metadata has to be
>> updated or written out to disk.
>>
>> So, there is no problem.
>>
>>


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 22:57                                 ` General Zed
  2019-09-12 23:54                                   ` Zygo Blaxell
@ 2019-09-13 11:09                                   ` Austin S. Hemmelgarn
  2019-09-13 17:20                                     ` General Zed
  1 sibling, 1 reply; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-13 11:09 UTC (permalink / raw)
  To: General Zed, Chris Murphy; +Cc: Btrfs BTRFS

On 2019-09-12 18:57, General Zed wrote:
> 
> Quoting Chris Murphy <lists@colorremedies.com>:
> 
>> On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> 
>> wrote:
>>>
>>>
>>> Quoting Chris Murphy <lists@colorremedies.com>:
>>>
>>> > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>>> >>
>>> >> It is normal and common for defrag operation to use some disk space
>>> >> while it is running. I estimate that a reasonable limit would be to
>>> >> use up to 1% of total partition size. So, if a partition size is 100
>>> >> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
>>> >
>>> > The simplest case of a file with no shared extents, the minimum free
>>> > space should be set to the potential maximum rewrite of the file, i.e.
>>> > 100% of the file size. Since Btrfs is COW, the entire operation must
>>> > succeed or fail, no possibility of an ambiguous in between state, and
>>> > this does apply to defragment.
>>> >
>>> > So if you're defragging a 10GiB file, you need 10GiB minimum free
>>> > space to COW those extents to a new, mostly contiguous, set of exents,
>>>
>>> False.
>>>
>>> You can defragment just 1 GB of that file, and then just write out to
>>> disk (in new extents) an entire new version of b-trees.
>>> Of course, you don't really need to do all that, as usually only a
>>> small part of the b-trees need to be updated.
>>
>> The `-l` option allows the user to choose a maximum amount to
>> defragment. Setting up a default defragment behavior that has a
>> variable outcome is not idempotent and probably not a good idea.
> 
> We are talking about a future, imagined defrag. It has no -l option yet, 
> as we haven't discussed it yet.
> 
>> As for kernel behavior, it presumably could defragment in portions,
>> but it would have to completely update all affected metadata after
>> each e.g. 1GiB section, translating into 10 separate rewrites of file
>> metadata, all affected nodes, all the way up the tree to the super.
>> There is no such thing as metadata overwrites in Btrfs. You're
>> familiar with the wandering trees problem?
> 
> No, but it doesn't matter.
No, it does matter.  Each time you update metadata, you have to update 
_the entire tree up to the tree root_.  Even if you batch your updates, 
you still have to propagate the update all the way up to the root of the 
tree.
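
To put a very rough number on that, here is a toy calculation of the
path-to-root cost of a CoW b-tree update (the fanout and leaf count are
made-up assumptions, not measurements of any real filesystem):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double fanout = 121.0;   /* assumed average number of items per node */
        double leaves = 4.0e6;   /* assumed number of leaf blocks in the tree */
        double depth  = ceil(log(leaves) / log(fanout)) + 1.0;

        printf("tree depth: about %.0f levels\n", depth);
        printf("one leaf update rewrites up to %.0f metadata blocks "
               "(the CoW path to the root)\n", depth);
        return 0;
    }

Batching helps because updates that touch the same leaf share the same
path to the root, but the propagation itself cannot be avoided.
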
> 
> At worst, it just has to completely write-out "all metadata", all the 
> way up to the super. It needs to be done just once, because what's the 
> point of writing it 10 times over? Then, the super is updated as the 
> final commit.
> 
> On my comouter the ENTIRE METADATA is 1 GB. That would be very tolerable 
> and doable.
You sound like you're dealing with a desktop use case.  It's not unusual 
for very large arrays (double digit TB or larger) to have metadata well 
into the hundreds of GB.  Hell, I've got a 200GB volume with bunches of 
small files that's got almost 5GB of metadata space used.
> 
> But that is a very bad case, because usually not much metadata has to be 
> updated or written out to disk.

> 
> So, there is no problem.
> 
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 22:21                             ` General Zed
@ 2019-09-13 11:53                               ` Austin S. Hemmelgarn
  2019-09-13 16:54                                 ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-13 11:53 UTC (permalink / raw)
  To: General Zed; +Cc: linux-btrfs

On 2019-09-12 18:21, General Zed wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
>> On 2019-09-12 15:18, webmaster@zedlx.com wrote:
>>>
>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>
>>>> On 2019-09-11 17:37, webmaster@zedlx.com wrote:
>>>>>
>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>
>>>>>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>>>>>
>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>
>>>>>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>>>>>
>>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>>
>>>>>
>>>>>>>> Given this, defrag isn't willfully unsharing anything, it's just 
>>>>>>>> a side-effect of how it works (since it's rewriting the block 
>>>>>>>> layout of the file in-place).
>>>>>>>
>>>>>>> The current defrag has to unshare because, as you said, because 
>>>>>>> it is unaware of the full reflink structure. If it doesn't know 
>>>>>>> about all reflinks, it has to unshare, there is no way around that.
>>>>>>>
>>>>>>>> Now factor in that _any_ write will result in unsharing the 
>>>>>>>> region being written to, rounded to the nearest full filesystem 
>>>>>>>> block in both directions (this is mandatory, it's a side effect 
>>>>>>>> of the copy-on-write nature of BTRFS, and is why files that 
>>>>>>>> experience heavy internal rewrites get fragmented very heavily 
>>>>>>>> and very quickly on BTRFS).
>>>>>>>
>>>>>>> You mean: when defrag performs a write, the new data is unshared 
>>>>>>> because every write is unshared? Really?
>>>>>>>
>>>>>>> Consider there is an extent E55 shared by two files A and B. The 
>>>>>>> defrag has to move E55 to another location. In order to do that, 
>>>>>>> defrag creates a new extent E70. It makes it belong to file A by 
>>>>>>> changing the reflink of extent E55 in file A to point to E70.
>>>>>>>
>>>>>>> Now, to retain the original sharing structure, the defrag has to 
>>>>>>> change the reflink of extent E55 in file B to point to E70. You 
>>>>>>> are telling me this is not possible? Bullshit!
>>>>>>>
>>>>>>> Please explain to me how this 'defrag has to unshare' story of 
>>>>>>> yours isn't an intentional attempt to mislead me.
>>>>>
>>>>>> As mentioned in the previous email, we actually did have a 
>>>>>> (mostly) working reflink-aware defrag a few years back.  It got 
>>>>>> removed because it had serious performance issues.  Note that 
>>>>>> we're not talking a few seconds of extra time to defrag a full 
>>>>>> tree here, we're talking double-digit _minutes_ of extra time to 
>>>>>> defrag a moderate sized (low triple digit GB) subvolume with 
>>>>>> dozens of snapshots, _if you were lucky_ (if you weren't, you 
>>>>>> would be looking at potentially multiple _hours_ of runtime for 
>>>>>> the defrag).  The performance scaled inversely proportionate to 
>>>>>> the number of reflinks involved and the total amount of data in 
>>>>>> the subvolume being defragmented, and was pretty bad even in the 
>>>>>> case of only a couple of snapshots.
>>>>>
>>>>> You cannot ever make the worst program, because an even worse 
>>>>> program can be made by slowing down the original by a factor of 2.
>>>>> So, you had a badly implemented defrag. At least you got some 
>>>>> experience. Let's see what went wrong.
>>>>>
>>>>>> Ultimately, there are a couple of issues at play here:
>>>>>>
>>>>>> * Online defrag has to maintain consistency during operation.  The 
>>>>>> current implementation does this by rewriting the regions being 
>>>>>> defragmented (which causes them to become a single new extent 
>>>>>> (most of the time)), which avoids a whole lot of otherwise 
>>>>>> complicated logic required to make sure things happen correctly, 
>>>>>> and also means that only the file being operated on is impacted 
>>>>>> and only the parts being modified need to be protected against 
>>>>>> concurrent writes.  Properly handling reflinks means that _every_ 
>>>>>> file that shares some part of an extent with the file being 
>>>>>> operated on needs to have the reflinked regions locked for the 
>>>>>> defrag operation, which has a huge impact on performance. Using 
>>>>>> your example, the update to E55 in both files A and B has to 
>>>>>> happen as part of the same commit, which can contain no other 
>>>>>> writes in that region of the file, otherwise you run the risk of 
>>>>>> losing writes to file B that occur while file A is being 
>>>>>> defragmented.
>>>>>
>>>>> Nah. I think there is a workaround. You can first (atomically) 
>>>>> update A, then whatever, then you can update B later. I know, your 
>>>>> yelling "what if E55 gets updated in B". Doesn't matter. The defrag 
>>>>> continues later by searching for reflink to E55 in B. Then it 
>>>>> checks the data contained in E55. If the data matches the E70, then 
>>>>> it can safely update the reflink in B. Or the defrag can just 
>>>>> verify that neither E55 nor E70 have been written to in the 
>>>>> meantime. That means they still have the same data.
>>>
>>>> So, IOW, you don't care if the total space used by the data is 
>>>> instantaneously larger than what you started with?  That seems to be 
>>>> at odds with your previous statements, but OK, if we allow for that 
>>>> then this is indeed a non-issue.
>>>
>>> It is normal and common for defrag operation to use some disk space 
>>> while it is running. I estimate that a reasonable limit would be to 
>>> use up to 1% of total partition size. So, if a partition size is 100 
>>> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
>>>
>>> The defrag should, when started, verify that there is "sufficient 
>>> free space" on the partition. In the case that there is no sufficient 
>>> free space, the defrag should output the message to the user and 
>>> abort. The size of "sufficient free space" must be larger than the 
>>> "defrag operation space". I would estimate that a good limit would be 
>>> 2% of the partition size. "defrag operation space" is a part of 
>>> "sufficient free space" while defrag operation is in progress.
>>>
>>> If, during defrag operation, sufficient free space drops below 2%, 
>>> the defrag should output a message and abort. Another possibility is 
>>> for defrag to pause until the user frees some disk space, but this is 
>>> not common in other defrag implementations AFAIK.
>>>
>>>>>> It's not horrible when it's just a small region in two files, but 
>>>>>> it becomes a big issue when dealing with lots of files and/or 
>>>>>> particularly large extents (extents in BTRFS can get into the GB 
>>>>>> range in terms of size when dealing with really big files).
>>>>>
>>>>> You must just split large extents in a smart way. So, in the 
>>>>> beginning, the defrag can split large extents (2GB) into smaller 
>>>>> ones (32MB) to facilitate more responsive and easier defrag.
>>>>>
>>>>> If you have lots of files, update them one-by one. It is possible. 
>>>>> Or you can update in big batches. Whatever is faster.
>>>
>>>> Neither will solve this though.  Large numbers of files are an issue 
>>>> because the operation is expensive and has to be done on each file, 
>>>> not because the number of files somehow makes the operation more 
>>>> espensive. It's O(n) relative to files, not higher time complexity.
>>>
>>> I would say that updating in big batches helps a lot, to the point 
>>> that it gets almost as fast as defragging any other file system. What 
>>> defrag needs to do is to write a big bunch of defragged file (data) 
>>> extents to the disk, and then update the b-trees. What happens is 
>>> that many of the updates to the b-trees would fall into the same disk 
>>> sector/extent, so instead of many writes there will be just one write.
>>>
>>> Here is the general outline for implementation:
>>>     - write a big bunch of defragged file extents to disk
>>>         - a minimal set of updates of the b-trees that cannot be 
>>> delayed is performed (this is nothing or almost nothing in most 
>>> circumstances)
>>>         - put the rest of required updates of b-trees into "pending 
>>> operations buffer"
>>>     - analyze the "pending operations buffer", and find out 
>>> (approximately) the biggest part of it that can be flushed out by 
>>> doing minimal number of disk writes
>>>         - flush out that part of "pending operations buffer"
>>>     - repeat
> 
>> It helps, but you still can't get around having to recompute the new 
>> tree state, and that is going to take time proportionate to the number 
>> of nodes that need to change, which in turn is proportionate to the 
>> number of files.
> 
> Yes, but that is just a computation. The defrag performance mostly 
> depends on minimizing disk I/O operations, not on computations.
You're assuming the defrag is being done on a system that's otherwise 
perfectly idle.  In the real world, that rarely, if ever, will be the 
case.  The system may be doing other things at the same time, and the 
more computation the defrag operation has to do, the more likely it is 
to negatively impact those other things.
> 
> In the past many good and fast defrag computation algorithms have been 
> produced, and I don't see any reason why this project wouldn't be also 
> able to create such a good algorithm.
Because it's not just the new extent locations you have to compute, you 
also need to compute the resultant metadata tree state, and the 
resultant extent tree state, and after all of that the resultant 
checksum tree state.  Yeah, figuring out optimal block layouts is 
solved, but you can't get around the overhead of recomputing the new 
tree state and all the block checksums for it.

The current defrag has to deal with this too, but it doesn't need to do 
as much computation because it's not worried about preserving reflinks 
(and therefore defragmenting a single file won't require updates to any 
other files).
> 
>>>>> The point is that the defrag can keep a buffer of a "pending 
>>>>> operations". Pending operations are those that should be performed 
>>>>> in order to keep the original sharing structure. If the defrag gets 
>>>>> interrupted, then files in "pending operations" will be unshared. 
>>>>> But this should really be some important and urgent interrupt, as 
>>>>> the "pending operations" buffer needs at most a second or two to 
>>>>> complete its operations.
>>>
>>>> Depending on the exact situation, it can take well more than a few 
>>>> seconds to complete stuff. Especially if there are lots of reflinks.
>>>
>>> Nope. You are quite wrong there.
>>> In the worst case, the "pending operations buffer" will update (write 
>>> to disk) all the b-trees. So, the upper limit on time to flush the 
>>> "pending operations buffer" equals the time to write the entire 
>>> b-tree structure to the disk (into new extents). I estimate that 
>>> takes at most a few seconds.
> 
>> So what you're talking about is journaling the computed state of 
>> defrag operations.  That shouldn't be too bad (as long as it's done in 
>> memory instead of on-disk) if you batch the computations properly.  I 
>> thought you meant having a buffer of what operations to do, and then 
>> computing them on-the-fly (which would have significant overhead)
> 
> Looks close to what I was thinking. Soon we might be able to 
> communicate. I'm not sure what you mean by "journaling the computed 
> state of defrag operations". Maybe it doesn't matter.
Essentially, doing a write-ahead log of pending operations.  Journaling 
is just the common term for such things when dealing with Linux 
filesystems because of ext* and XFS.  Based on what you say below, it 
sounds like we're on the same page here other than the terminology.
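
For what it's worth, a bare-bones sketch of such an in-memory
pending-operations buffer might look like this (purely illustrative; no
actual btrfs structure is named or used here):

    #include <stdio.h>
    #include <stdlib.h>

    struct pending_op { unsigned long long tree_block; };   /* block to rewrite */

    struct pending_buf {
        struct pending_op *ops;
        size_t count, cap;
    };

    static void pending_add(struct pending_buf *b, unsigned long long blk)
    {
        if (b->count == b->cap) {
            size_t new_cap = b->cap ? b->cap * 2 : 64;
            struct pending_op *tmp = realloc(b->ops, new_cap * sizeof(*tmp));
            if (!tmp)
                exit(1);
            b->ops = tmp;
            b->cap = new_cap;
        }
        b->ops[b->count].tree_block = blk;
        b->count++;
    }

    /* Flushing the whole buffer at once lets updates that land in the same
     * tree block be written together instead of one write per update. */
    static void pending_flush(struct pending_buf *b)
    {
        printf("flushing %zu pending metadata updates in one commit\n", b->count);
        b->count = 0;
    }

    int main(void)
    {
        struct pending_buf buf = { 0 };

        for (unsigned long long blk = 0; blk < 1000; blk++)
            pending_add(&buf, blk);
        pending_flush(&buf);
        free(buf.ops);
        return 0;
    }
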
> 
> What happens is that file (extent) data is first written to disk 
> (defragmented), but b-tree is not immediately updated. It doesn't have 
> to be. Even if there is a power loss, nothing happens.
> 
> So, the changes that should be done to the b-trees are put into 
> pending-operations-buffer. When a lot of file (extent) data is written 
> to disk, such that defrag-operation-space (1 GB) is close to being 
> exhausted, the pending-operations-buffer is examined in order to attempt 
> to free as much of defrag-operation-space as possible. The simplest 
> algorithm is to flush the entire pending-operations-buffer at once. This 
> reduces the number of writes that update the b-trees because many 
> changes to the b-trees fall into the same or neighbouring disk sectors.
> 
>>>>>> * Reflinks can reference partial extents.  This means, ultimately, 
>>>>>> that you may end up having to split extents in odd ways during 
>>>>>> defrag if you want to preserve reflinks, and might have to split 
>>>>>> extents _elsewhere_ that are only tangentially related to the 
>>>>>> region being defragmented. See the example in my previous email 
>>>>>> for a case like this, maintaining the shared regions as being 
>>>>>> shared when you defragment either file to a single extent will 
>>>>>> require splitting extents in the other file (in either case, 
>>>>>> whichever file you don't defragment to a single extent will end up 
>>>>>> having 7 extents if you try to force the one that's been 
>>>>>> defragmented to be the canonical version).  Once you consider that 
>>>>>> a given extent can have multiple ranges reflinked from multiple 
>>>>>> other locations, it gets even more complicated.
>>>>>
>>>>> I think that this problem can be solved, and that it can be solved 
>>>>> perfectly (the result is a perfectly-defragmented file). But, if it 
>>>>> is so hard to do, just skip those problematic extents in initial 
>>>>> version of defrag.
>>>>>
>>>>> Ultimately, in the super-duper defrag, those partially-referenced 
>>>>> extents should be split up by defrag.
>>>>>
>>>>>> * If you choose to just not handle the above point by not letting 
>>>>>> defrag split extents, you put a hard lower limit on the amount of 
>>>>>> fragmentation present in a file if you want to preserve reflinks.  
>>>>>> IOW, you can't defragment files past a certain point.  If we go 
>>>>>> this way, neither of the two files in the example from my previous 
>>>>>> email could be defragmented any further than they already are, 
>>>>>> because doing so would require splitting extents.
>>>>>
>>>>> Oh, you're reading my thoughts. That's good.
>>>>>
>>>>> Initial implementation of defrag might be not-so-perfect. It would 
>>>>> still be better than the current defrag.
>>>>>
>>>>> This is not a one-way street. Handling of partially-used extents 
>>>>> can be improved in later versions.
>>>>>
>>>>>> * Determining all the reflinks to a given region of a given extent 
>>>>>> is not a cheap operation, and the information may immediately be 
>>>>>> stale (because an operation right after you fetch the info might 
>>>>>> change things).  We could work around this by locking the extent 
>>>>>> somehow, but doing so would be expensive because you would have to 
>>>>>> hold the lock for the entire defrag operation.
>>>>>
>>>>> No. DO NOT LOCK TO RETRIEVE REFLINKS.
>>>>>
>>>>> Instead, you have to create a hook in every function that updates 
>>>>> the reflink structure or extents (for exaple, write-to-file 
>>>>> operation). So, when a reflink gets changed, the defrag is 
>>>>> immediately notified about this. That way the defrag can keep its 
>>>>> data about reflinks in-sync with the filesystem.
>>>
>>>> This doesn't get around the fact that it's still an expensive 
>>>> operation to enumerate all the reflinks for a given region of a file 
>>>> or extent.
>>>
>>> No, you are wrong.
>>>
>>> In order to enumerate all the reflinks in a region, the defrag needs 
>>> to have another array, which is also kept in memory and in sync with 
>>> the filesystem. It is the easiest to divide the disk into regions of 
>>> equal size, where each region is a few MB large. Lets call this array 
>>> "regions-to-extents" array. This array doesn't need to be 
>>> associative, it is a plain array.
>>> This in-memory array links regions of disk to extents that are in the 
>>> region. The array in initialized when defrag starts.
>>>
>>> This array makes the operation of finding all extents of a region 
>>> extremely fast.
>> That has two issues:
>>
>> * That's going to be a _lot_ of memory.  You still need to be able to 
>> defragment big (dozens plus TB) arrays without needing multiple GB of 
>> RAM just for the defrag operation, otherwise it's not realistically 
>> useful (remember, it was big arrays that had issues with the old 
>> reflink-aware defrag too).
> 
> Ok, but let's get some calculations there. If regions are 4 MB in size, 
> the region-extents array for an 8 TB partition would have 2 million 
> entries. If entries average 64 bytes, that would be:
> 
>   - a total of 128 MB memory for an 8 TB partition.
> 
> Of course, I'm guessing a lot of numbers there, but it should be doable.
Even if we assume such an optimistic estimate as you provide (I 
suspect it will require more than 64 bytes per entry), that's a lot of 
RAM when you look at what it's potentially displacing.  That's enough 
RAM for receive and transmit buffers for a few hundred thousand network 
connections, or for caching multiple hundreds of thousands of dentries, 
or a few hundred thousand inodes.  Hell, that's enough RAM to run all 
the standard network services for a small network (DHCP, DNS, NTP, TFTP, 
mDNS relay, UPnP/NAT-PMP, SNMP, IGMP proxy, VPN of your choice) at least 
twice over.
> 
>> * You still have to populate the array in the first place.  A sane 
>> implementation wouldn't be keeping it in memory even when defrag is 
>> not running (no way is anybody going to tolerate even dozens of MB of 
>> memory overhead for this), so you're not going to get around the need 
>> to enumerate all the reflinks for a file at least once (during 
>> startup, or when starting to process that file), so you're just moving 
>> the overhead around instead of eliminating it.
> 
> Yes, when the defrag starts, the entire b-tree structure is examined in 
> order for region-extents array and extents-backref associative array to 
> be populated.
So your startup is going to take forever on any reasonably large volume. 
  This isn't eliminating the overhead, it's just moving it all to one 
place.  That might make it a bit more efficient than it would be 
interspersed throughout the operation, but only because it is reading 
all the relevant data at once.
> 
> Of course, those two arrays exist only during defrag operation. When 
> defrag completes, those arrays are deallocated.
> 
>>>> It also allows a very real possibility of a user functionally 
>>>> delaying the defrag operation indefinitely (by triggering a 
>>>> continuous stream of operations that would cause reflink changes for 
>>>> a file being operated on by defrag) if not implemented very carefully.
>>>
>>> Yes, if a user does something like that, the defrag can be paused or 
>>> even aborted. That is normal.
>> Not really.  Most defrag implementations either avoid files that could 
>> reasonably be written to, or freeze writes to the file they're 
>> operating on, or in some other way just sidestep the issue without 
>> delaying the defragmentation process.
>>>
>>> There are many ways around this problem, but it really doesn't 
>>> matter, those are just details. The initial version of defrag can 
>>> just abort. The more mature versions of defrag can have a better 
>>> handling of this problem.
> 
>> Details like this are the deciding factor for whether something is 
>> sanely usable in certain use cases, as you have yourself found out 
>> (for a lot of users, the fact that defrag can unshare extents is 'just 
>> a detail' that's not worth worrying about).
> 
> I wouldn't agree there.
> 
> Not every issue is equal. Some issues are more important, some are 
> trivial, some are tolerable etc...
> 
> The defrag is usually allowed to abort. It can easily be restarted 
> later. Workaround: You can make a defrag-supervisor program, which 
> starts a defrag, and if defrag aborts then it is restarted after some 
> (configurable) amount of time.
The fact that the defrag can be functionally deferred indefinitely by a 
user means that a user can, with a bit of effort, force degraded 
performance for everyone using the system.  Aborting the defrag doesn't 
solve that, and it's a significant issue for anybody doing shared hosting.
> 
> On the other hand, unsharing is not easy to get undone.
But, again, this just doesn't matter for some people.
> 
> So, those issues are not equals.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 11:53                               ` Austin S. Hemmelgarn
@ 2019-09-13 16:54                                 ` General Zed
  2019-09-13 18:29                                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-13 16:54 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-12 18:21, General Zed wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>>> On 2019-09-12 15:18, webmaster@zedlx.com wrote:
>>>>
>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>
>>>>> On 2019-09-11 17:37, webmaster@zedlx.com wrote:
>>>>>>
>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>
>>>>>>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>>>>>>
>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>
>>>>>>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>>>>>>
>>>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>>>
>>>>>>
>>>>>>>>> Given this, defrag isn't willfully unsharing anything, it's  
>>>>>>>>> just a side-effect of how it works (since it's rewriting the  
>>>>>>>>> block layout of the file in-place).
>>>>>>>>
>>>>>>>> The current defrag has to unshare because, as you said,  
>>>>>>>> because it is unaware of the full reflink structure. If it  
>>>>>>>> doesn't know about all reflinks, it has to unshare, there is  
>>>>>>>> no way around that.
>>>>>>>>
>>>>>>>>> Now factor in that _any_ write will result in unsharing the  
>>>>>>>>> region being written to, rounded to the nearest full  
>>>>>>>>> filesystem block in both directions (this is mandatory, it's  
>>>>>>>>> a side effect of the copy-on-write nature of BTRFS, and is  
>>>>>>>>> why files that experience heavy internal rewrites get  
>>>>>>>>> fragmented very heavily and very quickly on BTRFS).
>>>>>>>>
>>>>>>>> You mean: when defrag performs a write, the new data is  
>>>>>>>> unshared because every write is unshared? Really?
>>>>>>>>
>>>>>>>> Consider there is an extent E55 shared by two files A and B.  
>>>>>>>> The defrag has to move E55 to another location. In order to  
>>>>>>>> do that, defrag creates a new extent E70. It makes it belong  
>>>>>>>> to file A by changing the reflink of extent E55 in file A to  
>>>>>>>> point to E70.
>>>>>>>>
>>>>>>>> Now, to retain the original sharing structure, the defrag has  
>>>>>>>> to change the reflink of extent E55 in file B to point to  
>>>>>>>> E70. You are telling me this is not possible? Bullshit!
>>>>>>>>
>>>>>>>> Please explain to me how this 'defrag has to unshare' story  
>>>>>>>> of yours isn't an intentional attempt to mislead me.
>>>>>>
>>>>>>> As mentioned in the previous email, we actually did have a  
>>>>>>> (mostly) working reflink-aware defrag a few years back.  It  
>>>>>>> got removed because it had serious performance issues.  Note  
>>>>>>> that we're not talking a few seconds of extra time to defrag a  
>>>>>>> full tree here, we're talking double-digit _minutes_ of extra  
>>>>>>> time to defrag a moderate sized (low triple digit GB)  
>>>>>>> subvolume with dozens of snapshots, _if you were lucky_ (if  
>>>>>>> you weren't, you would be looking at potentially multiple  
>>>>>>> _hours_ of runtime for the defrag).  The performance scaled  
>>>>>>> inversely proportionate to the number of reflinks involved and  
>>>>>>> the total amount of data in the subvolume being defragmented,  
>>>>>>> and was pretty bad even in the case of only a couple of  
>>>>>>> snapshots.
>>>>>>
>>>>>> You cannot ever make the worst program, because an even worse  
>>>>>> program can be made by slowing down the original by a factor of  
>>>>>> 2.
>>>>>> So, you had a badly implemented defrag. At least you got some  
>>>>>> experience. Let's see what went wrong.
>>>>>>
>>>>>>> Ultimately, there are a couple of issues at play here:
>>>>>>>
>>>>>>> * Online defrag has to maintain consistency during operation.   
>>>>>>> The current implementation does this by rewriting the regions  
>>>>>>> being defragmented (which causes them to become a single new  
>>>>>>> extent (most of the time)), which avoids a whole lot of  
>>>>>>> otherwise complicated logic required to make sure things  
>>>>>>> happen correctly, and also means that only the file being  
>>>>>>> operated on is impacted and only the parts being modified need  
>>>>>>> to be protected against concurrent writes.  Properly handling  
>>>>>>> reflinks means that _every_ file that shares some part of an  
>>>>>>> extent with the file being operated on needs to have the  
>>>>>>> reflinked regions locked for the defrag operation, which has a  
>>>>>>> huge impact on performance. Using your example, the update to  
>>>>>>> E55 in both files A and B has to happen as part of the same  
>>>>>>> commit, which can contain no other writes in that region of  
>>>>>>> the file, otherwise you run the risk of losing writes to file  
>>>>>>> B that occur while file A is being defragmented.
>>>>>>
>>>>>> Nah. I think there is a workaround. You can first (atomically)  
>>>>>> update A, then whatever, then you can update B later. I know,  
>>>>>> your yelling "what if E55 gets updated in B". Doesn't matter.  
>>>>>> The defrag continues later by searching for reflink to E55 in  
>>>>>> B. Then it checks the data contained in E55. If the data  
>>>>>> matches the E70, then it can safely update the reflink in B. Or  
>>>>>> the defrag can just verify that neither E55 nor E70 have been  
>>>>>> written to in the meantime. That means they still have the same  
>>>>>> data.
>>>>
>>>>> So, IOW, you don't care if the total space used by the data is  
>>>>> instantaneously larger than what you started with?  That seems  
>>>>> to be at odds with your previous statements, but OK, if we allow  
>>>>> for that then this is indeed a non-issue.
>>>>
>>>> It is normal and common for defrag operation to use some disk  
>>>> space while it is running. I estimate that a reasonable limit  
>>>> would be to use up to 1% of total partition size. So, if a  
>>>> partition size is 100 GB, the defrag can use 1 GB. Lets call this  
>>>> "defrag operation space".
>>>>
>>>> The defrag should, when started, verify that there is "sufficient  
>>>> free space" on the partition. In the case that there is no  
>>>> sufficient free space, the defrag should output the message to  
>>>> the user and abort. The size of "sufficient free space" must be  
>>>> larger than the "defrag operation space". I would estimate that a  
>>>> good limit would be 2% of the partition size. "defrag operation  
>>>> space" is a part of "sufficient free space" while defrag  
>>>> operation is in progress.
>>>>
>>>> If, during defrag operation, sufficient free space drops below  
>>>> 2%, the defrag should output a message and abort. Another  
>>>> possibility is for defrag to pause until the user frees some disk  
>>>> space, but this is not common in other defrag implementations  
>>>> AFAIK.
>>>>
>>>>>>> It's not horrible when it's just a small region in two files,  
>>>>>>> but it becomes a big issue when dealing with lots of files  
>>>>>>> and/or particularly large extents (extents in BTRFS can get  
>>>>>>> into the GB range in terms of size when dealing with really  
>>>>>>> big files).
>>>>>>
>>>>>> You must just split large extents in a smart way. So, in the  
>>>>>> beginning, the defrag can split large extents (2GB) into  
>>>>>> smaller ones (32MB) to facilitate more responsive and easier  
>>>>>> defrag.
>>>>>>
>>>>>> If you have lots of files, update them one-by one. It is  
>>>>>> possible. Or you can update in big batches. Whatever is faster.
>>>>
>>>>> Neither will solve this though.  Large numbers of files are an  
>>>>> issue because the operation is expensive and has to be done on  
>>>>> each file, not because the number of files somehow makes the  
>>>>> operation more espensive. It's O(n) relative to files, not  
>>>>> higher time complexity.
>>>>
>>>> I would say that updating in big batches helps a lot, to the  
>>>> point that it gets almost as fast as defragging any other file  
>>>> system. What defrag needs to do is to write a big bunch of  
>>>> defragged file (data) extents to the disk, and then update the  
>>>> b-trees. What happens is that many of the updates to the b-trees  
>>>> would fall into the same disk sector/extent, so instead of many  
>>>> writes there will be just one write.
>>>>
>>>> Here is the general outline for implementation:
>>>>     - write a big bunch of defragged file extents to disk
>>>>         - a minimal set of updates of the b-trees that cannot be  
>>>> delayed is performed (this is nothing or almost nothing in most  
>>>> circumstances)
>>>>         - put the rest of required updates of b-trees into  
>>>> "pending operations buffer"
>>>>     - analyze the "pending operations buffer", and find out  
>>>> (approximately) the biggest part of it that can be flushed out by  
>>>> doing minimal number of disk writes
>>>>         - flush out that part of "pending operations buffer"
>>>>     - repeat
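To make that outline concrete, here is a toy simulation of the batching
loop (the 1 GB operation space, the 128 MB batch size and the flush policy
are assumptions from this discussion, not anything implemented in btrfs):

    /* Toy simulation of the batching loop outlined above: data extents
     * are "written" in batches and the matching b-tree updates are
     * deferred into a pending buffer, which is flushed whenever the
     * 1 GB operation space would be exceeded.  Everything here is a
     * hypothetical stand-in, not btrfs code. */
    #include <stdio.h>

    #define OP_SPACE   (1ULL << 30)   /* 1 GB "defrag operation space"  */
    #define BATCH      (128ULL << 20) /* data written per batch: 128 MB */
    #define TOTAL_WORK (10ULL << 30)  /* pretend we defrag 10 GB        */

    int main(void)
    {
        unsigned long long written = 0;  /* defragged data written so far  */
        unsigned long long pinned  = 0;  /* temp space held by pending ops */
        unsigned long long flushes = 0;

        while (written < TOTAL_WORK) {
            /* 1. write a batch of defragged file extents */
            written += BATCH;

            /* 2. defer the matching b-tree updates; until they are
             *    flushed, the old extents cannot be freed, so they pin
             *    temporary space */
            pinned += BATCH;

            /* 3. flush before operation space runs out; one flush
             *    coalesces many b-tree updates into few disk writes */
            if (pinned >= OP_SPACE) {
                pinned = 0;
                flushes++;
            }
        }
        if (pinned > 0)
            flushes++;                   /* final flush */

        printf("defragged %llu GB with %llu metadata flushes\n",
               written >> 30, flushes);
        return 0;
    }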
>>
>>> It helps, but you still can't get around having to recompute the  
>>> new tree state, and that is going to take time proportionate to  
>>> the number of nodes that need to change, which in turn is  
>>> proportionate to the number of files.
>>
>> Yes, but that is just a computation. The defrag performance mostly  
>> depends on minimizing disk I/O operations, not on computations.

> You're assuming the defrag is being done on a system that's  
> otherwise perfectly idle.  In the real world, that rarely, if ever,  
> will be the case,  The system may be doing other things at the same  
> time, and the more computation the defrag operation has to do, the  
> more likely it is to negatively impact those other things.

No, I'm not assuming that the system is perfectly idle. I'm assuming  
that the required computations don't take much CPU time, as is common  
in a well-implemented defrag.

>> In the past many good and fast defrag computation algorithms have  
>> been produced, and I don't see any reason why this project wouldn't  
>> be also able to create such a good algorithm.

> Because it's not just the new extent locations you have to compute,  
> you also need to compute the resultant metadata tree state, and the  
> resultant extent tree state, and after all of that the resultant  
> checksum tree state.  Yeah, figuring out optimal block layouts is  
> solved, but you can't get around the overhead of recomputing the new  
> tree state and all the block checksums for it.
>
> The current defrag has to deal with this too, but it doesn't need to  
> do as much computation because it's not worried about preserving  
> reflinks (and therefore defragmenting a single file won't require  
> updates to any other files).

Yes, the defrag algorithm needs to compute the new tree state.  
However, that shouldn't be slow at all. All the b-tree operations  
involved can be done in at most O(N log N) time, which is sufficiently  
fast. There is no operation there that I can think of that takes N*N  
or N*M time. So it should all take little CPU time. Essentially a  
non-issue.

The ONLY concern that could cause N*M time is the presence of sharing.  
But even that is overstated, as the computation time is still  
O(N log N) with regard to the total number of reflinks. That is still  
fast, even for 100 GB of metadata with a billion reflinks.

I don't understand why you think that recomputing the new tree state  
must be slow. Even if there are 100 new tree states that need to be  
recomputed, there is still no problem. Each metadata update changes  
only a small portion of the b-trees, so the complexity and size of the  
b-trees should not seriously affect the computation time.
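As a back-of-the-envelope check of that claim, the following tiny program
shows how few nodes a single copy-on-write b-tree update actually touches
(the fanout figure is a rough assumption, not a measured btrfs constant):

    /* With a COW b-tree, one leaf update rewrites only the nodes on the
     * path from that leaf to the root, i.e. O(log N) nodes, no matter
     * how big the tree is.  Build with: cc path.c -lm */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double fanout = 200.0;           /* assumed items per 16K node */
        double leaves[] = { 1e4, 1e6, 1e8, 1e9 };

        for (int i = 0; i < 4; i++) {
            /* tree height = nodes on the leaf-to-root path */
            int height = (int)ceil(log(leaves[i]) / log(fanout)) + 1;
            printf("%.0e items -> ~%d nodes (~%d KiB) rewritten per update\n",
                   leaves[i], height, height * 16);
        }
        return 0;
    }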

>>>>>> The point is that the defrag can keep a buffer of a "pending  
>>>>>> operations". Pending operations are those that should be  
>>>>>> performed in order to keep the original sharing structure. If  
>>>>>> the defrag gets interrupted, then files in "pending operations"  
>>>>>> will be unshared. But this should really be some important and  
>>>>>> urgent interrupt, as the "pending operations" buffer needs at  
>>>>>> most a second or two to complete its operations.
>>>>
>>>>> Depending on the exact situation, it can take well more than a  
>>>>> few seconds to complete stuff. Especially if there are lots of  
>>>>> reflinks.
>>>>
>>>> Nope. You are quite wrong there.
>>>> In the worst case, the "pending operations buffer" will update  
>>>> (write to disk) all the b-trees. So, the upper limit on time to  
>>>> flush the "pending operations buffer" equals the time to write  
>>>> the entire b-tree structure to the disk (into new extents). I  
>>>> estimate that takes at most a few seconds.
>>
>>> So what you're talking about is journaling the computed state of  
>>> defrag operations.  That shouldn't be too bad (as long as it's  
>>> done in memory instead of on-disk) if you batch the computations  
>>> properly.  I thought you meant having a buffer of what operations  
>>> to do, and then computing them on-the-fly (which would have  
>>> significant overhead)
>>
>> Looks close to what I was thinking. Soon we might be able to  
>> communicate. I'm not sure what you mean by "journaling the computed  
>> state of defrag operations". Maybe it doesn't matter.

> Essentially, doing a write-ahead log of pending operations.   
> Journaling is just the common term for such things when dealing with  
> Linux filesystems because of ext* and XFS.  Based on what you say  
> below, it sounds like we're on the same page here other than the  
> terminology.
>>
>> What happens is that file (extent) data is first written to disk  
>> (defragmented), but b-tree is not immediately updated. It doesn't  
>> have to be. Even if there is a power loss, nothing happens.
>>
>> So, the changes that should be done to the b-trees are put into  
>> pending-operations-buffer. When a lot of file (extent) data is  
>> written to disk, such that defrag-operation-space (1 GB) is close  
>> to being exhausted, the pending-operations-buffer is examined in  
>> order to attempt to free as much of defrag-operation-space as  
>> possible. The simplest algorithm is to flush the entire  
>> pending-operations-buffer at once. This reduces the number of  
>> writes that update the b-trees because many changes to the b-trees  
>> fall into the same or neighbouring disk sectors.
>>
>>>>>>> * Reflinks can reference partial extents.  This means,  
>>>>>>> ultimately, that you may end up having to split extents in odd  
>>>>>>> ways during defrag if you want to preserve reflinks, and might  
>>>>>>> have to split extents _elsewhere_ that are only tangentially  
>>>>>>> related to the region being defragmented. See the example in  
>>>>>>> my previous email for a case like this, maintaining the shared  
>>>>>>> regions as being shared when you defragment either file to a  
>>>>>>> single extent will require splitting extents in the other file  
>>>>>>> (in either case, whichever file you don't defragment to a  
>>>>>>> single extent will end up having 7 extents if you try to force  
>>>>>>> the one that's been defragmented to be the canonical  
>>>>>>> version).  Once you consider that a given extent can have  
>>>>>>> multiple ranges reflinked from multiple other locations, it  
>>>>>>> gets even more complicated.
>>>>>>
>>>>>> I think that this problem can be solved, and that it can be  
>>>>>> solved perfectly (the result is a perfectly-defragmented file).  
>>>>>> But, if it is so hard to do, just skip those problematic  
>>>>>> extents in initial version of defrag.
>>>>>>
>>>>>> Ultimately, in the super-duper defrag, those  
>>>>>> partially-referenced extents should be split up by defrag.
>>>>>>
>>>>>>> * If you choose to just not handle the above point by not  
>>>>>>> letting defrag split extents, you put a hard lower limit on  
>>>>>>> the amount of fragmentation present in a file if you want to  
>>>>>>> preserve reflinks.  IOW, you can't defragment files past a  
>>>>>>> certain point.  If we go this way, neither of the two files in  
>>>>>>> the example from my previous email could be defragmented any  
>>>>>>> further than they already are, because doing so would require  
>>>>>>> splitting extents.
>>>>>>
>>>>>> Oh, you're reading my thoughts. That's good.
>>>>>>
>>>>>> Initial implementation of defrag might be not-so-perfect. It  
>>>>>> would still be better than the current defrag.
>>>>>>
>>>>>> This is not a one-way street. Handling of partially-used  
>>>>>> extents can be improved in later versions.
>>>>>>
>>>>>>> * Determining all the reflinks to a given region of a given  
>>>>>>> extent is not a cheap operation, and the information may  
>>>>>>> immediately be stale (because an operation right after you  
>>>>>>> fetch the info might change things).  We could work around  
>>>>>>> this by locking the extent somehow, but doing so would be  
>>>>>>> expensive because you would have to hold the lock for the  
>>>>>>> entire defrag operation.
>>>>>>
>>>>>> No. DO NOT LOCK TO RETRIEVE REFLINKS.
>>>>>>
>>>>>> Instead, you have to create a hook in every function that  
>>>>>> updates the reflink structure or extents (for example,  
>>>>>> write-to-file operation). So, when a reflink gets changed, the  
>>>>>> defrag is immediately notified about this. That way the defrag  
>>>>>> can keep its data about reflinks in-sync with the filesystem.
>>>>
>>>>> This doesn't get around the fact that it's still an expensive  
>>>>> operation to enumerate all the reflinks for a given region of a  
>>>>> file or extent.
>>>>
>>>> No, you are wrong.
>>>>
>>>> In order to enumerate all the reflinks in a region, the defrag  
>>>> needs to have another array, which is also kept in memory and in  
>>>> sync with the filesystem. It is the easiest to divide the disk  
>>>> into regions of equal size, where each region is a few MB large.  
>>>> Lets call this array "regions-to-extents" array. This array  
>>>> doesn't need to be associative, it is a plain array.
>>>> This in-memory array links regions of disk to extents that are in  
>>>> the region. The array is initialized when defrag starts.
>>>>
>>>> This array makes the operation of finding all extents of a region  
>>>> extremely fast.
>>> That has two issues:
>>>
>>> * That's going to be a _lot_ of memory.  You still need to be able  
>>> to defragment big (dozens plus TB) arrays without needing multiple  
>>> GB of RAM just for the defrag operation, otherwise it's not  
>>> realistically useful (remember, it was big arrays that had issues  
>>> with the old reflink-aware defrag too).
>>
>> Ok, but let's get some calculations there. If regions are 4 MB in  
>> size, the region-extents array for an 8 TB partition would have 2  
>> million entries. If entries average 64 bytes, that would be:
>>
>>  - a total of 128 MB memory for an 8 TB partition.
>>
>> Of course, I'm guessing a lot of numbers there, but it should be doable.

> Even if we assume such an optimistic estimation as you provide (I  
> suspect it will require more than 64 bytes per-entry), that's a lot  
> of RAM when you look at what it's potentially displacing.  That's  
> enough RAM for receive and transmit buffers for a few hundred  
> thousand network connections, or for caching multiple hundreds of  
> thousands of dentries, or a few hundred thousand inodes.  Hell,  
> that's enough RAM to run all the standard network services for a  
> small network (DHCP, DNS, NTP, TFTP, mDNS relay, UPnP/NAT-PMP, SNMP,  
> IGMP proxy, VPN of your choice) at least twice over.

That depends on the average size of an extent. If the average size of  
an extent is around 4 MB, then my numbers should be good. Do you have  
any data which would suggest that my estimate is wrong? What's the  
average size of an extent on your filesystems (used space divided by  
the number of extents)?
This "regions-to-extents" array can be further optimized if necessary.

You are not thinking correctly there (misplaced priorities). If the  
system needs to be defragmented, that's the priority. You can't make  
comparisons like that; that's unfair debating.

The defrag that I'm proposing should be able to run within the common  
memory limits of today's computer systems. So, it will likely take  
somewhat less than 700 MB of RAM in most common situations, including  
small servers, which all have at least that much RAM.

700 MB is a lot for a defrag, but there is no way around it. Btrfs is  
simply a filesystem of such complexity that a good defrag requires a  
lot of RAM to operate.

If, for some reason, you would like to cover a use-case with  
constrained RAM conditions, then that is an entirely different concern  
for a different project. You can't make a project like this to cover  
ALL the possible circumstances. Some cases have to be left out. Here  
we are talking about a defrag that is usable in a general and common  
set of circumstances.

Please don't drop the special-circumstances argument on me. That's not fair.

>>> * You still have to populate the array in the first place.  A sane  
>>> implementation wouldn't be keeping it in memory even when defrag  
>>> is not running (no way is anybody going to tolerate even dozens of  
>>> MB of memory overhead for this), so you're not going to get around  
>>> the need to enumerate all the reflinks for a file at least once  
>>> (during startup, or when starting to process that file), so you're  
>>> just moving the overhead around instead of eliminating it.
>>
>> Yes, when the defrag starts, the entire b-tree structure is  
>> examined in order for region-extents array and extents-backref  
>> associative array to be populated.

> So your startup is going to take forever on any reasonably large  
> volume.  This isn't eliminating the overhead, it's just moving it  
> all to one place.  That might make it a bit more efficient than it  
> would be interspersed throughout the operation, but only because it  
> is reading all the relevant data at once.

No, the startup will not take forever.

The startup needs exactly 1 (one) pass through the entire metadata. It  
needs this to find all the backlinks and to populate the  
"regions-to-extents" array. The time to do 1 pass through the metadata  
depends on the metadata size on disk, as the entire metadata has to be  
read out (one piece at a time, you won't keep it all in RAM). In most  
cases, the time to read the metadata will be less than 1 minute, on an  
SSD less than 20 seconds.

There is no way around it: to defrag, you eventually need to read all  
the b-trees, so nothing is lost there.

All computations in this defrag are simple. Finding all reflinks in  
the metadata is simple. It is a single-pass metadata read-out.
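As an illustration of what that single pass amounts to, here is a toy
version of it (the extent records, the 4 MB region size and the counters
are invented for the example; the real pass would read the extent tree
sequentially):

    /* Toy version of the startup pass: walk a flat list of extent
     * records once, bucket each extent into its 4 MB region and tally
     * its backrefs.  The records here are made up; in the real thing
     * they would come from one sequential read of the extent tree. */
    #include <stdio.h>

    #define REGION_SHIFT 22              /* 4 MB regions                 */
    #define NREGIONS     64              /* enough for this toy example  */

    struct extent_rec {
        unsigned long long bytenr;       /* where the extent starts      */
        unsigned long long len;
        unsigned int       nrefs;        /* number of reflinks/backrefs  */
    };

    int main(void)
    {
        struct extent_rec extents[] = {
            {   0 * (1ULL << 20), 4 << 20, 1 },
            {   9 * (1ULL << 20), 2 << 20, 3 },   /* shared by 3 files */
            {  64 * (1ULL << 20), 8 << 20, 2 },
            { 130 * (1ULL << 20), 1 << 20, 1 },
        };
        unsigned int extents_in_region[NREGIONS] = { 0 };
        unsigned long long total_refs = 0;

        /* the one and only pass over the metadata */
        for (size_t i = 0; i < sizeof(extents) / sizeof(extents[0]); i++) {
            unsigned int region = extents[i].bytenr >> REGION_SHIFT;
            extents_in_region[region]++;
            total_refs += extents[i].nrefs;
        }

        for (unsigned int r = 0; r < NREGIONS; r++)
            if (extents_in_region[r])
                printf("region %u: %u extents start here\n",
                       r, extents_in_region[r]);
        printf("total backrefs seen: %llu\n", total_refs);
        return 0;
    }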

>> Of course, those two arrays exist only during defrag operation.  
>> When defrag completes, those arrays are deallocated.
>>
>>>>> It also allows a very real possibility of a user functionally  
>>>>> delaying the defrag operation indefinitely (by triggering a  
>>>>> continuous stream of operations that would cause reflink changes  
>>>>> for a file being operated on by defrag) if not implemented very  
>>>>> carefully.
>>>>
>>>> Yes, if a user does something like that, the defrag can be paused  
>>>> or even aborted. That is normal.
>>> Not really.  Most defrag implementations either avoid files that  
>>> could reasonably be written to, or freeze writes to the file  
>>> they're operating on, or in some other way just sidestep the issue  
>>> without delaying the defragmentation process.
>>>>
>>>> There are many ways around this problem, but it really doesn't  
>>>> matter, those are just details. The initial version of defrag can  
>>>> just abort. The more mature versions of defrag can have a better  
>>>> handling of this problem.
>>
>>> Details like this are the deciding factor for whether something is  
>>> sanely usable in certain use cases, as you have yourself found out  
>>> (for a lot of users, the fact that defrag can unshare extents is  
>>> 'just a detail' that's not worth worrying about).
>>
>> I wouldn't agree there.
>>
>> Not every issue is equal. Some issues are more important, some are  
>> trivial, some are tolerable etc...
>>
>> The defrag is usually allowed to abort. It can easily be restarted  
>> later. Workaround: You can make a defrag-supervisor program, which  
>> starts a defrag, and if defrag aborts then it is restarted after  
>> some (configurable) amount of time.
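A minimal sketch of such a supervisor, just to show how small the
workaround is (it simply re-runs whatever defrag command line it is given
after a configurable delay; nothing here is an existing btrfs tool):

    /* Hypothetical defrag-supervisor: run the given command, and if it
     * exits non-zero (aborted), wait and start it again.
     * Usage: ./defrag-supervisor <retry-delay-seconds> <command> [args...] */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <retry-delay-seconds> <command> [args...]\n",
                    argv[0]);
            return 1;
        }
        unsigned int delay = (unsigned int)atoi(argv[1]);

        for (;;) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return 1;
            }
            if (pid == 0) {                      /* child: run the defrag */
                execvp(argv[2], &argv[2]);
                perror("execvp");
                _exit(127);
            }
            int status = 0;
            if (waitpid(pid, &status, 0) < 0) {
                perror("waitpid");
                return 1;
            }
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
                printf("defrag finished cleanly\n");
                return 0;
            }
            fprintf(stderr, "defrag aborted; retrying in %u seconds\n", delay);
            sleep(delay);
        }
    }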

> The fact that the defrag can be functionally deferred indefinitely  
> by a user means that a user can, with a bit of effort, force  
> degraded performance for everyone using the system.  Aborting the  
> defrag doesn't solve that, and it's a significant issue for anybody  
> doing shared hosting.

This is a quality-of-implementation issue. Not worthy of consideration  
at this time. It can be solved.

You can go and pick at this kind of stuff all the time, with any  
system. I mean, by that logic, because of the FACT that we have never  
proven that all security holes are eliminated, the computers shouldn't  
be powered on at all. Therefore, all computers should be shut down  
immediately, there is absolutely no need to continue working on btrfs,  
and it is impossible to produce the btrfs defrag, because all the  
computers have to be shut down immediately.

Can we have a bit more fair discussion? Please?

>>
>> On the other hand, unsharing is not easy to get undone.
> But, again, this just doesn't matter for some people.
>>
>> So, those issues are not equals.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  9:25                                           ` General Zed
@ 2019-09-13 17:02                                             ` General Zed
  2019-09-14  0:59                                             ` Zygo Blaxell
  1 sibling, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-13 17:02 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting General Zed <general-zed@zedlx.com>:

> Quoting General Zed <general-zed@zedlx.com>:
>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>>> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>>>
>>>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>>
>>>>> On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>>>>>>
>>>>>> At worst, it just has to completely write-out "all metadata",  
>>>>>> all the way up
>>>>>> to the super. It needs to be done just once, because what's the point of
>>>>>> writing it 10 times over? Then, the super is updated as the  
>>>>>> final commit.
>>>>>
>>>>> This is kind of a silly discussion.  The biggest extent possible on
>>>>> btrfs is 128MB, and the incremental gains of forcing 128MB extents to
>>>>> be consecutive are negligible.  If you're defragging a 10GB file, you're
>>>>> just going to end up doing 80 separate defrag operations.
>>>>
>>>> Ok, then the max extent is 128 MB, that's fine. Someone here  
>>>> previously said
>>>> that it is 2 GB, so he has disinformed me (in order to further his false
>>>> argument).
>>>
>>> If the 128MB limit is removed, you then hit the block group size limit,
>>> which is some number of GB from 1 to 10 depending on number of disks
>>> available and raid profile selection (the striping raid profiles cap
>>> block group sizes at 10 disks, and single/raid1 profiles always use 1GB
>>> block groups regardless of disk count).  So 2GB is _also_ a valid extent
>>> size limit, just not the first limit that is relevant for defrag.
>>>
>>> A lot of people get confused by 'filefrag -v' output, which coalesces
>>> physically adjacent but distinct extents.  So if you use that tool,
>>> it can _seem_ like there is a 2.5GB extent in a file, but it is really
>>> 20 distinct 128MB extents that start and end at adjacent addresses.
>>> You can see the true structure in 'btrfs ins dump-tree' output.
>>>
>>> That also brings up another reason why 10GB defrags are absurd on btrfs:
>>> extent addresses are virtual.  There's no guarantee that a pair of extents
>>> that meet at a block group boundary are physically adjacent, and after
>>> operations like RAID array reorganization or free space defragmentation,
>>> they are typically quite far apart physically.
>>>
>>>> I didn't ever said that I would force extents larger than 128 MB.
>>>>
>>>> If you are defragging a 10 GB file, you'll likely have to do it  
>>>> in 10 steps,
>>>> because the defrag is usually allowed to only use a limited amount of disk
>>>> space while in operation. That has nothing to do with the extent size.
>>>
>>> Defrag is literally manipulating the extent size.  Fragments and extents
>>> are the same thing in btrfs.
>>>
>>> Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
>>> commit metadata updates after each step, so more than 128MB of temporary
>>> space may be used (especially if your disks are fast and empty,
>>> and you start just after the end of the previous commit interval).
>>> There are some opportunities to coalesce metadata updates, occupying up
>>> to a (arbitrary) limit of 512MB of RAM (or when memory pressure forces
>>> a flush, whichever comes first), but exploiting those opportunities
>>> requires more space for uncommitted data.
>>>
>>> If the filesystem starts to get low on space during a defrag, it can
>>> inject commits to force metadata updates to happen more often, which
>>> reduces the amount of temporary space needed (we can't delete the original
>>> fragmented extents until their replacement extent is committed); however,
>>> if the filesystem is so low on space that you're worried about running
>>> out during a defrag, then you probably don't have big enough contiguous
>>> free areas to relocate data into anyway, i.e. the defrag is just going to
>>> push data from one fragmented location to a different fragmented location,
>>> or bail out with "sorry, can't defrag that."
>>
>> Nope.
>>
>> Each defrag "cycle" consists of two parts:
>>     1) move-out part
>>     2) move-in part
>>
>> The move-out part selects one contiguous area of the disk. Almost  
>> any area will do, but some smart choices are better. It then  
>> moves-out all data from that contiguous area into whatever holes  
>> there are left empty on the disk. The biggest problem is actually  
>> updating the metadata, since the updates are not localized.
>> Anyway, this part can even be skipped.
>>
>> The move-in part now populates the completely free contiguous area  
>> with defragmented data.
>>
>> In the case that the move-out part needs to be skipped because the  
>> defrag estimates that the update to metadata will be too big (like  
>> in the pathological case of a disk with 156 GB of metadata), it can  
>> successfully defrag by performing only the move-in part. In that  
>> case, the move-in area is not free of data and "defragmented" data  
>> won't be fully defragmented. Also, there should be at least 20%  
>> free disk space in this case in order to avoid defrag turning  
>> pathological.
>>
>> But, these are all some pathological cases. They should be  
>> considered in some other discussion.
>
> I know how to do this pathological case. Figured it out!
>
> Yeah, always ask General Zed, he knows the best!!!
>
> The move-in phase is not a problem, because this phase generally  
> affects a low number of files.
>
> So, let's consider the move-out phase. The main concern here is that  
> the move-out area may contain so many different files and fragments  
> that the move-out forces a practically undoable metadata update.
>
> So, the way to do it is to select files for move-out, one by one (or  
> even more granular, by fragments of files), while keeping track of  
> the size of the necessary metadata update. When the metadata update  
> exceeds a certain amount (let's say 128 MB, an amount that can  
> easily fit into RAM), the move-out is performed with only currently  
> selected files (file fragments). (The move-out often doesn't affect  
> a whole file since only a part of each file lies within the move-out  
> area).
>
> Now the defrag has to decide: whether to continue with another round  
> of the move-out to get a cleaner move-in area (by repeating the same  
> procedure above), or should it continue with a move-in into a  
> partially dirty area. I can't tell you what's better right now, as  
> this can be determined only by experiments.
>
> Lastly, the move-in phase is performed (can be done whether the  
> move-in area is dirty or completely clean). Again, the same trick  
> can be used: files can be selected one by one until the calculated  
> metadata update exceeds 128 MB. However, it is more likely that the  
> size of move-in area will be exhausted before this happens.
>
> This algorithm will work even if you have only 3% free disk space left.
>
> This algorithm will also work if you have metadata of huge size, but  
> in that case it is better to have much more free disk space (20%) to  
> avoid significantly slowing down the defrag operation.
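Here is a toy sketch of the selection loop described above: fragments
inside the move-out area are accumulated until the estimated metadata
update reaches 128 MB, then the batch is cut (the fragment list and the
per-fragment metadata costs are invented for illustration):

    /* Toy batching of the move-out phase: accumulate fragments until
     * the estimated dirtied metadata would exceed 128 MB, then emit the
     * batch.  All numbers are made up for the example. */
    #include <stdio.h>

    #define BATCH_LIMIT (128ULL << 20)      /* 128 MB of dirty metadata */

    struct fragment {
        unsigned long long data_len;        /* bytes of file data to move   */
        unsigned long long meta_cost;       /* est. metadata dirtied, bytes */
    };

    int main(void)
    {
        /* pretend these fragments lie inside the move-out area */
        struct fragment frags[6] = {
            { 512ULL << 20, 40ULL << 20 },
            { 128ULL << 20, 50ULL << 20 },
            {   4ULL << 20, 60ULL << 20 },  /* heavily reflinked: costly */
            { 900ULL << 20, 30ULL << 20 },
            {  64ULL << 20, 70ULL << 20 },
            {  16ULL << 20, 20ULL << 20 },
        };
        unsigned long long batch_meta = 0, batch_data = 0;
        int batch_no = 1;

        for (int i = 0; i < 6; i++) {
            if (batch_meta + frags[i].meta_cost > BATCH_LIMIT) {
                printf("move-out batch %d: %llu MB data, %llu MB metadata\n",
                       batch_no++, batch_data >> 20, batch_meta >> 20);
                batch_meta = batch_data = 0;
            }
            batch_meta += frags[i].meta_cost;
            batch_data += frags[i].data_len;
        }
        printf("move-out batch %d: %llu MB data, %llu MB metadata\n",
               batch_no, batch_data >> 20, batch_meta >> 20);
        return 0;
    }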

I have just thought out an even better algorithm than this one, which  
gets to the fully-defragged state faster and with a smaller number of  
disk writes. But I won't write it down unless someone says: thanks for  
your effort so far, General Zed, and could you please tell us about  
your great new defrag algorithm for low free-space conditions.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 11:09                                   ` Austin S. Hemmelgarn
@ 2019-09-13 17:20                                     ` General Zed
  2019-09-13 18:20                                       ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-13 17:20 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-12 18:57, General Zed wrote:
>>
>> Quoting Chris Murphy <lists@colorremedies.com>:
>>
>>> On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
>>>>
>>>>
>>>> Quoting Chris Murphy <lists@colorremedies.com>:
>>>>
>>>>> On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>>>>>>
>>>>>> It is normal and common for defrag operation to use some disk space
>>>>>> while it is running. I estimate that a reasonable limit would be to
>>>>>> use up to 1% of total partition size. So, if a partition size is 100
>>>>>> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
>>>>>
>>>>> The simplest case of a file with no shared extents, the minimum free
>>>>> space should be set to the potential maximum rewrite of the file, i.e.
>>>>> 100% of the file size. Since Btrfs is COW, the entire operation must
>>>>> succeed or fail, no possibility of an ambiguous in between state, and
>>>>> this does apply to defragment.
>>>>>
>>>>> So if you're defragging a 10GiB file, you need 10GiB minimum free
>>>>> space to COW those extents to a new, mostly contiguous, set of extents,
>>>>
>>>> False.
>>>>
>>>> You can defragment just 1 GB of that file, and then just write out to
>>>> disk (in new extents) an entire new version of b-trees.
>>>> Of course, you don't really need to do all that, as usually only a
>>>> small part of the b-trees need to be updated.
>>>
>>> The `-l` option allows the user to choose a maximum amount to
>>> defragment. Setting up a default defragment behavior that has a
>>> variable outcome is not idempotent and probably not a good idea.
>>
>> We are talking about a future, imagined defrag. It has no -l option  
>> yet, as we haven't discussed it yet.
>>
>>> As for kernel behavior, it presumably could defragment in portions,
>>> but it would have to completely update all affected metadata after
>>> each e.g. 1GiB section, translating into 10 separate rewrites of file
>>> metadata, all affected nodes, all the way up the tree to the super.
>>> There is no such thing as metadata overwrites in Btrfs. You're
>>> familiar with the wandering trees problem?
>>
>> No, but it doesn't matter.

> No, it does matter.  Each time you update metadata, you have to  
> update _the entire tree up to the tree root_.  Even if you batch  
> your updates, you still have to propagate the update all the way up  
> to the root of the tree.

Yes, you have to update ALL the way up to the root of the tree, but  
that is certainly not the ENTIRE b-tree. The "way up to the root" is  
just a tiny part of any large b-tree.

Therefore, a non-issue.

>> At worst, it just has to completely write-out "all metadata", all  
>> the way up to the super. It needs to be done just once, because  
>> what's the point of writing it 10 times over? Then, the super is  
>> updated as the final commit.
>>
>> On my computer the ENTIRE METADATA is 1 GB. That would be very  
>> tolerable and doable.

> You sound like you're dealing with a desktop use case.  It's not  
> unusual for very large arrays (double digit TB or larger) to have  
> metadata well into the hundreds of GB.  Hell, I've got a 200GB  
> volume with bunches of small files that's got almost 5GB of metadata  
> space used.

I just mentioned this "full-metadata-writeout" as some kind of  
imagined pathological case. This should never happen in a real defrag.  
A real defrag always updates only a small part of the metadata.

So, still no problem.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 17:20                                     ` General Zed
@ 2019-09-13 18:20                                       ` General Zed
  0 siblings, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-13 18:20 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS


Quoting General Zed <general-zed@zedlx.com>:

> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>> On 2019-09-12 18:57, General Zed wrote:
>>>
>>> Quoting Chris Murphy <lists@colorremedies.com>:
>>>
>>>> On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
>>>>>
>>>>>
>>>>> Quoting Chris Murphy <lists@colorremedies.com>:
>>>>>
>>>>>> On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>>>>>>>
>>>>>>> It is normal and common for defrag operation to use some disk space
>>>>>>> while it is running. I estimate that a reasonable limit would be to
>>>>>>> use up to 1% of total partition size. So, if a partition size is 100
>>>>>>> GB, the defrag can use 1 GB. Lets call this "defrag operation space".
>>>>>>
>>>>>> The simplest case of a file with no shared extents, the minimum free
>>>>>> space should be set to the potential maximum rewrite of the file, i.e.
>>>>>> 100% of the file size. Since Btrfs is COW, the entire operation must
>>>>>> succeed or fail, no possibility of an ambiguous in between state, and
>>>>>> this does apply to defragment.
>>>>>>
>>>>>> So if you're defragging a 10GiB file, you need 10GiB minimum free
>>>>>> space to COW those extents to a new, mostly contiguous, set of extents,
>>>>>
>>>>> False.
>>>>>
>>>>> You can defragment just 1 GB of that file, and then just write out to
>>>>> disk (in new extents) an entire new version of b-trees.
>>>>> Of course, you don't really need to do all that, as usually only a
>>>>> small part of the b-trees need to be updated.
>>>>
>>>> The `-l` option allows the user to choose a maximum amount to
>>>> defragment. Setting up a default defragment behavior that has a
>>>> variable outcome is not idempotent and probably not a good idea.
>>>
>>> We are talking about a future, imagined defrag. It has no -l  
>>> option yet, as we haven't discussed it yet.
>>>
>>>> As for kernel behavior, it presumably could defragment in portions,
>>>> but it would have to completely update all affected metadata after
>>>> each e.g. 1GiB section, translating into 10 separate rewrites of file
>>>> metadata, all affected nodes, all the way up the tree to the super.
>>>> There is no such thing as metadata overwrites in Btrfs. You're
>>>> familiar with the wandering trees problem?
>>>
>>> No, but it doesn't matter.
>
>> No, it does matter.  Each time you update metadata, you have to  
>> update _the entire tree up to the tree root_.  Even if you batch  
>> your updates, you still have to propagate the update all the way up  
>> to the root of the tree.
>
> Yes, you have to update ALL the way up to the root of the tree, but  
> that is certainly not the ENTIRE b-tree. The "way up to the root" is  
> just a small tiny part of any large b-tree.

Why are you posting these misleading statements? You certainly know  
that the metadata can be updated only partially.

Why did you say "_the entire tree up to the tree root_"? That sounds  
like you have to update the entire tree.

Why did you say "it does matter" when you know it doesn't matter?

> Therefore, a non-issue.
>
>>> At worst, it just has to completely write-out "all metadata", all  
>>> the way up to the super. It needs to be done just once, because  
>>> what's the point of writing it 10 times over? Then, the super is  
>>> updated as the final commit.
>>>
>>> On my computer the ENTIRE METADATA is 1 GB. That would be very  
>>> tolerable and doable.
>
>> You sound like you're dealing with a desktop use case.  It's not  
>> unusual for very large arrays (double digit TB or larger) to have  
>> metadata well into the hundreds of GB.  Hell, I've got a 200GB  
>> volume with bunches of small files that's got almost 5GB of  
>> metadata space used.

Why are you trying to imply that I'm just thinking of desktops? If I  
said that I have 1 GB of metadata on my computer, it doesn't mean that  
my defrag is supposed to work only on my computer and similar systems.

It appears as if you are relentless in your attempts to make me look  
wrong, to the point of trying to misuse and misinterpret every  
statement that I make.

> I just mentioned this "full-metadata-writeout" as some kind of  
> imagined pathological case. This should never happen in a real  
> defrag. A real defrag always updates only a small part of the  
> metadata.
>
> So, still no problem.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 16:54                                 ` General Zed
@ 2019-09-13 18:29                                   ` Austin S. Hemmelgarn
  2019-09-13 19:40                                     ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-13 18:29 UTC (permalink / raw)
  To: General Zed; +Cc: linux-btrfs

On 2019-09-13 12:54, General Zed wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
>> On 2019-09-12 18:21, General Zed wrote:
>>>
>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>
>>>> On 2019-09-12 15:18, webmaster@zedlx.com wrote:
>>>>>
>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>
>>>>>> On 2019-09-11 17:37, webmaster@zedlx.com wrote:
>>>>>>>
>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>
>>>>>>>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>>>>>>>
>>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>>
>>>>>>>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>>>>>>>
>>>>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>>>>
>>>>>>>
>>>>>>>>>> Given this, defrag isn't willfully unsharing anything, it's 
>>>>>>>>>> just a side-effect of how it works (since it's rewriting the 
>>>>>>>>>> block layout of the file in-place).
>>>>>>>>>
>>>>>>>>> The current defrag has to unshare because, as you said, because 
>>>>>>>>> it is unaware of the full reflink structure. If it doesn't know 
>>>>>>>>> about all reflinks, it has to unshare, there is no way around 
>>>>>>>>> that.
>>>>>>>>>
>>>>>>>>>> Now factor in that _any_ write will result in unsharing the 
>>>>>>>>>> region being written to, rounded to the nearest full 
>>>>>>>>>> filesystem block in both directions (this is mandatory, it's a 
>>>>>>>>>> side effect of the copy-on-write nature of BTRFS, and is why 
>>>>>>>>>> files that experience heavy internal rewrites get fragmented 
>>>>>>>>>> very heavily and very quickly on BTRFS).
>>>>>>>>>
>>>>>>>>> You mean: when defrag performs a write, the new data is 
>>>>>>>>> unshared because every write is unshared? Really?
>>>>>>>>>
>>>>>>>>> Consider there is an extent E55 shared by two files A and B. 
>>>>>>>>> The defrag has to move E55 to another location. In order to do 
>>>>>>>>> that, defrag creates a new extent E70. It makes it belong to 
>>>>>>>>> file A by changing the reflink of extent E55 in file A to point 
>>>>>>>>> to E70.
>>>>>>>>>
>>>>>>>>> Now, to retain the original sharing structure, the defrag has 
>>>>>>>>> to change the reflink of extent E55 in file B to point to E70. 
>>>>>>>>> You are telling me this is not possible? Bullshit!
>>>>>>>>>
>>>>>>>>> Please explain to me how this 'defrag has to unshare' story of 
>>>>>>>>> yours isn't an intentional attempt to mislead me.
>>>>>>>
>>>>>>>> As mentioned in the previous email, we actually did have a 
>>>>>>>> (mostly) working reflink-aware defrag a few years back.  It got 
>>>>>>>> removed because it had serious performance issues.  Note that 
>>>>>>>> we're not talking a few seconds of extra time to defrag a full 
>>>>>>>> tree here, we're talking double-digit _minutes_ of extra time to 
>>>>>>>> defrag a moderate sized (low triple digit GB) subvolume with 
>>>>>>>> dozens of snapshots, _if you were lucky_ (if you weren't, you 
>>>>>>>> would be looking at potentially multiple _hours_ of runtime for 
>>>>>>>> the defrag).  The performance scaled inversely proportionate to 
>>>>>>>> the number of reflinks involved and the total amount of data in 
>>>>>>>> the subvolume being defragmented, and was pretty bad even in the 
>>>>>>>> case of only a couple of snapshots.
>>>>>>>
>>>>>>> You cannot ever make the worst program, because an even worse 
>>>>>>> program can be made by slowing down the original by a factor of 2.
>>>>>>> So, you had a badly implemented defrag. At least you got some 
>>>>>>> experience. Let's see what went wrong.
>>>>>>>
>>>>>>>> Ultimately, there are a couple of issues at play here:
>>>>>>>>
>>>>>>>> * Online defrag has to maintain consistency during operation.  
>>>>>>>> The current implementation does this by rewriting the regions 
>>>>>>>> being defragmented (which causes them to become a single new 
>>>>>>>> extent (most of the time)), which avoids a whole lot of 
>>>>>>>> otherwise complicated logic required to make sure things happen 
>>>>>>>> correctly, and also means that only the file being operated on 
>>>>>>>> is impacted and only the parts being modified need to be 
>>>>>>>> protected against concurrent writes.  Properly handling reflinks 
>>>>>>>> means that _every_ file that shares some part of an extent with 
>>>>>>>> the file being operated on needs to have the reflinked regions 
>>>>>>>> locked for the defrag operation, which has a huge impact on 
>>>>>>>> performance. Using your example, the update to E55 in both files 
>>>>>>>> A and B has to happen as part of the same commit, which can 
>>>>>>>> contain no other writes in that region of the file, otherwise 
>>>>>>>> you run the risk of losing writes to file B that occur while 
>>>>>>>> file A is being defragmented.
>>>>>>>
>>>>>>> Nah. I think there is a workaround. You can first (atomically) 
>>>>>>> update A, then whatever, then you can update B later. I know, 
>>>>>>> you're yelling "what if E55 gets updated in B". Doesn't matter. The 
>>>>>>> defrag continues later by searching for reflink to E55 in B. Then 
>>>>>>> it checks the data contained in E55. If the data matches the E70, 
>>>>>>> then it can safely update the reflink in B. Or the defrag can 
>>>>>>> just verify that neither E55 nor E70 have been written to in the 
>>>>>>> meantime. That means they still have the same data.
>>>>>
>>>>>> So, IOW, you don't care if the total space used by the data is 
>>>>>> instantaneously larger than what you started with?  That seems to 
>>>>>> be at odds with your previous statements, but OK, if we allow for 
>>>>>> that then this is indeed a non-issue.
>>>>>
>>>>> It is normal and common for defrag operation to use some disk space 
>>>>> while it is running. I estimate that a reasonable limit would be to 
>>>>> use up to 1% of total partition size. So, if a partition size is 
>>>>> 100 GB, the defrag can use 1 GB. Lets call this "defrag operation 
>>>>> space".
>>>>>
>>>>> The defrag should, when started, verify that there is "sufficient 
>>>>> free space" on the partition. In the case that there is no 
>>>>> sufficient free space, the defrag should output the message to the 
>>>>> user and abort. The size of "sufficient free space" must be larger 
>>>>> than the "defrag operation space". I would estimate that a good 
>>>>> limit would be 2% of the partition size. "defrag operation space" 
>>>>> is a part of "sufficient free space" while defrag operation is in 
>>>>> progress.
>>>>>
>>>>> If, during defrag operation, sufficient free space drops below 2%, 
>>>>> the defrag should output a message and abort. Another possibility 
>>>>> is for defrag to pause until the user frees some disk space, but 
>>>>> this is not common in other defrag implementations AFAIK.
>>>>>
>>>>>>>> It's not horrible when it's just a small region in two files, 
>>>>>>>> but it becomes a big issue when dealing with lots of files 
>>>>>>>> and/or particularly large extents (extents in BTRFS can get into 
>>>>>>>> the GB range in terms of size when dealing with really big files).
>>>>>>>
>>>>>>> You must just split large extents in a smart way. So, in the 
>>>>>>> beginning, the defrag can split large extents (2GB) into smaller 
>>>>>>> ones (32MB) to facilitate more responsive and easier defrag.
>>>>>>>
>>>>>>> If you have lots of files, update them one-by one. It is 
>>>>>>> possible. Or you can update in big batches. Whatever is faster.
>>>>>
>>>>>> Neither will solve this though.  Large numbers of files are an 
>>>>>> issue because the operation is expensive and has to be done on 
>>>>>> each file, not because the number of files somehow makes the 
>>>>>> operation more expensive. It's O(n) relative to files, not higher 
>>>>>> time complexity.
>>>>>
>>>>> I would say that updating in big batches helps a lot, to the point 
>>>>> that it gets almost as fast as defragging any other file system. 
>>>>> What defrag needs to do is to write a big bunch of defragged file 
>>>>> (data) extents to the disk, and then update the b-trees. What 
>>>>> happens is that many of the updates to the b-trees would fall into 
>>>>> the same disk sector/extent, so instead of many writes there will 
>>>>> be just one write.
>>>>>
>>>>> Here is the general outline for implementation:
>>>>>     - write a big bunch of defragged file extents to disk
>>>>>         - a minimal set of updates of the b-trees that cannot be 
>>>>> delayed is performed (this is nothing or almost nothing in most 
>>>>> circumstances)
>>>>>         - put the rest of required updates of b-trees into "pending 
>>>>> operations buffer"
>>>>>     - analyze the "pending operations buffer", and find out 
>>>>> (approximately) the biggest part of it that can be flushed out by 
>>>>> doing minimal number of disk writes
>>>>>         - flush out that part of "pending operations buffer"
>>>>>     - repeat
>>>
>>>> It helps, but you still can't get around having to recompute the new 
>>>> tree state, and that is going to take time proportionate to the 
>>>> number of nodes that need to change, which in turn is proportionate 
>>>> to the number of files.
>>>
>>> Yes, but that is just a computation. The defrag performance mostly 
>>> depends on minimizing disk I/O operations, not on computations.
> 
>> You're assuming the defrag is being done on a system that's otherwise 
>> perfectly idle.  In the real world, that rarely, if ever, will be the 
>> case,  The system may be doing other things at the same time, and the 
>> more computation the defrag operation has to do, the more likely it is 
>> to negatively impact those other things.
> 
> No, I'm not assuming that the system is perfectly idle. I'm assuming 
> that the required computations don't take much CPU time, like it is 
> common in a well implemented defrag.
Which also usually doesn't have to do anywhere near as much computation 
as is needed here.
> 
>>> In the past many good and fast defrag computation algorithms have 
>>> been produced, and I don't see any reason why this project wouldn't 
>>> be also able to create such a good algorithm.
> 
>> Because it's not just the new extent locations you have to compute, 
>> you also need to compute the resultant metadata tree state, and the 
>> resultant extent tree state, and after all of that the resultant 
>> checksum tree state.  Yeah, figuring out optimal block layouts is 
>> solved, but you can't get around the overhead of recomputing the new 
>> tree state and all the block checksums for it.
>>
>> The current defrag has to deal with this too, but it doesn't need to 
>> do as much computation because it's not worried about preserving 
>> reflinks (and therefore defragmenting a single file won't require 
>> updates to any other files).
> 
> Yes, the defrag algorithm needs to compute the new tree state. However, 
> it shouldn't be slow at all. All operations on b-trees can be done in at 
> most N*logN time, which is sufficiently fast. There is no operation 
> there that I can think of that takes N*N or N*M time. So, it should all 
> take little CPU time. Essentially a non-issue.
> 
> The ONLY concern that causes N*M time is the presence of sharing. But, 
> even this is unfair, as the computation time will still be N*logN with 
> regards to the total number of reflinks. That is still fast, even for 
> 100 GB metadata with a billion reflinks.
> 
> I don't understand why do you think that recomputing the new tree state 
> must be slow. Even if there are a 100 new tree states that need to be 
> recomputed, there is still no problem. Each metadata update will change 
> only a small portion of b-trees, so the complexity and size of b-trees 
> should not seriously affect the computation time.
Well, let's start with the checksum computations, which need to happen 
for each block that would be written, and which can't be faster than 
O(n).

Yes, the structural overhead of the b-trees isn't bad by itself, but 
you have multiple trees that need to be updated in sequence (that is, 
you have to update one, then update the next based on that one, then 
update another based on both of the previous two, and so on). On top 
of that, a number of other bits of data need to be updated as part of 
the b-tree update, and those have worse time complexity than computing 
the structural changes to the b-trees themselves.
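Just to put a rough number on the checksum part (4 KiB blocks and 4-byte
crc32c entries match the btrfs defaults; the throughput figure is an
arbitrary assumption):

    /* Every data block a defrag rewrites gets a new checksum, so the
     * checksum work scales with the amount of data moved. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long moved  = 100ULL << 30;   /* defrag moves 100 GB */
        unsigned long long blocks = moved / 4096;   /* 4 KiB data blocks   */
        unsigned long long csum_bytes = blocks * 4; /* crc32c per block    */
        double gb_per_sec = 2.0;                    /* assumed crc32c rate */

        printf("%llu blocks to checksum, %llu MB of new csum items\n",
               blocks, csum_bytes >> 20);
        printf("~%.1f s of pure checksum CPU at %.1f GB/s\n",
               (double)moved / (gb_per_sec * (1ULL << 30)), gb_per_sec);
        return 0;
    }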
> 
>>>>>>> The point is that the defrag can keep a buffer of a "pending 
>>>>>>> operations". Pending operations are those that should be 
>>>>>>> performed in order to keep the original sharing structure. If the 
>>>>>>> defrag gets interrupted, then files in "pending operations" will 
>>>>>>> be unshared. But this should really be some important and urgent 
>>>>>>> interrupt, as the "pending operations" buffer needs at most a 
>>>>>>> second or two to complete its operations.
>>>>>
>>>>>> Depending on the exact situation, it can take well more than a few 
>>>>>> seconds to complete stuff. Especially if there are lots of reflinks.
>>>>>
>>>>> Nope. You are quite wrong there.
>>>>> In the worst case, the "pending operations buffer" will update 
>>>>> (write to disk) all the b-trees. So, the upper limit on time to 
>>>>> flush the "pending operations buffer" equals the time to write the 
>>>>> entire b-tree structure to the disk (into new extents). I estimate 
>>>>> that takes at most a few seconds.
>>>
>>>> So what you're talking about is journaling the computed state of 
>>>> defrag operations.  That shouldn't be too bad (as long as it's done 
>>>> in memory instead of on-disk) if you batch the computations 
>>>> properly.  I thought you meant having a buffer of what operations to 
>>>> do, and then computing them on-the-fly (which would have significant 
>>>> overhead)
>>>
>>> Looks close to what I was thinking. Soon we might be able to 
>>> communicate. I'm not sure what you mean by "journaling the computed 
>>> state of defrag operations". Maybe it doesn't matter.
> 
>> Essentially, doing a write-ahead log of pending operations.  
>> Journaling is just the common term for such things when dealing with 
>> Linux filesystems because of ext* and XFS.  Based on what you say 
>> below, it sounds like we're on the same page here other than the 
>> terminology.
>>>
>>> What happens is that file (extent) data is first written to disk 
>>> (defragmented), but b-tree is not immediately updated. It doesn't 
>>> have to be. Even if there is a power loss, nothing happens.
>>>
>>> So, the changes that should be done to the b-trees are put into 
>>> pending-operations-buffer. When a lot of file (extent) data is 
>>> written to disk, such that defrag-operation-space (1 GB) is close to 
>>> being exhausted, the pending-operations-buffer is examined in order 
>>> to attempt to free as much of defrag-operation-space as possible. The 
>>> simplest algorithm is to flush the entire pending-operations-buffer 
>>> at once. This reduces the number of writes that update the b-trees 
>>> because many changes to the b-trees fall into the same or 
>>> neighbouring disk sectors.
>>>
>>>>>>>> * Reflinks can reference partial extents.  This means, 
>>>>>>>> ultimately, that you may end up having to split extents in odd 
>>>>>>>> ways during defrag if you want to preserve reflinks, and might 
>>>>>>>> have to split extents _elsewhere_ that are only tangentially 
>>>>>>>> related to the region being defragmented. See the example in my 
>>>>>>>> previous email for a case like this, maintaining the shared 
>>>>>>>> regions as being shared when you defragment either file to a 
>>>>>>>> single extent will require splitting extents in the other file 
>>>>>>>> (in either case, whichever file you don't defragment to a single 
>>>>>>>> extent will end up having 7 extents if you try to force the one 
>>>>>>>> that's been defragmented to be the canonical version).  Once you 
>>>>>>>> consider that a given extent can have multiple ranges reflinked 
>>>>>>>> from multiple other locations, it gets even more complicated.
>>>>>>>
>>>>>>> I think that this problem can be solved, and that it can be 
>>>>>>> solved perfectly (the result is a perfectly-defragmented file). 
>>>>>>> But, if it is so hard to do, just skip those problematic extents 
>>>>>>> in initial version of defrag.
>>>>>>>
>>>>>>> Ultimately, in the super-duper defrag, those partially-referenced 
>>>>>>> extents should be split up by defrag.
>>>>>>>
>>>>>>>> * If you choose to just not handle the above point by not 
>>>>>>>> letting defrag split extents, you put a hard lower limit on the 
>>>>>>>> amount of fragmentation present in a file if you want to 
>>>>>>>> preserve reflinks.  IOW, you can't defragment files past a 
>>>>>>>> certain point.  If we go this way, neither of the two files in 
>>>>>>>> the example from my previous email could be defragmented any 
>>>>>>>> further than they already are, because doing so would require 
>>>>>>>> splitting extents.
>>>>>>>
>>>>>>> Oh, you're reading my thoughts. That's good.
>>>>>>>
>>>>>>> Initial implementation of defrag might be not-so-perfect. It 
>>>>>>> would still be better than the current defrag.
>>>>>>>
>>>>>>> This is not a one-way street. Handling of partially-used extents 
>>>>>>> can be improved in later versions.
>>>>>>>
>>>>>>>> * Determining all the reflinks to a given region of a given 
>>>>>>>> extent is not a cheap operation, and the information may 
>>>>>>>> immediately be stale (because an operation right after you fetch 
>>>>>>>> the info might change things).  We could work around this by 
>>>>>>>> locking the extent somehow, but doing so would be expensive 
>>>>>>>> because you would have to hold the lock for the entire defrag 
>>>>>>>> operation.
>>>>>>>
>>>>>>> No. DO NOT LOCK TO RETRIEVE REFLINKS.
>>>>>>>
>>>>>>> Instead, you have to create a hook in every function that updates 
>>>>>>> the reflink structure or extents (for example, write-to-file 
>>>>>>> operation). So, when a reflink gets changed, the defrag is 
>>>>>>> immediately notified about this. That way the defrag can keep its 
>>>>>>> data about reflinks in-sync with the filesystem.
>>>>>
>>>>>> This doesn't get around the fact that it's still an expensive 
>>>>>> operation to enumerate all the reflinks for a given region of a 
>>>>>> file or extent.
>>>>>
>>>>> No, you are wrong.
>>>>>
>>>>> In order to enumerate all the reflinks in a region, the defrag 
>>>>> needs to have another array, which is also kept in memory and in 
>>>>> sync with the filesystem. It is the easiest to divide the disk into 
>>>>> regions of equal size, where each region is a few MB large. Lets 
>>>>> call this array "regions-to-extents" array. This array doesn't need 
>>>>> to be associative, it is a plain array.
>>>>> This in-memory array links regions of disk to extents that are in 
>>>>> the region. The array is initialized when defrag starts. 
>>>>>
>>>>> This array makes the operation of finding all extents of a region 
>>>>> extremely fast.
>>>> That has two issues:
>>>>
>>>> * That's going to be a _lot_ of memory.  You still need to be able 
>>>> to defragment big (dozens plus TB) arrays without needing multiple 
>>>> GB of RAM just for the defrag operation, otherwise it's not 
>>>> realistically useful (remember, it was big arrays that had issues 
>>>> with the old reflink-aware defrag too).
>>>
>>> Ok, but let's get some calculations there. If regions are 4 MB in 
>>> size, the region-extents array for an 8 TB partition would have 2 
>>> million entries. If entries average 64 bytes, that would be:
>>>
>>>  - a total of 128 MB memory for an 8 TB partition.
>>>
>>> Of course, I'm guessing a lot of numbers there, but it should be doable.
> 
>> Even if we assume such an optimistic estimation as you provide (I 
>> suspect it will require more than 64 bytes per-entry), that's a lot of 
>> RAM when you look at what it's potentially displacing.  That's enough 
>> RAM for receive and transmit buffers for a few hundred thousand 
>> network connections, or for caching multiple hundreds of thousands of 
>> dentries, or a few hundred thousand inodes.  Hell, that's enough RAM 
>> to run all the standard network services for a small network (DHCP, 
>> DNS, NTP, TFTP, mDNS relay, UPnP/NAT-PMP, SNMP, IGMP proxy, VPN of 
>> your choice) at least twice over.
> 
> That depends on the average size of an extent. If the average size of an 
> extent is around 4 MB, than my numbers should be good. Do you have any 
> data which would suggest that my estimate is wrong? What's the average 
> size of an extent on your filesystems (used space divided by number of 
> extents)?
Depends on which filesystem.

Worst case I have (which I monitor regularly, so I actually have good 
aggregate data on the actual distribution of extent sizes) is used as 
backing storage for virtual machine disk images.  The arithmetic mean 
extent size is just barely over 32k, but the median is actually closer 
to 48k, with the 10th percentile at 4k and the 90th percentile at just 
over 2M.  From what I can tell, this is a pretty typical distribution 
for this type of usage (high frequency small writes internal to existing 
files) on BTRFS.

Typical usage on most of my systems when dealing with data sets that 
include reflinks shows a theoretical average extent size of about 1M, 
though I suspect the 50th percentile to be a little bit higher than 
that (I don't regularly check any of those, but the times I have, the 
50th percentile has been just a bit higher than the arithmetic mean, 
which makes sense given that I have a lot more small files than large 
ones).

It might be normal on some systems to have larger extents than this, but 
I somewhat doubt that that will be the case for many potential users.
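For anyone who wants to check their own numbers, here is a small
FIEMAP-based tool that prints the extent count and average extent size for
a single file. Keep in mind Zygo's earlier point that FIEMAP-based tools
can merge physically adjacent extents, so this will tend to overestimate
extent sizes on BTRFS:

    /* Per-file extent counter using the FIEMAP ioctl.
     * Build: cc -O2 avgextent.c  (file name is just an example). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        unsigned int batch = 4096;        /* extent records per ioctl call */
        size_t sz = sizeof(struct fiemap) + batch * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, sz);
        unsigned long long start = 0, bytes = 0, nextents = 0;
        int last = 0;

        while (!last) {
            memset(fm, 0, sz);
            fm->fm_start = start;
            fm->fm_length = ~0ULL - start;   /* "to end of file"        */
            fm->fm_flags = FIEMAP_FLAG_SYNC; /* flush delalloc first    */
            fm->fm_extent_count = batch;
            if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("FIEMAP");
                return 1;
            }
            if (fm->fm_mapped_extents == 0)
                break;
            for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *e = &fm->fm_extents[i];
                nextents++;
                bytes += e->fe_length;
                start = e->fe_logical + e->fe_length;
                if (e->fe_flags & FIEMAP_EXTENT_LAST)
                    last = 1;
            }
        }
        if (nextents)
            printf("%llu extents, %llu bytes mapped, average extent %llu KiB\n",
                   nextents, bytes, bytes / nextents / 1024);
        else
            printf("no extents mapped\n");
        free(fm);
        close(fd);
        return 0;
    }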
> This "regions-to-extents" array can be further optimized if necessary.
> 
> You are not thinking correctly there (misplaced priorities). If the 
> system needs to be defragmented, that's the priority. You can't do 
> comparisons like that, that's unfair debating.
> 
> The defrag that I'm proposing should be able to run within common memory 
> limits of today's computer systems. So, it will likely take somewhat 
> less than 700 MB of RAM in most common situations, including the small 
> servers. They all have 700 MB RAM.
> 
> 700 MB is a lot for a defrag, but there is no way around it. Btrfs is 
> simply a filesystem with such complexity that a good defrag requires a 
> lot of RAM to operate.
> 
> If, for some reason, you would like to cover a use-case with constrained 
> RAM conditions, then that is an entirely different concern for a 
> different project. You can't make a project like this to cover ALL the 
> possible circumstances. Some cases have to be left out. Here we are 
> talking about a defrag that is usable in a general and common set of 
> circumstances.
Memory constrained systems come up as a point of discussion pretty 
regularly when dealing with BTRFS, so they're obviously something users 
actually care about.  You have to keep in mind that it's not unusual in 
a consumer NAS system to have less than 4GB of RAM, but have arrays well 
into double digit terabytes in size.  Even 128MB of RAM needing to be 
used for defrag is trashing _a lot_ of cache on such a system.
> 
> Please, don't drop special circumstances argument on me. That's not fair.
> 
>>>> * You still have to populate the array in the first place.  A sane 
>>>> implementation wouldn't be keeping it in memory even when defrag is 
>>>> not running (no way is anybody going to tolerate even dozens of MB 
>>>> of memory overhead for this), so you're not going to get around the 
>>>> need to enumerate all the reflinks for a file at least once (during 
>>>> startup, or when starting to process that file), so you're just 
>>>> moving the overhead around instead of eliminating it.
>>>
>>> Yes, when the defrag starts, the entire b-tree structure is examined 
>>> in order for region-extents array and extents-backref associative 
>>> array to be populated.
> 
>> So your startup is going to take forever on any reasonably large 
>> volume.  This isn't eliminating the overhead, it's just moving it all 
>> to one place.  That might make it a bit more efficient than it would 
>> be interspersed throughout the operation, but only because it is 
>> reading all the relevant data at once.
> 
> No, the startup will not take forever.
Forever is subjective.  Arrays with hundreds of GB of metadata are not 
unusual, and that's dozens of minutes of just reading data right at the 
beginning before even considering what to defragment.

I would encourage you to take a closer look at some of the performance 
issues quota groups face when doing a rescan, as they have to deal with 
this kind of reflink tracking too, and they take quite a while on any 
reasonably sized volume.
> 
> The startup needs exactly 1 (one) pass through the entire metadata. It 
> needs this to find all the backlinks and to populate the 
> "regios-extents" array.  The time to do 1 pass through metadata depends 
> on the metadata size on disk, as entire metadata has to be read out (one 
> piece at a time, you won't keep it all in RAM). In most cases, the 
> time-to read the metadata will be less than 1 minute, on an SSD less 
> than 20 seconds.
> 
> There is no way around it: to defrag, you eventually need to read all 
> the b-trees, so nothing is lost there.
> 
> All computations in this defrag are simple. Finding all reflinks in 
> metadata is simple. It is a single pass metadata read-out.
> 
>>> Of course, those two arrays exist only during defrag operation. When 
>>> defrag completes, those arrays are deallocated.
>>>
>>>>>> It also allows a very real possibility of a user functionally 
>>>>>> delaying the defrag operation indefinitely (by triggering a 
>>>>>> continuous stream of operations that would cause reflink changes 
>>>>>> for a file being operated on by defrag) if not implemented very 
>>>>>> carefully.
>>>>>
>>>>> Yes, if a user does something like that, the defrag can be paused 
>>>>> or even aborted. That is normal.
>>>> Not really.  Most defrag implementations either avoid files that 
>>>> could reasonably be written to, or freeze writes to the file they're 
>>>> operating on, or in some other way just sidestep the issue without 
>>>> delaying the defragmentation process.
>>>>>
>>>>> There are many ways around this problem, but it really doesn't 
>>>>> matter, those are just details. The initial version of defrag can 
>>>>> just abort. The more mature versions of defrag can have a better 
>>>>> handling of this problem.
>>>
>>>> Details like this are the deciding factor for whether something is 
>>>> sanely usable in certain use cases, as you have yourself found out 
>>>> (for a lot of users, the fact that defrag can unshare extents is 
>>>> 'just a detail' that's not worth worrying about).
>>>
>>> I wouldn't agree there.
>>>
>>> Not every issue is equal. Some issues are more important, some are 
>>> trivial, some are tolerable etc...
>>>
>>> The defrag is usually allowed to abort. It can easily be restarted 
>>> later. Workaround: You can make a defrag-supervisor program, which 
>>> starts a defrag, and if defrag aborts then it is restarted after some 
>>> (configurable) amount of time.
> 
>> The fact that the defrag can be functionally deferred indefinitely by 
>> a user means that a user can, with a bit of effort, force degraded 
>> performance for everyone using the system.  Aborting the defrag 
>> doesn't solve that, and it's a significant issue for anybody doing 
>> shared hosting.
> 
> This is a quality-of-implementation issue. Not worthy of consideration 
> at this time. It can be solved.
Then solve it and be done with it, don't just punt it down the road. 
You're the one trying to convince the developers to spend _their_ time 
implementing _your_ idea, so you need to provide enough detail to solve 
issues that are brought up about your idea.
> 
> You can go and pick this kind of stuff all the time, with any system. I 
> mean, because of the FACT that we have never proven that all security 
> holes are eliminated, the computers shouldn't be powered on at all. 
> Therefore, all computers should be shut down immediately and then there 
> is absolutely no need to continue working on the btrfs. It is also 
> impossible to produce the btrfs defrag, because all computers have to be 
> shut down immediately.
> 
> Can we have a bit more fair discussion? Please?
I would ask the same. I provided a concrete example of a demonstrable 
security issue with your proposed implementation that's trivial to 
verify without even going beyond the described behavior of the 
implementation. You then dismissed it as a non-issue and tried to 
explain why my legitimate security concern wasn't even worth thinking 
about, using an apagogical argument that's only tangentially related 
to my statement.
> 
>>>
>>> On the other hand, unsharing is not easy to get undone.
>> But, again, this just doesn't matter for some people.
>>>
>>> So, those issues are not equals.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 18:29                                   ` Austin S. Hemmelgarn
@ 2019-09-13 19:40                                     ` General Zed
  2019-09-14 15:10                                       ` Jukka Larja
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-13 19:40 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2019-09-13 12:54, General Zed wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>>> On 2019-09-12 18:21, General Zed wrote:
>>>>
>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>
>>>>> On 2019-09-12 15:18, webmaster@zedlx.com wrote:
>>>>>>
>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>
>>>>>>> On 2019-09-11 17:37, webmaster@zedlx.com wrote:
>>>>>>>>
>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>
>>>>>>>>> On 2019-09-11 13:20, webmaster@zedlx.com wrote:
>>>>>>>>>>
>>>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>>>>> Given this, defrag isn't willfully unsharing anything,  
>>>>>>>>>>> it's just a side-effect of how it works (since it's  
>>>>>>>>>>> rewriting the block layout of the file in-place).
>>>>>>>>>>
>>>>>>>>>> The current defrag has to unshare because, as you said,  
>>>>>>>>>> because it is unaware of the full reflink structure. If it  
>>>>>>>>>> doesn't know about all reflinks, it has to unshare, there  
>>>>>>>>>> is no way around that.
>>>>>>>>>>
>>>>>>>>>>> Now factor in that _any_ write will result in unsharing  
>>>>>>>>>>> the region being written to, rounded to the nearest full  
>>>>>>>>>>> filesystem block in both directions (this is mandatory,  
>>>>>>>>>>> it's a side effect of the copy-on-write nature of BTRFS,  
>>>>>>>>>>> and is why files that experience heavy internal rewrites  
>>>>>>>>>>> get fragmented very heavily and very quickly on BTRFS).
>>>>>>>>>>
>>>>>>>>>> You mean: when defrag performs a write, the new data is  
>>>>>>>>>> unshared because every write is unshared? Really?
>>>>>>>>>>
>>>>>>>>>> Consider there is an extent E55 shared by two files A and  
>>>>>>>>>> B. The defrag has to move E55 to another location. In order  
>>>>>>>>>> to do that, defrag creates a new extent E70. It makes it  
>>>>>>>>>> belong to file A by changing the reflink of extent E55 in  
>>>>>>>>>> file A to point to E70.
>>>>>>>>>>
>>>>>>>>>> Now, to retain the original sharing structure, the defrag  
>>>>>>>>>> has to change the reflink of extent E55 in file B to point  
>>>>>>>>>> to E70. You are telling me this is not possible? Bullshit!
>>>>>>>>>>
>>>>>>>>>> Please explain to me how this 'defrag has to unshare' story  
>>>>>>>>>> of yours isn't an intentional attempt to mislead me.
>>>>>>>>
>>>>>>>>> As mentioned in the previous email, we actually did have a  
>>>>>>>>> (mostly) working reflink-aware defrag a few years back.  It  
>>>>>>>>> got removed because it had serious performance issues.  Note  
>>>>>>>>> that we're not talking a few seconds of extra time to defrag  
>>>>>>>>> a full tree here, we're talking double-digit _minutes_ of  
>>>>>>>>> extra time to defrag a moderate sized (low triple digit GB)  
>>>>>>>>> subvolume with dozens of snapshots, _if you were lucky_ (if  
>>>>>>>>> you weren't, you would be looking at potentially multiple  
>>>>>>>>> _hours_ of runtime for the defrag).  The performance scaled  
>>>>>>>>> inversely proportionate to the number of reflinks involved  
>>>>>>>>> and the total amount of data in the subvolume being  
>>>>>>>>> defragmented, and was pretty bad even in the case of only a  
>>>>>>>>> couple of snapshots.
>>>>>>>>
>>>>>>>> You cannot ever make the worst program, because an even worse  
>>>>>>>> program can be made by slowing down the original by a factor  
>>>>>>>> of 2.
>>>>>>>> So, you had a badly implemented defrag. At least you got some  
>>>>>>>> experience. Let's see what went wrong.
>>>>>>>>
>>>>>>>>> Ultimately, there are a couple of issues at play here:
>>>>>>>>>
>>>>>>>>> * Online defrag has to maintain consistency during  
>>>>>>>>> operation.  The current implementation does this by  
>>>>>>>>> rewriting the regions being defragmented (which causes them  
>>>>>>>>> to become a single new extent (most of the time)), which  
>>>>>>>>> avoids a whole lot of otherwise complicated logic required  
>>>>>>>>> to make sure things happen correctly, and also means that  
>>>>>>>>> only the file being operated on is impacted and only the  
>>>>>>>>> parts being modified need to be protected against concurrent  
>>>>>>>>> writes.  Properly handling reflinks means that _every_ file  
>>>>>>>>> that shares some part of an extent with the file being  
>>>>>>>>> operated on needs to have the reflinked regions locked for  
>>>>>>>>> the defrag operation, which has a huge impact on  
>>>>>>>>> performance. Using your example, the update to E55 in both  
>>>>>>>>> files A and B has to happen as part of the same commit,  
>>>>>>>>> which can contain no other writes in that region of the  
>>>>>>>>> file, otherwise you run the risk of losing writes to file B  
>>>>>>>>> that occur while file A is being defragmented.
>>>>>>>>
>>>>>>>> Nah. I think there is a workaround. You can first  
>>>>>>>> (atomically) update A, then whatever, then you can update B  
>>>>>>>> later. I know, you're yelling "what if E55 gets updated in B".  
>>>>>>>> Doesn't matter. The defrag continues later by searching for  
>>>>>>>> reflink to E55 in B. Then it checks the data contained in  
>>>>>>>> E55. If the data matches the E70, then it can safely update  
>>>>>>>> the reflink in B. Or the defrag can just verify that neither  
>>>>>>>> E55 nor E70 have been written to in the meantime. That means  
>>>>>>>> they still have the same data.
>>>>>>
>>>>>>> So, IOW, you don't care if the total space used by the data is  
>>>>>>> instantaneously larger than what you started with?  That seems  
>>>>>>> to be at odds with your previous statements, but OK, if we  
>>>>>>> allow for that then this is indeed a non-issue.
>>>>>>
>>>>>> It is normal and common for defrag operation to use some disk  
>>>>>> space while it is running. I estimate that a reasonable limit  
>>>>>> would be to use up to 1% of total partition size. So, if a  
>>>>>> partition size is 100 GB, the defrag can use 1 GB. Lets call  
>>>>>> this "defrag operation space".
>>>>>>
>>>>>> The defrag should, when started, verify that there is  
>>>>>> "sufficient free space" on the partition. In the case that  
>>>>>> there is no sufficient free space, the defrag should output the  
>>>>>> message to the user and abort. The size of "sufficient free  
>>>>>> space" must be larger than the "defrag operation space". I  
>>>>>> would estimate that a good limit would be 2% of the partition  
>>>>>> size. "defrag operation space" is a part of "sufficient free  
>>>>>> space" while defrag operation is in progress.
>>>>>>
>>>>>> If, during defrag operation, sufficient free space drops below  
>>>>>> 2%, the defrag should output a message and abort. Another  
>>>>>> possibility is for defrag to pause until the user frees some  
>>>>>> disk space, but this is not common in other defrag  
>>>>>> implementations AFAIK.
>>>>>>
>>>>>>>>> It's not horrible when it's just a small region in two  
>>>>>>>>> files, but it becomes a big issue when dealing with lots of  
>>>>>>>>> files and/or particularly large extents (extents in BTRFS  
>>>>>>>>> can get into the GB range in terms of size when dealing with  
>>>>>>>>> really big files).
>>>>>>>>
>>>>>>>> You must just split large extents in a smart way. So, in the  
>>>>>>>> beginning, the defrag can split large extents (2GB) into  
>>>>>>>> smaller ones (32MB) to facilitate more responsive and easier  
>>>>>>>> defrag.
>>>>>>>>
>>>>>>>> If you have lots of files, update them one-by one. It is  
>>>>>>>> possible. Or you can update in big batches. Whatever is faster.
>>>>>>
>>>>>>> Neither will solve this though.  Large numbers of files are an  
>>>>>>> issue because the operation is expensive and has to be done on  
>>>>>>> each file, not because the number of files somehow makes the  
>>>>>>> operation more expensive. It's O(n) relative to files, not  
>>>>>>> higher time complexity.
>>>>>>
>>>>>> I would say that updating in big batches helps a lot, to the  
>>>>>> point that it gets almost as fast as defragging any other file  
>>>>>> system. What defrag needs to do is to write a big bunch of  
>>>>>> defragged file (data) extents to the disk, and then update the  
>>>>>> b-trees. What happens is that many of the updates to the  
>>>>>> b-trees would fall into the same disk sector/extent, so instead  
>>>>>> of many writes there will be just one write.
>>>>>>
>>>>>> Here is the general outline for implementation:
>>>>>>     - write a big bunch of defragged file extents to disk
>>>>>>         - a minimal set of updates of the b-trees that cannot  
>>>>>> be delayed is performed (this is nothing or almost nothing in  
>>>>>> most circumstances)
>>>>>>         - put the rest of required updates of b-trees into  
>>>>>> "pending operations buffer"
>>>>>>     - analyze the "pending operations buffer", and find out  
>>>>>> (approximately) the biggest part of it that can be flushed out  
>>>>>> by doing minimal number of disk writes
>>>>>>         - flush out that part of "pending operations buffer"
>>>>>>     - repeat
>>>>
>>>>> It helps, but you still can't get around having to recompute the  
>>>>> new tree state, and that is going to take time proportionate to  
>>>>> the number of nodes that need to change, which in turn is  
>>>>> proportionate to the number of files.
>>>>
>>>> Yes, but that is just a computation. The defrag performance  
>>>> mostly depends on minimizing disk I/O operations, not on  
>>>> computations.
>>
>>> You're assuming the defrag is being done on a system that's  
>>> otherwise perfectly idle.  In the real world, that rarely, if  
>>> ever, will be the case,  The system may be doing other things at  
>>> the same time, and the more computation the defrag operation has  
>>> to do, the more likely it is to negatively impact those other  
>>> things.
>>
>> No, I'm not assuming that the system is perfectly idle. I'm  
>> assuming that the required computations don't take much CPU time,  
>> like it is common in a well implemented defrag.
> Which also usually doesn't have to do anywhere near as much  
> computation as is needed here.
>>
>>>> In the past many good and fast defrag computation algorithms have  
>>>> been produced, and I don't see any reason why this project  
>>>> wouldn't be also able to create such a good algorithm.
>>
>>> Because it's not just the new extent locations you have to  
>>> compute, you also need to compute the resultant metadata tree  
>>> state, and the resultant extent tree state, and after all of that  
>>> the resultant checksum tree state.  Yeah, figuring out optimal  
>>> block layouts is solved, but you can't get around the overhead of  
>>> recomputing the new tree state and all the block checksums for it.
>>>
>>> The current defrag has to deal with this too, but it doesn't need  
>>> to do as much computation because it's not worried about  
>>> preserving reflinks (and therefore defragmenting a single file  
>>> won't require updates to any other files).
>>
>> Yes, the defrag algorithm needs to compute the new tree state.  
>> However, it shouldn't be slow at all. All operations on b-trees can  
>> be done in at most N*logN time, which is sufficiently fast. There  
>> is no operation there that I can think of that takes N*N or N*M  
>> time. So, it should all take little CPU time. Essentially a  
>> non-issue.
>>
>> The ONLY concern that causes N*M time is the presence of sharing.  
>> But, even this is unfair, as the computation time will still be  
>> N*logN with regards to the total number of reflinks. That is still  
>> fast, even for 100 GB metadata with a billion reflinks.
>>
>> I don't understand why you think that recomputing the new tree  
>> state must be slow. Even if there are a 100 new tree states that  
>> need to be recomputed, there is still no problem. Each metadata  
>> update will change only a small portion of b-trees, so the  
>> complexity and size of b-trees should not seriously affect the  
>> computation time.

> Well, let's start with the checksum computations which then need to  
> happen for each block that would be written, which can't be faster  
> than O(n).
>
> Yes, the structural overhead of the b-trees isn't bad by itself, but  
> you have multiple trees that need to be updated in sequence (that  
> is, you have to update one, then update the next based on that one,  
> then update another based on both of the previous two, etc) and a  
> number

Bullshit. You are making it look as if this all cannot be calculated  
in advance. It can be, and then just a single (bigger, like 128 MB)  
metadata update is required. But this is still a small part of all  
metadata.

The only thing that needs to be performed in a separate operation is  
that final update of super, to commit the updates.
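
To make "calculated in advance" concrete, here is a minimal sketch (in
Python, with hypothetical helpers: page_of(), write_page() and
commit_super() are stand-ins, not btrfs code) of a pending-operations
buffer that groups item updates by the metadata page they would land
in, so that one flush writes each dirty page once and then commits the
super a single time:

from collections import defaultdict

def page_of(tree, key):
    # Hypothetical: map a (tree, key) pair to the metadata page holding it.
    return hash((tree, key)) % 1024

def write_page(tree, page, items):
    # Hypothetical: CoW-write one metadata page carrying all queued 'items'.
    pass

def commit_super():
    # Hypothetical: the single final commit that makes the batch visible.
    pass

class PendingOps:
    def __init__(self):
        # (tree, page) -> list of item updates that fall into that page
        self.dirty = defaultdict(list)

    def queue(self, tree, key, new_item):
        self.dirty[(tree, page_of(tree, key))].append((key, new_item))

    def flush(self):
        # One write per dirty page, no matter how many items changed in it.
        for (tree, page), items in self.dirty.items():
            write_page(tree, page, items)
        self.dirty.clear()
        commit_super()

ops = PendingOps()
ops.queue("extent_tree", ("E55", 0), "relocated item")
ops.flush()

Whether the extent, csum and free-space trees can really be batched
this aggressively is exactly what is being disputed in this thread; the
sketch only shows the shape of the bookkeeping, not a claim about btrfs
internals.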

> of other bits of data involved that need to be updated as part of  
> the b-tree update which have worse time complexity than computing  
> the structural changes to the b-trees.

As I said, bullshit.

>>>>>>>> The point is that the defrag can keep a buffer of a "pending  
>>>>>>>> operations". Pending operations are those that should be  
>>>>>>>> performed in order to keep the original sharing structure. If  
>>>>>>>> the defrag gets interrupted, then files in "pending  
>>>>>>>> operations" will be unshared. But this should really be some  
>>>>>>>> important and urgent interrupt, as the "pending operations"  
>>>>>>>> buffer needs at most a second or two to complete its  
>>>>>>>> operations.
>>>>>>
>>>>>>> Depending on the exact situation, it can take well more than a  
>>>>>>> few seconds to complete stuff. Especially if there are lots of  
>>>>>>> reflinks.
>>>>>>
>>>>>> Nope. You are quite wrong there.
>>>>>> In the worst case, the "pending operations buffer" will update  
>>>>>> (write to disk) all the b-trees. So, the upper limit on time to  
>>>>>> flush the "pending operations buffer" equals the time to write  
>>>>>> the entire b-tree structure to the disk (into new extents). I  
>>>>>> estimate that takes at most a few seconds.
>>>>
>>>>> So what you're talking about is journaling the computed state of  
>>>>> defrag operations.  That shouldn't be too bad (as long as it's  
>>>>> done in memory instead of on-disk) if you batch the computations  
>>>>> properly.  I thought you meant having a buffer of what  
>>>>> operations to do, and then computing them on-the-fly (which  
>>>>> would have significant overhead)
>>>>
>>>> Looks close to what I was thinking. Soon we might be able to  
>>>> communicate. I'm not sure what you mean by "journaling the  
>>>> computed state of defrag operations". Maybe it doesn't matter.
>>
>>> Essentially, doing a write-ahead log of pending operations.   
>>> Journaling is just the common term for such things when dealing  
>>> with Linux filesystems because of ext* and XFS.  Based on what you  
>>> say below, it sounds like we're on the same page here other than  
>>> the terminology.
>>>>
>>>> What happens is that file (extent) data is first written to disk  
>>>> (defragmented), but b-tree is not immediately updated. It doesn't  
>>>> have to be. Even if there is a power loss, nothing happens.
>>>>
>>>> So, the changes that should be done to the b-trees are put into  
>>>> pending-operations-buffer. When a lot of file (extent) data is  
>>>> written to disk, such that defrag-operation-space (1 GB) is close  
>>>> to being exhausted, the pending-operations-buffer is examined in  
>>>> order to attempt to free as much of defrag-operation-space as  
>>>> possible. The simplest algorithm is to flush the entire  
>>>> pending-operations-buffer at once. This reduces the number of  
>>>> writes that update the b-trees because many changes to the  
>>>> b-trees fall into the same or neighbouring disk sectors.
>>>>
>>>>>>>>> * Reflinks can reference partial extents.  This means,  
>>>>>>>>> ultimately, that you may end up having to split extents in  
>>>>>>>>> odd ways during defrag if you want to preserve reflinks, and  
>>>>>>>>> might have to split extents _elsewhere_ that are only  
>>>>>>>>> tangentially related to the region being defragmented. See  
>>>>>>>>> the example in my previous email for a case like this,  
>>>>>>>>> maintaining the shared regions as being shared when you  
>>>>>>>>> defragment either file to a single extent will require  
>>>>>>>>> splitting extents in the other file (in either case,  
>>>>>>>>> whichever file you don't defragment to a single extent will  
>>>>>>>>> end up having 7 extents if you try to force the one that's  
>>>>>>>>> been defragmented to be the canonical version).  Once you  
>>>>>>>>> consider that a given extent can have multiple ranges  
>>>>>>>>> reflinked from multiple other locations, it gets even more  
>>>>>>>>> complicated.
>>>>>>>>
>>>>>>>> I think that this problem can be solved, and that it can be  
>>>>>>>> solved perfectly (the result is a perfectly-defragmented  
>>>>>>>> file). But, if it is so hard to do, just skip those  
>>>>>>>> problematic extents in initial version of defrag.
>>>>>>>>
>>>>>>>> Ultimately, in the super-duper defrag, those  
>>>>>>>> partially-referenced extents should be split up by defrag.
>>>>>>>>
>>>>>>>>> * If you choose to just not handle the above point by not  
>>>>>>>>> letting defrag split extents, you put a hard lower limit on  
>>>>>>>>> the amount of fragmentation present in a file if you want to  
>>>>>>>>> preserve reflinks.  IOW, you can't defragment files past a  
>>>>>>>>> certain point.  If we go this way, neither of the two files  
>>>>>>>>> in the example from my previous email could be defragmented  
>>>>>>>>> any further than they already are, because doing so would  
>>>>>>>>> require splitting extents.
>>>>>>>>
>>>>>>>> Oh, you're reading my thoughts. That's good.
>>>>>>>>
>>>>>>>> Initial implementation of defrag might be not-so-perfect. It  
>>>>>>>> would still be better than the current defrag.
>>>>>>>>
>>>>>>>> This is not a one-way street. Handling of partially-used  
>>>>>>>> extents can be improved in later versions.
>>>>>>>>
>>>>>>>>> * Determining all the reflinks to a given region of a given  
>>>>>>>>> extent is not a cheap operation, and the information may  
>>>>>>>>> immediately be stale (because an operation right after you  
>>>>>>>>> fetch the info might change things).  We could work around  
>>>>>>>>> this by locking the extent somehow, but doing so would be  
>>>>>>>>> expensive because you would have to hold the lock for the  
>>>>>>>>> entire defrag operation.
>>>>>>>>
>>>>>>>> No. DO NOT LOCK TO RETRIEVE REFLINKS.
>>>>>>>>
>>>>>>>> Instead, you have to create a hook in every function that  
>>>>>>>> updates the reflink structure or extents (for exaple,  
>>>>>>>> write-to-file operation). So, when a reflink gets changed,  
>>>>>>>> the defrag is immediately notified about this. That way the  
>>>>>>>> defrag can keep its data about reflinks in-sync with the  
>>>>>>>> filesystem.
>>>>>>
>>>>>>> This doesn't get around the fact that it's still an expensive  
>>>>>>> operation to enumerate all the reflinks for a given region of  
>>>>>>> a file or extent.
>>>>>>
>>>>>> No, you are wrong.
>>>>>>
>>>>>> In order to enumerate all the reflinks in a region, the defrag  
>>>>>> needs to have another array, which is also kept in memory and  
>>>>>> in sync with the filesystem. It is the easiest to divide the  
>>>>>> disk into regions of equal size, where each region is a few MB  
>>>>>> large. Lets call this array "regions-to-extents" array. This  
>>>>>> array doesn't need to be associative, it is a plain array.
>>>>>> This in-memory array links regions of disk to extents that are  
>>>>>> in the region. The array in initialized when defrag starts.
>>>>>>
>>>>>> This array makes the operation of finding all extents of a  
>>>>>> region extremely fast.
>>>>> That has two issues:
>>>>>
>>>>> * That's going to be a _lot_ of memory.  You still need to be  
>>>>> able to defragment big (dozens plus TB) arrays without needing  
>>>>> multiple GB of RAM just for the defrag operation, otherwise it's  
>>>>> not realistically useful (remember, it was big arrays that had  
>>>>> issues with the old reflink-aware defrag too).
>>>>
>>>> Ok, but let's get some calculations there. If regions are 4 MB in  
>>>> size, the region-extents array for an 8 TB partition would have 2  
>>>> million entries. If entries average 64 bytes, that would be:
>>>>
>>>>  - a total of 128 MB memory for an 8 TB partition.
>>>>
>>>> Of course, I'm guessing a lot of numbers there, but it should be doable.
>>
>>> Even if we assume such an optimistic estimation as you provide (I  
>>> suspect it will require more than 64 bytes per-entry), that's a  
>>> lot of RAM when you look at what it's potentially displacing.   
>>> That's enough RAM for receive and transmit buffers for a few  
>>> hundred thousand network connections, or for caching multiple  
>>> hundreds of thousands of dentries, or a few hundred thousand  
>>> inodes.  Hell, that's enough RAM to run all the standard network  
>>> services for a small network (DHCP, DNS, NTP, TFTP, mDNS relay,  
>>> UPnP/NAT-PMP, SNMP, IGMP proxy, VPN of your choice) at least twice  
>>> over.
>>
>> That depends on the average size of an extent. If the average size  
>> of an extent is around 4 MB, then my numbers should be good. Do you  
>> have any data which would suggest that my estimate is wrong? What's  
>> the average size of an extent on your filesystems (used space  
>> divided by number of extents)?
> Depends on what filesystem.
>
> Worst case I have (which I monitor regularly, so I actually have  
> good aggregate data on the actual distribution of extent sizes) is  
> used for backed storage for virtual machine disk images.  The  
> arithmetic mean extent size is just barely over 32k, but the median  
> is actually closer to 48k, with the 10th percentile at 4k and the  
> 90th percentile at just over 2M.  From what I can tell, this is a  
> pretty typical distribution for this type of usage (high frequency  
> small writes internal to existing files) on BTRFS.

Ok, this is fair. If you have such a system, you might need more RAM  
in order to defrag. The amount of RAM required (for regions-to-extents  
array) depends on the total number of extents, since all extent IDs have  
to be placed in memory. You'll probably need less than 32 bytes per  
extent.

So, in order for the regions-to-extents array to occupy 1 GB of memory,  
you would need at least 32 million extents. Since the minimum size for  
an extent is 4K, that means at least a 128 GB disk partition with the  
worst possible fragmentation.

So in reality, unlikely.

If you really have such strange systems, then in order to defragment you  
just need to buy more RAM. But almost no one will experience a case  
like the one you are describing.

> Typical usage on most of my systems when dealing with data sets that  
> include reflinks shows a theoretical average extent size of about  
> 1M, though I suspect the 50th percentile to be a little bit higher  
> than that (I don't regularly check any of those, but the times I  
> have checked, the 50th percentile has been just a bit higher than the arithmetic  
> mean, which makes sense given that I have a lot more small files  
> than large ones).

Great data. In that case you would have 1 million extents per 1 TB of  
disk space used. Taking the max 32 bytes per extent value, this  
"typical usage scenario" that you describe requires less than 32 MB  
for regions-to-extents array per 1 TB of disk space used.

In a previous post I said 128 MB or less per typical disk partition.  
That 128 MB value would indicate 4 TB of disk space used. Obviously, I  
wasn't thinking of typical desktops.

> It might be normal on some systems to have larger extents than this,  
> but I somewhat doubt that that will be the case for many potential  
> users.

I think I managed to cover 99.9% of the users with the 32 MB RAM value  
per 1 TB disk space used.
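
For reference, here is the arithmetic behind these estimates as a tiny
Python sketch, so the assumptions (32 bytes per extent entry, and the
average extent sizes quoted above in this thread) are explicit and easy
to vary:

# Sizing sketch for the proposed regions-to-extents array.
# 32 bytes per extent is the assumption used in this thread, not a
# measured btrfs figure.

MB, GB, TB = 1024 ** 2, 1024 ** 3, 1024 ** 4

def array_bytes(used_bytes, avg_extent_bytes, bytes_per_entry=32):
    return (used_bytes / avg_extent_bytes) * bytes_per_entry

print(array_bytes(1 * TB, 1 * MB) / MB)      # "typical" 1 MB extents: ~32 MB per TB used
print(array_bytes(1 * TB, 32 * 1024) / GB)   # VM-image case, ~32 KB mean: ~1 GB per TB used
print(array_bytes(128 * GB, 4 * 1024) / GB)  # every extent 4 KB: ~1 GB for 128 GB used

The middle line is why the VM-image workload reported above is the
painful one: at a 32 KB mean extent size the array grows to roughly
1 GB per TB of data.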

>> This "regions-to-extents" array can be further optimized if necessary.
>>
>> You are not thinking correctly there (misplaced priorities). If the  
>> system needs to be defragmented, that's the priority. You can't do  
>> comparisons like that, that's unfair debating.
>>
>> The defrag that I'm proposing should be able to run within common  
>> memory limits of today's computer systems. So, it will likely take  
>> somewhat less than 700 MB of RAM in most common situations,  
>> including the small servers. They all have 700 MB RAM.
>>
>> 700 MB is a lot for a defrag, but there is no way around it. Btrfs  
>> is simply a filesystem with such complexity that a good defrag  
>> requires a lot of RAM to operate.
>>
>> If, for some reason, you would like to cover a use-case with  
>> constrained RAM conditions, then that is an entirely different  
>> concern for a different project. You can't make a project like this  
>> to cover ALL the possible circumstances. Some cases have to be left  
>> out. Here we are talking about a defrag that is usable in a general  
>> and common set of circumstances.

> Memory constrained systems come up as a point of discussion pretty  
> regularly when dealing with BTRFS, so they're obviously something  
> users actually care about.  You have to keep in mind that it's not  
> unusual in a consumer NAS system to have less than 4GB of RAM, but  
> have arrays well into double digit terabytes in size.  Even 128MB of  
> RAM needing to be used for defrag is trashing _a lot_ of cache on  
> such a system.

On a typical NAS, if we take at most 50 TB of disk size and a 1 MB  
average extent size, that would end up as 1.6 GB of RAM required for  
the regions-to-extents array. So, well within limits.

Also, on a 50 TB NAS, you will have a much bigger average extent size  
than 1 MB.

Also, the user of such a NAS should have split it up into multiple  
partitions if he wants to have many small files, little RAM, and a  
working defrag. So, just don't make the entire 50 TB NAS one huge  
partition on a 2 GB RAM machine and you are OK. Or don't use defrag.  
Or don't use btrfs; use ext4 and its defrag.

>> Please, don't drop special circumstances argument on me. That's not fair.

Yeah, you constantly keep inventing those strange little cases  
which are actually all easily solvable if you just gave it a little  
thought. But NO, you have to write them here because you have a NEED  
to prove that my defrag idea can't work.

So that you can excuse yourself and the previous defrag project that  
failed. Yes, make it LOOK LIKE it can't be done any better than what  
you already have, in order to prove that nothing needs to change.

>>>>> * You still have to populate the array in the first place.  A  
>>>>> sane implementation wouldn't be keeping it in memory even when  
>>>>> defrag is not running (no way is anybody going to tolerate even  
>>>>> dozens of MB of memory overhead for this), so you're not going  
>>>>> to get around the need to enumerate all the reflinks for a file  
>>>>> at least once (during startup, or when starting to process that  
>>>>> file), so you're just moving the overhead around instead of  
>>>>> eliminating it.
>>>>
>>>> Yes, when the defrag starts, the entire b-tree structure is  
>>>> examined in order for region-extents array and extents-backref  
>>>> associative array to be populated.
>>
>>> So your startup is going to take forever on any reasonably large  
>>> volume.  This isn't eliminating the overhead, it's just moving it  
>>> all to one place.  That might make it a bit more efficient than it  
>>> would be interspersed throughout the operation, but only because  
>>> it is reading all the relevant data at once.
>>
>> No, the startup will not take forever.

> Forever is subjective.  Arrays with hundreds of GB of metadata are  
> not unusual, and that's dozens of minutes of just reading data right  
> at the beginning before even considering what to defragment.

If you want to perform a defrag, you need to read all the metadata. There  
is no way around it. Every good defrag will have this same problem.

If you have such huge arrays, you should accept slower defrag initialization.
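
As a rough sanity check, both positions follow from the same formula:
startup time is roughly metadata size divided by effective read
throughput; they just assume very different metadata sizes and media.
A tiny sketch (the throughput figures below are assumptions, not
benchmarks):

MB, GB = 1024 ** 2, 1024 ** 3

def startup_seconds(metadata_bytes, read_bytes_per_second):
    # One pass over all metadata to build the in-memory arrays.
    return metadata_bytes / read_bytes_per_second

print(startup_seconds(1 * GB, 400 * MB))    # ~2.5 s: small filesystem on an SSD
print(startup_seconds(10 * GB, 400 * MB))   # ~26 s
print(startup_seconds(150 * GB, 100 * MB))  # ~26 min: large array on spinning disks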

> I would encourage you to take a closer look at some of the  
> performance issues quota groups face when doing a rescan, as they  
> have to deal with this kind of reflink tracking too, and they take  
> quite a while on any reasonably sized volume.

NO. Irrelevant.

Who knows what the implementation and its constraints are. Qgroups have to  
work with much stricter memory constraints than defrag (because  
qgroup processing is enabled ALL the time, while defrag only  
runs occasionally).

>> The startup needs exactly 1 (one) pass through the entire metadata.  
>> It needs this to find all the backlinks and to populate the  
>> "regios-extents" array.  The time to do 1 pass through metadata  
>> depends on the metadata size on disk, as entire metadata has to be  
>> read out (one piece at a time, you won't keep it all in RAM). In  
>> most cases, the time to read the metadata will be less than 1  
>> minute, on an SSD less than 20 seconds.
>>
>> There is no way around it: to defrag, you eventually need to read  
>> all the b-trees, so nothing is lost there.
>>
>> All computations in this defrag are simple. Finding all reflinks in  
>> metadata is simple. It is a single pass metadata read-out.
>>
>>>> Of course, those two arrays exist only during defrag operation.  
>>>> When defrag completes, those arrays are deallocated.
>>>>
>>>>>>> It also allows a very real possibility of a user functionally  
>>>>>>> delaying the defrag operation indefinitely (by triggering a  
>>>>>>> continuous stream of operations that would cause reflink  
>>>>>>> changes for a file being operated on by defrag) if not  
>>>>>>> implemented very carefully.
>>>>>>
>>>>>> Yes, if a user does something like that, the defrag can be  
>>>>>> paused or even aborted. That is normal.
>>>>> Not really.  Most defrag implementations either avoid files that  
>>>>> could reasonably be written to, or freeze writes to the file  
>>>>> they're operating on, or in some other way just sidestep the  
>>>>> issue without delaying the defragmentation process.
>>>>>>
>>>>>> There are many ways around this problem, but it really doesn't  
>>>>>> matter, those are just details. The initial version of defrag  
>>>>>> can just abort. The more mature versions of defrag can have a  
>>>>>> better handling of this problem.
>>>>
>>>>> Details like this are the deciding factor for whether something  
>>>>> is sanely usable in certain use cases, as you have yourself  
>>>>> found out (for a lot of users, the fact that defrag can unshare  
>>>>> extents is 'just a detail' that's not worth worrying about).
>>>>
>>>> I wouldn't agree there.
>>>>
>>>> Not every issue is equal. Some issues are more important, some  
>>>> are trivial, some are tolerable etc...
>>>>
>>>> The defrag is usually allowed to abort. It can easily be  
>>>> restarted later. Workaround: You can make a defrag-supervisor  
>>>> program, which starts a defrag, and if defrag aborts then it is  
>>>> restarted after some (configurable) amount of time.
>>
>>> The fact that the defrag can be functionally deferred indefinitely  
>>> by a user means that a user can, with a bit of effort, force  
>>> degraded performance for everyone using the system.  Aborting the  
>>> defrag doesn't solve that, and it's a significant issue for  
>>> anybody doing shared hosting.
>>
>> This is a quality-of-implementation issue. Not worthy of  
>> consideration at this time. It can be solved.
> Then solve it and be done with it, don't just punt it down the road.

> You're the one trying to convince the developers to spend _their_  
> time implementing _your_ idea, so you need to provide enough detail  
> to solve issues that are brought up about your idea.

Yes, but I don't have to answer ridiculous issues. It doesn't  
matter if you or anyone here is working pro bono; you should still try  
to post only relevant remarks.

Yes, I'm trying to convince developers here, and you are free to  
discard my idea if you like. I'm doing my best to help in the best way  
I can. I think everyone here can see that I am versed in programming,  
and very proficient in solving the high-level issues.

So, if you truly care about this project, you should at least  
consider the things that I'm saying here. But consider them seriously;  
don't throw it all out because of the "100 TB home NAS with 1 GB RAM" case  
which somehow suddenly becomes all-important, at the same time  
forgetting the millions of other users who are in real need of a  
good defrag for btrfs.

>> You can go and pick this kind of stuff all the time, with any  
>> system. I mean, because of the FACT that we have never proven that  
>> all security holes are eliminated, the computers shouldn't be  
>> powered on at all. Therefore, all computers should be shut down  
>> immediately and then there is absolutely no need to continue  
>> working on the btrfs. It is also impossible to produce the btrfs  
>> defrag, because all computers have to be shut down immediately.
>>
>> Can we have a bit more fair discussion? Please?

> I would ask the same. I provided a concrete example of a  
> demonstrable security issue with your proposed implementation that's  
> trivial to verify without even going beyond the described behavior  
> of the implementation. You then dismissed it as a non-issue and  
> tried to explain why my legitimate security concern wasn't even  
> worth thinking about, using an apagogical argument that's only  
> tangentially related to my statement.

I didn't dismiss it as a non-issue.  I said:
"This is a quality-of-implementation issue. Not worthy of  
consideration at this time. It can be solved."

I said NOT WORTHY OF CONSIDERATION at this time, not a NON-ISSUE.

It is not worthy of consideration because there are obvious, multiple  
ways to solve it, but there are many separate cases, and I don't want  
to continue this discussion by listing 10 cases and a solution for  
each one separately. Then you are going to raise reservations about  
each of my solutions, and so on, but IT ALL DOESN'T MATTER because it  
is immediately obvious that it can be eventually solved.

I can provide you with a concrete example of "Spectre" security  
issues. Or an example of bad process isolation on Linux desktops.  
Should that make users stop using Linux desktops? NO, because the  
issue is eventually solvable and temporarily tolerable with a few  
simple workarounds.

And the same is valid for the issue that you highlighted.

Yeah, and I have "LEGITIMATE" security concerns that I'm being followed  
by the KGB and the CIA, since "DEMONSTRABLY" most computers and  
electronics are insecure. Therefore, I should kill myself.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 11:04                                     ` Austin S. Hemmelgarn
@ 2019-09-13 20:43                                       ` Zygo Blaxell
  2019-09-14  0:20                                         ` General Zed
  2019-09-14 18:29                                       ` Chris Murphy
  1 sibling, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-13 20:43 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: General Zed, Chris Murphy, Btrfs BTRFS

On Fri, Sep 13, 2019 at 07:04:28AM -0400, Austin S. Hemmelgarn wrote:
> On 2019-09-12 19:54, Zygo Blaxell wrote:
> > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
> > > 
> > > Quoting Chris Murphy <lists@colorremedies.com>:
> > > 
> > > > On Thu, Sep 12, 2019 at 3:34 PM General Zed <general-zed@zedlx.com> wrote:
> > > > > 
> > > > > 
> > > > > Quoting Chris Murphy <lists@colorremedies.com>:
> > > > > 
> > > > > > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
> > > > > > > 
> > > > > > > It is normal and common for defrag operation to use some disk space
> > > > > > > while it is running. I estimate that a reasonable limit would be to
> > > > > > > use up to 1% of total partition size. So, if a partition size is 100
> > > > > > > GB, the defrag can use 1 GB. Lets call this "defrag operation space".
> > > > > > 
> > > > > > The simplest case of a file with no shared extents, the minimum free
> > > > > > space should be set to the potential maximum rewrite of the file, i.e.
> > > > > > 100% of the file size. Since Btrfs is COW, the entire operation must
> > > > > > succeed or fail, no possibility of an ambiguous in between state, and
> > > > > > this does apply to defragment.
> > > > > > 
> > > > > > So if you're defragging a 10GiB file, you need 10GiB minimum free
> > > > > > space to COW those extents to a new, mostly contiguous, set of exents,
> > > > > 
> > > > > False.
> > > > > 
> > > > > You can defragment just 1 GB of that file, and then just write out to
> > > > > disk (in new extents) an entire new version of b-trees.
> > > > > Of course, you don't really need to do all that, as usually only a
> > > > > small part of the b-trees need to be updated.
> > > > 
> > > > The `-l` option allows the user to choose a maximum amount to
> > > > defragment. Setting up a default defragment behavior that has a
> > > > variable outcome is not idempotent and probably not a good idea.
> > > 
> > > We are talking about a future, imagined defrag. It has no -l option yet, as
> > > we haven't discussed it yet.
> > > 
> > > > As for kernel behavior, it presumably could defragment in portions,
> > > > but it would have to completely update all affected metadata after
> > > > each e.g. 1GiB section, translating into 10 separate rewrites of file
> > > > metadata, all affected nodes, all the way up the tree to the super.
> > > > There is no such thing as metadata overwrites in Btrfs. You're
> > > > familiar with the wandering trees problem?
> > > 
> > > No, but it doesn't matter.
> > > 
> > > At worst, it just has to completely write-out "all metadata", all the way up
> > > to the super. It needs to be done just once, because what's the point of
> > > writing it 10 times over? Then, the super is updated as the final commit.
> > 
> > This is kind of a silly discussion.  The biggest extent possible on
> > btrfs is 128MB, and the incremental gains of forcing 128MB extents to
> > be consecutive are negligible.  If you're defragging a 10GB file, you're
> > just going to end up doing 80 separate defrag operations.

> Do you have a source for this claim of a 128MB max extent size?  

	~/linux$ git grep BTRFS.*MAX.*EXTENT
	fs/btrfs/ctree.h:#define BTRFS_MAX_EXTENT_SIZE SZ_128M

Plus years of watching bees logs scroll by, which never have an extent
above 128M in size that contains data.

I think there are a couple of exceptions for non-data-block extent items
like holes.  A hole extent item doesn't have any physical location on
disk, so its size field can be any 64-bit integer.  btrfs imposes no
restriction there.

PREALLOC extents are half hole, half nodatacow extent.  They can be
larger than 128M when they are empty, but when data is written to them,
they are replaced only in 128M chunks.

> Because
> everything I've seen indicates the max extent size is a full data chunk (so
> 1GB for the common case, potentially up to about 5GB for really big
> filesystems)

If what you've seen so far is 'filefrag -v' output (or any tool based
on the FIEMAP ioctl), then you are seeing post-processed extent sizes
(where adjacent extents with begin[n+1] == end[n] are coalesced for
human consumption), not true on-disk and in-metadata sizes.  FIEMAP is
slow and full of lies.
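
A toy illustration of that post-processing (not the FIEMAP code itself):
physically adjacent extents are merged before the user ever sees them,
so FIEMAP-based tools undercount the real on-disk extent records:

def coalesce(extents):
    # extents: list of (physical_begin, physical_end), in file order
    merged = []
    for begin, end in extents:
        if merged and merged[-1][1] == begin:   # begin[n+1] == end[n]
            merged[-1] = (merged[-1][0], end)   # fold into the previous extent
        else:
            merged.append((begin, end))
    return merged

# eight real 128 MiB extents laid out back to back on disk (units: MiB)
raw = [(i * 128, (i + 1) * 128) for i in range(8)]
print(len(raw), len(coalesce(raw)))   # 8 on-disk extents, reported as 1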

> > 128MB is big enough you're going to be seeking in the middle of reading
> > an extent anyway.  Once you have the file arranged in 128MB contiguous
> > fragments (or even a tenth of that on medium-fast spinning drives),
> > the job is done.
> > 
> > > On my computer the ENTIRE METADATA is 1 GB. That would be very tolerable and
> > > doable.
> > 
> > You must have a small filesystem...mine range from 16 to 156GB, a bit too
> > big to fit in RAM comfortably.
> > 
> > Don't forget you have to write new checksum and free space tree pages.
> > In the worst case, you'll need about 1GB of new metadata pages for each
> > 128MB you defrag (though you get to delete 99.5% of them immediately
> > after).
> > 
> > > But that is a very bad case, because usually not much metadata has to be
> > > updated or written out to disk.
> > > 
> > > So, there is no problem.
> > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 20:43                                       ` Zygo Blaxell
@ 2019-09-14  0:20                                         ` General Zed
  0 siblings, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-14  0:20 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Austin S. Hemmelgarn, Chris Murphy, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Fri, Sep 13, 2019 at 07:04:28AM -0400, Austin S. Hemmelgarn wrote:
>> On 2019-09-12 19:54, Zygo Blaxell wrote:
>> > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>> > >
>> > > Quoting Chris Murphy <lists@colorremedies.com>:
>> > >
>> > > > On Thu, Sep 12, 2019 at 3:34 PM General Zed  
>> <general-zed@zedlx.com> wrote:
>> > > > >
>> > > > >
>> > > > > Quoting Chris Murphy <lists@colorremedies.com>:
>> > > > >
>> > > > > > On Thu, Sep 12, 2019 at 1:18 PM <webmaster@zedlx.com> wrote:
>> > > > > > >
>> > > > > > > It is normal and common for defrag operation to use  
>> some disk space
>> > > > > > > while it is running. I estimate that a reasonable limit  
>> would be to
>> > > > > > > use up to 1% of total partition size. So, if a  
>> partition size is 100
>> > > > > > > GB, the defrag can use 1 GB. Lets call this "defrag  
>> operation space".
>> > > > > >
>> > > > > > The simplest case of a file with no shared extents, the  
>> minimum free
>> > > > > > space should be set to the potential maximum rewrite of  
>> the file, i.e.
>> > > > > > 100% of the file size. Since Btrfs is COW, the entire  
>> operation must
>> > > > > > succeed or fail, no possibility of an ambiguous in  
>> between state, and
>> > > > > > this does apply to defragment.
>> > > > > >
>> > > > > > So if you're defragging a 10GiB file, you need 10GiB minimum free
>> > > > > > space to COW those extents to a new, mostly contiguous,  
>> set of exents,
>> > > > >
>> > > > > False.
>> > > > >
>> > > > > You can defragment just 1 GB of that file, and then just  
>> write out to
>> > > > > disk (in new extents) an entire new version of b-trees.
>> > > > > Of course, you don't really need to do all that, as usually only a
>> > > > > small part of the b-trees need to be updated.
>> > > >
>> > > > The `-l` option allows the user to choose a maximum amount to
>> > > > defragment. Setting up a default defragment behavior that has a
>> > > > variable outcome is not idempotent and probably not a good idea.
>> > >
>> > > We are talking about a future, imagined defrag. It has no -l  
>> option yet, as
>> > > we haven't discussed it yet.
>> > >
>> > > > As for kernel behavior, it presumably could defragment in portions,
>> > > > but it would have to completely update all affected metadata after
>> > > > each e.g. 1GiB section, translating into 10 separate rewrites of file
>> > > > metadata, all affected nodes, all the way up the tree to the super.
>> > > > There is no such thing as metadata overwrites in Btrfs. You're
>> > > > familiar with the wandering trees problem?
>> > >
>> > > No, but it doesn't matter.
>> > >
>> > > At worst, it just has to completely write-out "all metadata",  
>> all the way up
>> > > to the super. It needs to be done just once, because what's the point of
>> > > writing it 10 times over? Then, the super is updated as the  
>> final commit.
>> >
>> > This is kind of a silly discussion.  The biggest extent possible on
>> > btrfs is 128MB, and the incremental gains of forcing 128MB extents to
>> > be consecutive are negligible.  If you're defragging a 10GB file, you're
>> > just going to end up doing 80 separate defrag operations.
>
>> Do you have a source for this claim of a 128MB max extent size?
>
> 	~/linux$ git grep BTRFS.*MAX.*EXTENT
> 	fs/btrfs/ctree.h:#define BTRFS_MAX_EXTENT_SIZE SZ_128M
>
> Plus years of watching bees logs scroll by, which never have an extent
> above 128M in size that contains data.
>
> I think there are a couple of exceptions for non-data-block extent items
> like holes.  A hole extent item doesn't have any physical location on
> disk, so its size field can be any 64-bit integer.  btrfs imposes no
> restriction there.
>
> PREALLOC extents are half hole, half nodatacow extent.  They can be
> larger than 128M when they are empty, but when data is written to them,
> they are replaced only in 128M chunks.
>
>> Because
>> everything I've seen indicates the max extent size is a full data chunk (so
>> 1GB for the common case, potentially up to about 5GB for really big
>> filesystems)
>
> If what you've seen so far is 'filefrag -v' output (or any tool based
> on the FIEMAP ioctl), then you are seeing post-processed extent sizes
> (where adjacent extents where begin[n+1] == end[n] are coalesced for
> human consumption), not true on-disk and in-metadata sizes.  FIEMAP is
> slow and full of lies.
>
>> > 128MB is big enough you're going to be seeking in the middle of reading
>> > an extent anyway.  Once you have the file arranged in 128MB contiguous
>> > fragments (or even a tenth of that on medium-fast spinning drives),
>> > the job is done.
>> >
>> > > On my comouter the ENTIRE METADATA is 1 GB. That would be very  
>> tolerable and
>> > > doable.
>> >
>> > You must have a small filesystem...mine range from 16 to 156GB, a bit too
>> > big to fit in RAM comfortably.
>> >
>> > Don't forget you have to write new checksum and free space tree pages.
>> > In the worst case, you'll need about 1GB of new metadata pages for each
>> > 128MB you defrag (though you get to delete 99.5% of them immediately
>> > after).
>> >
>> > > But that is a very bad case, because usually not much metadata has to be
>> > > updated or written out to disk.
>> > >
>> > > So, there is no problem.
>> > >

Mr. Blaxell, could you be so kind as to help me out on this mission of  
mine to describe a good defrag algorithm for BTRFS?

In order for me to better understand the circumstances, I need to know  
a few statistics about BTRFS filesystems. I'm interested in both the  
extreme and the common values.

One of the values in question is the total number of reflinks in BTRFS  
filesystems. In fact, I would like to know the following information  
about a given btrfs partition: the number of extents, the number of  
reflinks, the size of the physical data written on disk, the size of  
the logical (by sharing) data, the total size of the partition, the  
size of the metadata, and the number of snapshots.

So, could you please provide me with a few values that you think  
could be valid on typical (common) partitions, and also some of the  
extreme values that you have encountered while using btrfs.

Thanks,

     General Zed





^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  5:05                                         ` General Zed
@ 2019-09-14  0:56                                           ` Zygo Blaxell
  2019-09-14  1:50                                             ` General Zed
  2019-09-14  1:56                                             ` General Zed
  0 siblings, 2 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-14  0:56 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> 
> > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > 
> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > 
> > > > Don't forget you have to write new checksum and free space tree pages.
> > > > In the worst case, you'll need about 1GB of new metadata pages for each
> > > > 128MB you defrag (though you get to delete 99.5% of them immediately
> > > > after).
> > > 
> > > Yes, here we are debating some worst-case scenario which is actually
> > > impossible in practice due to various reasons.
> > 
> > No, it's quite possible.  A log file written slowly on an active
> > filesystem above a few TB will do that accidentally.  Every now and then
> > I hit that case.  It can take several hours to do a logrotate on spinning
> > arrays because of all the metadata fetches and updates associated with
> > worst-case file delete.  Long enough to watch the delete happen, and
> > even follow along in the source code.
> > 
> > I guess if I did a proactive defrag every few hours, it might take less
> > time to do the logrotate, but that would mean spreading out all the
> > seeky IO load during the day instead of getting it all done at night.
> > Logrotate does the same job as defrag in this case (replacing a file in
> > thousands of fragments spread across the disk with a few large fragments
> > close together), except logrotate gets better compression.
> > 
> > To be more accurate, the example I gave above is the worst case you
> > can expect from normal user workloads.  If I throw in some reflinks
> > and snapshots, I can make it arbitrarily worse, until the entire disk
> > is consumed by the metadata update of a single extent defrag.
> > 
> 
> I can't believe I am considering this case.
> 
> So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents
> randomly over the entire disk. We have 512 GB free RAM and 2% free disk
> space. The file needs to be defragmented.
> 
> In order to do that, defrag needs to be able to copy-move multiple extents
> in one batch, and update the metadata.
> 
> The metadata has a total of at least 256 million entries, each of some size,
> but each one should hold at least a pointer to the extent (8 bytes) and a
> checksum (8 bytes): In reality, it could be that there is a lot of other
> data there per entry.

It's about 48KB per 4K extent, plus a few hundred bytes on average for each
reference.

> The metadata is organized as a b-tree. Therefore, nearby nodes should
> contain data of consecutive file extents.

It's 48KB per item.  As you remove the original data extents, you will
be touching a 16KB page in three trees for each extent that is removed:
Free space tree, csum tree, and extent tree.  This happens after the
merged extent is created.  It is part of the cleanup operation that
gets rid of the original 4K extents.

Because the file was written very slowly on a big filesystem, the extents
are scattered pessimally all over the virtual address space, not packed
close together.  If there are a few hundred extent allocations between
each log extent, then they will all occupy separate metadata pages.
When it is time to remove them, each of these pages must be updated.
This can be hit in a number of places in btrfs, including overwrite
and delete.

There's also 60ish bytes per extent in any subvol trees the file
actually appears in, but you do get locality in that one (the key is
inode and offset, so nothing can get between them and space them apart).
That's 12GB and change (you'll probably completely empty most of the
updated subvol metadata pages, so we can expect maybe 5 pages to remain
including root and interior nodes).  I haven't been unlucky enough to
get a "natural" 12GB, but I got over 1GB a few times recently.

Reflinks can be used to multiply that 12GB arbitrarily--you only get
locality if the reflinks are consecutive in (inode, offset) space,
so if the reflinks are scattered across subvols or files, they won't
share pages.
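
To make the arithmetic above concrete, here is a rough back-of-the-envelope
sketch in Python (purely illustrative): the 48 KB per 4 KB extent (three
16 KB tree pages) and the ~48-60 bytes per subvol item are the figures from
this message; everything else is assumed for illustration.

# Back-of-the-envelope model of the figures above; illustrative only.
KiB, MiB, GiB, TiB = 1 << 10, 1 << 20, 1 << 30, 1 << 40

file_size     = 1 * TiB      # the slowly-written "ultralog" example
extent_size   = 4 * KiB      # pessimally fragmented
node_size     = 16 * KiB     # metadata page size
trees_touched = 3            # free space tree, csum tree, extent tree

extents = file_size // extent_size                  # ~256 million
per_extent_cleanup = trees_touched * node_size      # ~48 KiB per extent

# Upper bound on metadata pages touched while cleaning up one 128 MiB
# batch of old 4 KiB extents (page sharing brings the real figure down
# toward the "about 1GB" quoted earlier).
extents_per_batch = (128 * MiB) // extent_size
print(extents_per_batch * per_extent_cleanup / GiB, "GiB per 128 MiB batch")

# Subvol tree items for the whole file, at ~48-60 bytes each; this is
# the "12GB and change" figure.
print(extents * 48 / GiB, "to", extents * 60 / GiB, "GiB of subvol items")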

> The trick, in this case, is to select one part of "ultralog" which is
> localized in the metadata, and defragment it. Repeating this step will
> ultimately defragment the entire file.
> 
> So, the defrag selects some part of metadata which is entirely a descendant
> of some b-tree node not far from the bottom of b-tree. It selects it such
> that the required update to the metadata is less than, let's say, 64 MB, and
> simultaneously the affected "ultralog" file fragments total less than 512 MB
> (therefore, less than 128 thousand metadata leaf entries, each pointing to a
> 4 KB fragment). Then it finds all the file extents pointed to by that part
> of metadata. They are consecutive (as file fragments), because we have
> selected such part of metadata. Now the defrag can safely copy-move those
> fragments to a new area and update the metadata.
> 
> In order to quickly select that small part of metadata, the defrag needs a
> metadata cache that can hold somewhat more than 128 thousand localized
> metadata leaf entries. That fits into 128 MB RAM definitely.
> 
> Of course, there are many other small issues there, but this outlines the
> general procedure.
> 
> Problem solved?

Problem missed completely.  The forward reference updates were the only
easy part.

My solution is to detect this is happening in real time, and merge the
extents while they're still too few to be a problem.  Now you might be
thinking "but doesn't that mean you'll merge the same data blocks over
and over, wasting iops?" but really it's a perfectly reasonable trade
considering the interest rates those unspent iops can collect on btrfs.
If the target minimum extent size is 192K, you turn this 12GB problem into
a 250MB one, and the 1GB problem that actually occurs becomes trivial.
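
A quick sanity check of that scaling (illustrative arithmetic only, using
the figures above):

# Merging extents as they appear leaves ~192 KiB extents instead of
# 4 KiB ones, i.e. 48x fewer extents and 48x fewer items to update.
KiB, MiB, GiB = 1 << 10, 1 << 20, 1 << 30

worst_case = 12 * GiB                    # the "12GB" subvol item figure
shrink = (192 * KiB) // (4 * KiB)        # 48
print(worst_case / shrink / MiB, "MiB")  # ~256 MiB, i.e. the "250MB" above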

Another solution would be to get the allocator to reserve some space
near growing files reserved for use by those files, so that the small
fragments don't explode across the address space.  Then we'd get locality
in all four btrees.  Other filesystems have heuristics all over their
allocators to do things like this--btrfs seems to have a very minimal
allocator that could stand much improvement.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13  9:25                                           ` General Zed
  2019-09-13 17:02                                             ` General Zed
@ 2019-09-14  0:59                                             ` Zygo Blaxell
  2019-09-14  1:28                                               ` General Zed
  1 sibling, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-14  0:59 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Fri, Sep 13, 2019 at 05:25:20AM -0400, General Zed wrote:
> 
> Quoting General Zed <general-zed@zedlx.com>:
> 
> > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > 
> > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > 
> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > 
> > > > > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
> > > > > >
> > > > > > At worst, it just has to completely write-out "all
> > > > > metadata", all the way up
> > > > > > to the super. It needs to be done just once, because what's the point of
> > > > > > writing it 10 times over? Then, the super is updated as
> > > > > the final commit.
> > > > > 
> > > > > This is kind of a silly discussion.  The biggest extent possible on
> > > > > btrfs is 128MB, and the incremental gains of forcing 128MB extents to
> > > > > be consecutive are negligible.  If you're defragging a 10GB file, you're
> > > > > just going to end up doing 80 separate defrag operations.
> > > > 
> > > > Ok, then the max extent is 128 MB, that's fine. Someone here
> > > > previously said
> > > > that it is 2 GB, so he has disinformed me (in order to further his false
> > > > argument).
> > > 
> > > If the 128MB limit is removed, you then hit the block group size limit,
> > > which is some number of GB from 1 to 10 depending on number of disks
> > > available and raid profile selection (the striping raid profiles cap
> > > block group sizes at 10 disks, and single/raid1 profiles always use 1GB
> > > block groups regardless of disk count).  So 2GB is _also_ a valid extent
> > > size limit, just not the first limit that is relevant for defrag.
> > > 
> > > A lot of people get confused by 'filefrag -v' output, which coalesces
> > > physically adjacent but distinct extents.  So if you use that tool,
> > > it can _seem_ like there is a 2.5GB extent in a file, but it is really
> > > 20 distinct 128MB extents that start and end at adjacent addresses.
> > > You can see the true structure in 'btrfs ins dump-tree' output.
> > > 
> > > That also brings up another reason why 10GB defrags are absurd on btrfs:
> > > extent addresses are virtual.  There's no guarantee that a pair of extents
> > > that meet at a block group boundary are physically adjacent, and after
> > > operations like RAID array reorganization or free space defragmentation,
> > > they are typically quite far apart physically.
> > > 
> > > > I didn't ever said that I would force extents larger than 128 MB.
> > > > 
> > > > If you are defragging a 10 GB file, you'll likely have to do it
> > > > in 10 steps,
> > > > because the defrag is usually allowed to only use a limited amount of disk
> > > > space while in operation. That has nothing to do with the extent size.
> > > 
> > > Defrag is literally manipulating the extent size.  Fragments and extents
> > > are the same thing in btrfs.
> > > 
> > > Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
> > > commit metadata updates after each step, so more than 128MB of temporary
> > > space may be used (especially if your disks are fast and empty,
> > > and you start just after the end of the previous commit interval).
> > > There are some opportunities to coalesce metadata updates, occupying up
> > > to a (arbitrary) limit of 512MB of RAM (or when memory pressure forces
> > > a flush, whichever comes first), but exploiting those opportunities
> > > requires more space for uncommitted data.
> > > 
> > > If the filesystem starts to get low on space during a defrag, it can
> > > inject commits to force metadata updates to happen more often, which
> > > reduces the amount of temporary space needed (we can't delete the original
> > > fragmented extents until their replacement extent is committed); however,
> > > if the filesystem is so low on space that you're worried about running
> > > out during a defrag, then you probably don't have big enough contiguous
> > > free areas to relocate data into anyway, i.e. the defrag is just going to
> > > push data from one fragmented location to a different fragmented location,
> > > or bail out with "sorry, can't defrag that."
> > 
> > Nope.
> > 
> > Each defrag "cycle" consists of two parts:
> >      1) move-out part
> >      2) move-in part
> > 
> > The move-out part selects one contiguous area of the disk. Almost any
> > area will do, but some smart choices are better. It then moves-out all
> > data from that contiguous area into whatever holes there are left empty
> > on the disk. The biggest problem is actually updating the metadata,
> > since the updates are not localized.
> > Anyway, this part can even be skipped.
> > 
> > The move-in part now populates the completely free contiguous area with
> > defragmented data.
> > 
> > In the case that the move-out part needs to be skipped because the
> > defrag estimates that the update to metadata will be too big (like in
> > the pathological case of a disk with 156 GB of metadata), it can
> > successfully defrag by performing only the move-in part. In that case,
> > the move-in area is not free of data and "defragmented" data won't be
> > fully defragmented. Also, there should be at least 20% free disk space
> > in this case in order to avoid defrag turning pathological.
> > 
> > But, these are all some pathological cases. They should be considered in
> > some other discussion.
> 
> I know how to do this pathological case. Figured it out!
> 
> Yeah, always ask General Zed, he knows the best!!!
> 
> The move-in phase is not a problem, because this phase generally affects a
> low number of files.
> 
> So, let's consider the move-out phase. The main concern here is that the
> move-out area may contain so many different files and fragments that the
> move-out forces a practically undoable metadata update.
> 
> So, the way to do it is to select files for move-out, one by one (or even
> more granular, by fragments of files), while keeping track of the size of
> the necessary metadata update. When the metadata update exceeds a certain
> amount (let's say 128 MB, an amount that can easily fit into RAM), the
> move-out is performed with only currently selected files (file fragments).
> (The move-out often doesn't affect a whole file since only a part of each
> file lies within the move-out area).

This move-out phase sounds like a reinvention of btrfs balance.  Balance
already does something similar, and python-btrfs gives you a script to
target block groups with high free space fragmentation for balancing.
It moves extents (and their references) away from their block group.
You get GB-sized (or multi-GB-sized) contiguous free space areas into
which you can then allocate big extents.
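
For illustration, a minimal sketch (Python) of that targeting idea; the
data layout and helper names here are hypothetical stand-ins, not the
actual python-btrfs API.

# Score each block group by how badly its free space is fragmented and
# return the worst offenders, which can then be fed to balance (e.g. via
# the vrange filter) so their extents move out and a large contiguous
# free area opens up. The input format is a stand-in, not a real API.

def fragmentation_score(free_extent_sizes, small=1 << 20):
    """More small free extents => higher score (0.0 = one big hole)."""
    if not free_extent_sizes:
        return 0.0
    return sum(1 for s in free_extent_sizes if s < small) / len(free_extent_sizes)

def worst_block_groups(bg_free_map, limit=10):
    """bg_free_map: {block_group_vaddr: [free extent sizes in bytes]}."""
    ranked = sorted(bg_free_map.items(),
                    key=lambda kv: fragmentation_score(kv[1]),
                    reverse=True)
    return [vaddr for vaddr, _ in ranked[:limit]]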

> Now the defrag has to decide: whether to continue with another round of the
> move-out to get a cleaner move-in area (by repeating the same procedure
> above), or should it continue with a move-in into a partially dirty area. I
> can't tell you what's better right now, as this can be determined only by
> experiments.
> 
> Lastly, the move-in phase is performed (can be done whether the move-in area
> is dirty or completely clean). Again, the same trick can be used: files can
> be selected one by one until the calculated metadata update exceeds 128 MB.
> However, it is more likely that the size of move-in area will be exhausted
> before this happens.
> 
> This algorithm will work even if you have only 3% free disk space left.

I was thinking more like "you have less than 1GB free on a 1TB filesystem
and you want to defrag 128MB things", i.e. <0.1% free space.  If you don't
have all the metadata block group free space you need allocated already
by that point, you can run out of metadata space and the filesystem goes
read-only.  Happens quite often to people.  They don't like it very much.

> This algorithm will also work if you have metadata of huge size, but in that
> case it is better to have much more free disk space (20%) to avoid
> significantly slowing down the defrag operation.
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  0:59                                             ` Zygo Blaxell
@ 2019-09-14  1:28                                               ` General Zed
  2019-09-14  4:28                                                 ` Zygo Blaxell
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-14  1:28 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Fri, Sep 13, 2019 at 05:25:20AM -0400, General Zed wrote:
>>
>> Quoting General Zed <general-zed@zedlx.com>:
>>
>> > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> >
>> > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > > >
>> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > > >
>> > > > > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>> > > > > >
>> > > > > > At worst, it just has to completely write-out "all
>> > > > > metadata", all the way up
>> > > > > > to the super. It needs to be done just once, because  
>> what's the point of
>> > > > > > writing it 10 times over? Then, the super is updated as
>> > > > > the final commit.
>> > > > >
>> > > > > This is kind of a silly discussion.  The biggest extent possible on
>> > > > > btrfs is 128MB, and the incremental gains of forcing 128MB  
>> extents to
>> > > > > be consecutive are negligible.  If you're defragging a 10GB  
>> file, you're
>> > > > > just going to end up doing 80 separate defrag operations.
>> > > >
>> > > > Ok, then the max extent is 128 MB, that's fine. Someone here
>> > > > previously said
>> > > > that it is 2 GB, so he has disinformed me (in order to  
>> further his false
>> > > > argument).
>> > >
>> > > If the 128MB limit is removed, you then hit the block group size limit,
>> > > which is some number of GB from 1 to 10 depending on number of disks
>> > > available and raid profile selection (the striping raid profiles cap
>> > > block group sizes at 10 disks, and single/raid1 profiles always use 1GB
>> > > block groups regardless of disk count).  So 2GB is _also_ a valid extent
>> > > size limit, just not the first limit that is relevant for defrag.
>> > >
>> > > A lot of people get confused by 'filefrag -v' output, which coalesces
>> > > physically adjacent but distinct extents.  So if you use that tool,
>> > > it can _seem_ like there is a 2.5GB extent in a file, but it is really
>> > > 20 distinct 128MB extents that start and end at adjacent addresses.
>> > > You can see the true structure in 'btrfs ins dump-tree' output.
>> > >
>> > > That also brings up another reason why 10GB defrags are absurd on btrfs:
>> > > extent addresses are virtual.  There's no guarantee that a pair  
>> of extents
>> > > that meet at a block group boundary are physically adjacent, and after
>> > > operations like RAID array reorganization or free space defragmentation,
>> > > they are typically quite far apart physically.
>> > >
>> > > > I didn't ever said that I would force extents larger than 128 MB.
>> > > >
>> > > > If you are defragging a 10 GB file, you'll likely have to do it
>> > > > in 10 steps,
>> > > > because the defrag is usually allowed to only use a limited  
>> amount of disk
>> > > > space while in operation. That has nothing to do with the extent size.
>> > >
>> > > Defrag is literally manipulating the extent size.  Fragments and extents
>> > > are the same thing in btrfs.
>> > >
>> > > Currently a 10GB defragment will work in 80 steps, but doesn't  
>> necessarily
>> > > commit metadata updates after each step, so more than 128MB of temporary
>> > > space may be used (especially if your disks are fast and empty,
>> > > and you start just after the end of the previous commit interval).
>> > > There are some opportunities to coalesce metadata updates, occupying up
>> > > to a (arbitrary) limit of 512MB of RAM (or when memory pressure forces
>> > > a flush, whichever comes first), but exploiting those opportunities
>> > > requires more space for uncommitted data.
>> > >
>> > > If the filesystem starts to get low on space during a defrag, it can
>> > > inject commits to force metadata updates to happen more often, which
>> > > reduces the amount of temporary space needed (we can't delete  
>> the original
>> > > fragmented extents until their replacement extent is  
>> committed); however,
>> > > if the filesystem is so low on space that you're worried about running
>> > > out during a defrag, then you probably don't have big enough contiguous
>> > > free areas to relocate data into anyway, i.e. the defrag is  
>> just going to
>> > > push data from one fragmented location to a different  
>> fragmented location,
>> > > or bail out with "sorry, can't defrag that."
>> >
>> > Nope.
>> >
>> > Each defrag "cycle" consists of two parts:
>> >      1) move-out part
>> >      2) move-in part
>> >
>> > The move-out part selects one contiguous area of the disk. Almost any
>> > area will do, but some smart choices are better. It then moves-out all
>> > data from that contiguous area into whatever holes there are left empty
>> > on the disk. The biggest problem is actually updating the metadata,
>> > since the updates are not localized.
>> > Anyway, this part can even be skipped.
>> >
>> > The move-in part now populates the completely free contiguous area with
>> > defragmented data.
>> >
>> > In the case that the move-out part needs to be skipped because the
>> > defrag estimates that the update to metadata will be too big (like in
>> > the pathological case of a disk with 156 GB of metadata), it can
>> > successfully defrag by performing only the move-in part. In that case,
>> > the move-in area is not free of data and "defragmented" data won't be
>> > fully defragmented. Also, there should be at least 20% free disk space
>> > in this case in order to avoid defrag turning pathological.
>> >
>> > But, these are all some pathological cases. They should be considered in
>> > some other discussion.
>>
>> I know how to do this pathological case. Figured it out!
>>
>> Yeah, always ask General Zed, he knows the best!!!
>>
>> The move-in phase is not a problem, because this phase generally affects a
>> low number of files.
>>
>> So, let's consider the move-out phase. The main concern here is that the
>> move-out area may contain so many different files and fragments that the
>> move-out forces a practically undoable metadata update.
>>
>> So, the way to do it is to select files for move-out, one by one (or even
>> more granular, by fragments of files), while keeping track of the size of
>> the necessary metadata update. When the metadata update exceeds a certain
>> amount (let's say 128 MB, an amount that can easily fit into RAM), the
>> move-out is performed with only currently selected files (file fragments).
>> (The move-out often doesn't affect a whole file since only a part of each
>> file lies within the move-out area).
>
> This move-out phase sounds like a reinvention of btrfs balance.  Balance
> already does something similar, and python-btrfs gives you a script to
> target block groups with high free space fragmentation for balancing.
> It moves extents (and their references) away from their block group.
> You get GB-sized (or multi-GB-sized) contiguous free space areas into
> which you can then allocate big extents.

Perhaps btrfs balance needs to perform something similar, but I can  
assure you that a balance cannot replace the defrag.

The point and the purpose of "move out" is to create a clean  
contiguous free space area, so that defragmented files can be written  
into it.
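
For what it's worth, here is a minimal sketch (Python) of the budgeted
move-out loop described in the quoted text; estimate_md_cost(), relocate()
and commit() are hypothetical stand-ins, not btrfs interfaces.

# Pick fragments that lie inside the chosen move-out area one at a time,
# and flush a batch whenever the estimated metadata update would exceed
# the budget (128 MB in the text above).

MD_BUDGET = 128 << 20   # pending metadata updates allowed per batch

def move_out(fragments_in_area, estimate_md_cost, relocate, commit):
    batch, pending = [], 0
    for frag in fragments_in_area:
        cost = estimate_md_cost(frag)   # pages in extent/csum/free-space trees
        if batch and pending + cost > MD_BUDGET:
            relocate(batch)             # copy data out into existing holes
            commit()                    # write the batched metadata update
            batch, pending = [], 0
        batch.append(frag)
        pending += cost
    if batch:
        relocate(batch)
        commit()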

>> Now the defrag has to decide: whether to continue with another round of the
>> move-out to get a cleaner move-in area (by repeating the same procedure
>> above), or should it continue with a move-in into a partially dirty area. I
>> can't tell you what's better right now, as this can be determined only by
>> experiments.
>>
>> Lastly, the move-in phase is performed (can be done whether the move-in area
>> is dirty or completely clean). Again, the same trick can be used: files can
>> be selected one by one until the calculated metadata update exceeds 128 MB.
>> However, it is more likely that the size of move-in area will be exhausted
>> before this happens.
>>
>> This algorithm will work even if you have only 3% free disk space left.
>
> I was thinking more like "you have less than 1GB free on a 1TB filesystem
> and you want to defrag 128MB things", i.e. <0.1% free space.  If you don't
> have all the metadata block group free space you need allocated already
> by that point, you can run out of metadata space and the filesystem goes
> read-only.  Happens quite often to people.  They don't like it very much.

The defrag should abort whenever it detects such adverse conditions as  
0.1% free disk space. In fact, it should probably abort as soon as it  
detects less than 3% free disk space. This is normal and expected. If  
the user has a partition with less than 3% free disk space, he/she  
should not defrag it until he/she frees some space, perhaps by  
deleting unnecessary data or by moving out some data to other  
partitions.

This is not autodefrag. The defrag operation is an on-demand  
operation. It has certain requirements in order to function.
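
A minimal sketch of that precondition (Python; note that statvfs-style
free space is only approximate on btrfs, so a real check would want
better numbers):

import shutil

def defrag_allowed(mountpoint, min_free_fraction=0.03):
    """Refuse to start an on-demand defrag below 3% free space."""
    usage = shutil.disk_usage(mountpoint)
    return usage.free / usage.total >= min_free_fraction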

>> This algorithm will also work if you have metadata of huge size, but in that
>> case it is better to have much more free disk space (20%) to avoid
>> significantly slowing down the defrag operation.
>>
>>




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  0:56                                           ` Zygo Blaxell
@ 2019-09-14  1:50                                             ` General Zed
  2019-09-14  4:42                                               ` Zygo Blaxell
  2019-09-14  1:56                                             ` General Zed
  1 sibling, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-14  1:50 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > >
>> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > >
>> > > > Don't forget you have to write new checksum and free space tree pages.
>> > > > In the worst case, you'll need about 1GB of new metadata  
>> pages for each
>> > > > 128MB you defrag (though you get to delete 99.5% of them immediately
>> > > > after).
>> > >
>> > > Yes, here we are debating some worst-case scenario which is actually
>> > > impossible in practice due to various reasons.
>> >
>> > No, it's quite possible.  A log file written slowly on an active
>> > filesystem above a few TB will do that accidentally.  Every now and then
>> > I hit that case.  It can take several hours to do a logrotate on spinning
>> > arrays because of all the metadata fetches and updates associated with
>> > worst-case file delete.  Long enough to watch the delete happen, and
>> > even follow along in the source code.
>> >
>> > I guess if I did a proactive defrag every few hours, it might take less
>> > time to do the logrotate, but that would mean spreading out all the
>> > seeky IO load during the day instead of getting it all done at night.
>> > Logrotate does the same job as defrag in this case (replacing a file in
>> > thousands of fragments spread across the disk with a few large fragments
>> > close together), except logrotate gets better compression.
>> >
>> > To be more accurate, the example I gave above is the worst case you
>> > can expect from normal user workloads.  If I throw in some reflinks
>> > and snapshots, I can make it arbitrarily worse, until the entire disk
>> > is consumed by the metadata update of a single extent defrag.
>> >
>>
>> I can't believe I am considering this case.
>>
>> So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents
>> randomly over the entire disk. We have 512 GB free RAM and 2% free disk
>> space. The file needs to be defragmented.
>>
>> In order to do that, defrag needs to be able to copy-move multiple extents
>> in one batch, and update the metadata.
>>
>> The metadata has a total of at least 256 million entries, each of some size,
>> but each one should hold at least a pointer to the extent (8 bytes) and a
>> checksum (8 bytes): In reality, it could be that there is a lot of other
>> data there per entry.
>
> It's about 48KB per 4K extent, plus a few hundred bytes on average for each
> reference.

Sorry, could you be clearer there? A file fragment/extent that holds  
file data can be any size up to 128 MB. What metadata is there for  
every file fragment/extent?

Because I cannot decode what "48 KB per 4 K extent" means.

Another question is: what is the average size of metadata extents?

>> The metadata is organized as a b-tree. Therefore, nearby nodes should
>> contain data of consecutive file extents.
>
> It's 48KB per item.

What's the "item"?

> As you remove the original data extents, you will
> be touching a 16KB page in three trees for each extent that is removed:
> Free space tree, csum tree, and extent tree.  This happens after the
> merged extent is created.  It is part of the cleanup operation that
> gets rid of the original 4K extents.

Ok, but how big are the free space tree and the csum tree?

Also, when moving a file to defragment it, there should still be some  
locality even in the free space tree.

And the csum tree should be ordered similarly to the free space tree, right?

> Because the file was written very slowly on a big filesystem, the extents
> are scattered pessimally all over the virtual address space, not packed
> close together.  If there are a few hundred extent allocations between
> each log extent, then they will all occupy separate metadata pages.

Ok, now you are talking about your pathological case. Let's consider it.

Note that there is very little that can be done in the case that you  
are describing. In order to defrag such a file, either the defrag will  
take many small steps and therefore it will be slow (because each step  
needs to perform an update to the metadata), or the defrag can do it  
in one big step and use a huge amount of RAM.

So, the best thing to be done in this situation is to allow the user  
to specify the amount of RAM that defrag is allowed to use, so that  
the user decides which of the two (slow defrag or lots of RAM) he wants.

There is no way around it. There is no better defrag than the one that  
has ALL the information at hand; that one will be the fastest and best  
defrag.
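
To illustrate that trade-off with the ~48 KB-per-extent figure quoted  
earlier in this thread (everything else here is assumed for illustration):

# A RAM cap bounds how many extents one defrag step can process, and
# therefore how many steps (and metadata commits) the whole file needs.
PER_EXTENT_MD = 48 << 10      # ~48 KiB of metadata pages per 4 KiB extent
EXTENTS_TOTAL = 1 << 28       # the 256-million-extent "ultralog"

def plan(ram_budget):
    extents_per_step = max(1, ram_budget // PER_EXTENT_MD)
    steps = -(-EXTENTS_TOTAL // extents_per_step)   # ceiling division
    return extents_per_step, steps

for ram in (128 << 20, 4 << 30, 512 << 30):
    per_step, steps = plan(ram)
    print(f"{ram >> 20} MiB RAM -> {per_step} extents/step, {steps} steps")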

> When it is time to remove them, each of these pages must be updated.
> This can be hit in a number of places in btrfs, including overwrite
> and delete.
>
> There's also 60ish bytes per extent in any subvol trees the file
> actually appears in, but you do get locality in that one (the key is
> inode and offset, so nothing can get between them and space them apart).
> That's 12GB and change (you'll probably completely empty most of the
> updated subvol metadata pages, so we can expect maybe 5 pages to remain
> including root and interior nodes).  I haven't been unlucky enough to
> get a "natural" 12GB, but I got over 1GB a few times recently.

The thing that I figured out (and I have already written it down in  
another post) is that the defrag can CHOOSE AT WILL how large a  
metadata update it wants to perform (within the limit of available RAM).  
The defrag can select, by itself, the most efficient way to proceed  
while still honoring the user-supplied limit on RAM.

> Reflinks can be used to multiply that 12GB arbitrarily--you only get
> locality if the reflinks are consecutive in (inode, offset) space,
> so if the reflinks are scattered across subvols or files, they won't
> share pages.

OK.

Yes, given a sufficiently pathological case, the defrag will take  
forever. There is nothing unexpected there. I agree on that point. The  
defrag always functions within certain prerequisites.

>> The trick, in this case, is to select one part of "ultralog" which is
>> localized in the metadata, and defragment it. Repeating this step will
>> ultimately defragment the entire file.
>>
>> So, the defrag selects some part of metadata which is entirely a descendant
>> of some b-tree node not far from the bottom of b-tree. It selects it such
>> that the required update to the metadata is less than, let's say, 64 MB, and
>> simultaneously the affected "ultralog" file fragments total less than 512 MB
>> (therefore, less than 128 thousand metadata leaf entries, each pointing to a
>> 4 KB fragment). Then it finds all the file extents pointed to by that part
>> of metadata. They are consecutive (as file fragments), because we have
>> selected such part of metadata. Now the defrag can safely copy-move those
>> fragments to a new area and update the metadata.
>>
>> In order to quickly select that small part of metadata, the defrag needs a
>> metadata cache that can hold somewhat more than 128 thousand localized
>> metadata leaf entries. That fits into 128 MB RAM definitely.
>>
>> Of course, there are many other small issues there, but this outlines the
>> general procedure.
>>
>> Problem solved?

> Problem missed completely.  The forward reference updates were the only
> easy part.

Oh, I'll reply in another mail; this one is getting too tiring.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  0:56                                           ` Zygo Blaxell
  2019-09-14  1:50                                             ` General Zed
@ 2019-09-14  1:56                                             ` General Zed
  1 sibling, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-14  1:56 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > >
>> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > >
>> > > > Don't forget you have to write new checksum and free space tree pages.
>> > > > In the worst case, you'll need about 1GB of new metadata  
>> pages for each
>> > > > 128MB you defrag (though you get to delete 99.5% of them immediately
>> > > > after).
>> > >
>> > > Yes, here we are debating some worst-case scenario which is actually
>> > > impossible in practice due to various reasons.
>> >
>> > No, it's quite possible.  A log file written slowly on an active
>> > filesystem above a few TB will do that accidentally.  Every now and then
>> > I hit that case.  It can take several hours to do a logrotate on spinning
>> > arrays because of all the metadata fetches and updates associated with
>> > worst-case file delete.  Long enough to watch the delete happen, and
>> > even follow along in the source code.
>> >
>> > I guess if I did a proactive defrag every few hours, it might take less
>> > time to do the logrotate, but that would mean spreading out all the
>> > seeky IO load during the day instead of getting it all done at night.
>> > Logrotate does the same job as defrag in this case (replacing a file in
>> > thousands of fragments spread across the disk with a few large fragments
>> > close together), except logrotate gets better compression.
>> >
>> > To be more accurate, the example I gave above is the worst case you
>> > can expect from normal user workloads.  If I throw in some reflinks
>> > and snapshots, I can make it arbitrarily worse, until the entire disk
>> > is consumed by the metadata update of a single extent defrag.
>> >
>>
>> I can't believe I am considering this case.
>>
>> So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents
>> randomly over the entire disk. We have 512 GB free RAM and 2% free disk
>> space. The file needs to be defragmented.
>>
>> In order to do that, defrag needs to be able to copy-move multiple extents
>> in one batch, and update the metadata.
>>
>> The metadata has a total of at least 256 million entries, each of some size,
>> but each one should hold at least a pointer to the extent (8 bytes) and a
>> checksum (8 bytes): In reality, it could be that there is a lot of other
>> data there per entry.
>
> It's about 48KB per 4K extent, plus a few hundred bytes on average for each
> reference.
>
>> The metadata is organized as a b-tree. Therefore, nearby nodes should
>> contain data of consecutive file extents.
>
> It's 48KB per item.  As you remove the original data extents, you will
> be touching a 16KB page in three trees for each extent that is removed:
> Free space tree, csum tree, and extent tree.  This happens after the
> merged extent is created.  It is part of the cleanup operation that
> gets rid of the original 4K extents.
>
> Because the file was written very slowly on a big filesystem, the extents
> are scattered pessimally all over the virtual address space, not packed
> close together.  If there are a few hundred extent allocations between
> each log extent, then they will all occupy separate metadata pages.
> When it is time to remove them, each of these pages must be updated.
> This can be hit in a number of places in btrfs, including overwrite
> and delete.
>
> There's also 60ish bytes per extent in any subvol trees the file
> actually appears in, but you do get locality in that one (the key is
> inode and offset, so nothing can get between them and space them apart).
> That's 12GB and change (you'll probably completely empty most of the
> updated subvol metadata pages, so we can expect maybe 5 pages to remain
> including root and interior nodes).  I haven't been unlucky enough to
> get a "natural" 12GB, but I got over 1GB a few times recently.
>
> Reflinks can be used to multiply that 12GB arbitrarily--you only get
> locality if the reflinks are consecutive in (inode, offset) space,
> so if the reflinks are scattered across subvols or files, they won't
> share pages.
>
>> The trick, in this case, is to select one part of "ultralog" which is
>> localized in the metadata, and defragment it. Repeating this step will
>> ultimately defragment the entire file.
>>
>> So, the defrag selects some part of metadata which is entirely a descendant
>> of some b-tree node not far from the bottom of b-tree. It selects it such
>> that the required update to the metadata is less than, let's say, 64 MB, and
>> simultaneously the affected "ultralog" file fragments total less than 512 MB
>> (therefore, less than 128 thousand metadata leaf entries, each pointing to a
>> 4 KB fragment). Then it finds all the file extents pointed to by that part
>> of metadata. They are consecutive (as file fragments), because we have
>> selected such part of metadata. Now the defrag can safely copy-move those
>> fragments to a new area and update the metadata.
>>
>> In order to quickly select that small part of metadata, the defrag needs a
>> metadata cache that can hold somewhat more than 128 thousand localized
>> metadata leaf entries. That fits into 128 MB RAM definitely.
>>
>> Of course, there are many other small issues there, but this outlines the
>> general procedure.
>>
>> Problem solved?
>
> Problem missed completely.  The forward reference updates were the only
> easy part.
>
> My solution is to detect this is happening in real time, and merge the
> extents while they're still too few to be a problem.  Now you might be
> thinking "but doesn't that mean you'll merge the same data blocks over
> and over, wasting iops?" but really it's a perfectly reasonable trade
> considering the interest rates those unspent iops can collect on btrfs.
> If the target minimum extent size is 192K, you turn this 12GB problem into
> a 250MB one, and the 1GB problem that actually occurs becomes trivial.
>
> Another solution would be to get the allocator to reserve some space
> near growing files reserved for use by those files, so that the small
> fragments don't explode across the address space.  Then we'd get locality
> in all four btrees.  Other filesystems have heuristics all over their
> allocators to do things like this--btrfs seems to have a very minimal
> allocator that could stand much improvement.

Ok, a fine solution. Basically, you improve the autodefrag to detect  
this specific situation.

Another way to solve this issue is to run the on-demand defrag  
sufficiently often. You order the defrag to defragment only that one  
specific file, or you order it to find and defrag only the 0.1% most  
fragmented files (and the pathological file should fall within that  
0.1%).
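
As a sketch of that selection (Python; fragment_count() is a hypothetical  
stand-in, e.g. something that parses FIEMAP data, not a btrfs API):

import math

def most_fragmented(paths, fragment_count, fraction=0.001):
    """Return the top 0.1% of files by fragment count."""
    ranked = sorted(paths, key=fragment_count, reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]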

But, this is a very specific and rare case that we are talking about here.

So, that's it.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-12 21:23                         ` General Zed
@ 2019-09-14  4:12                           ` Zygo Blaxell
  2019-09-16 11:42                             ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-14  4:12 UTC (permalink / raw)
  To: General Zed; +Cc: Austin S. Hemmelgarn, linux-btrfs

On Thu, Sep 12, 2019 at 05:23:21PM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> 
> > On Wed, Sep 11, 2019 at 07:21:31PM -0400, webmaster@zedlx.com wrote:
> > > 
> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
[...etc...]
> > > > On Wed, Sep 11, 2019 at 01:20:53PM -0400, webmaster@zedlx.com wrote:
> > It's the default for GNU coreutils, and for 'mv' across subvols there
> > is currently no option to turn reflink copies off.  Maybe for 'cp'
> > you still have to explicitly request reflink, but that will presumably
> > change at some point as more filesystems get the CLONE_RANGE ioctl and
> > more users expect it to just work by default.
> 
> Yes, thank you for posting another batch of arguments that support the use
> of my vision of defrag instead of the current one.
> 
> The defrag that I'm proposing will preserve all those reflinks that were
> painstakingly created by the user. Therefore, I take that you agree with me
> on the utmost importance of implementing this new defrag that I'm proposing.

I do not agree that improving the current defrag is of utmost importance,
or indeed of any importance whatsoever.  The current defrag API is a
clumsy, unscalable hack that cannot play well with other filesystem layout
optimization tools no matter what you do to its internal implementation
details.  It's better to start over with a better design, and spend only
the minimal amount of effort required to keep the old one building until
its replacement(s) is (are) proven in use and ready for deployment.

I'm adding extent-merging support to an existing tool that already
performs several other filesystem layout optimizations.  The goal is to
detect degenerate extent layout on filesystems as it appears, and repair
it before it becomes a more severe performance problem, without wasting
resources on parts of the filesystem that do not require intervention.

Your defrag ideas are interesting, but you should spend a lot more
time learning the btrfs fundamentals before continuing.  Right now
you do not understand what btrfs is capable of doing easily, and what
requires such significant rework in btrfs to implement that the result
cannot be considered the same filesystem.  This is impairing the quality
of your design proposals and reducing the value of your contribution
significantly.

> I suggest that btrfs should first try to determine whether it can split an
> extent in-place, or not. If it can't do that, then it should create new
> extents to split the old one.

btrfs cannot split extents in place, so it must always create new
extents by copying data blocks.  It's a hugely annoying and non-trivial
limitation that makes me consider starting over with some other filesystem
quite often.

If you are looking for important btrfs work, consider solving that
problem first.  It would dramatically improve GC (in the sense that
it would eliminate the need to perform a separate GC step at all) and
dedupe performance on btrfs as well as help defrag and other extent
layout optimizers.

> Therefore, the defrag can free unused parts of any extent, and then the
> extent can be split if necessary. In fact, both these operations can be done
> simultaneously.

Sure, but I only call one of these operations "defrag" (the extent merge
operation).  The other operations increase the total number of fragments
in the filesystem, so "defrag" is not an appropriate name for them.
An appropriate name would be something like "enfrag" or "refrag" or
"split".  In some cases the "defrag" can be performed by doing a "dedupe"
operation with a single unfragmented identical source extent replacing
several fragmented destination extents...what do you call that?

> > Dedupe on btrfs also requires the ability to split and merge extents;
> > otherwise, we can't dedupe an extent that contains a combination of
> > unique and duplicate data.  If we try to just move references around
> > without splitting extents into all-duplicate and all-unique extents,
> > the duplicate blocks become unreachable, but are not deallocated.  If we
> > only split extents, fragmentation overhead gets bad.  Before creating
> > thousands of references to an extent, it is worthwhile to merge it with
> > as many of its neighbors as possible, ideally by picking the biggest
> > existing garbage-free extents available so we don't have to do defrag.
> > As we examine each extent in the filesystem, it may be best to send
> > to defrag, dedupe, or garbage collection--sometimes more than one of
> > those.
> 
> This is sovled simply by always running defrag before dedupe.

Defrag and dedupe in separate passes is nonsense on btrfs.

Defrag burns a lot of iops on defrag moving extent data around to create
new size-driven extent boundaries.  These will have to be immediately
moved again by dedupe (except in special cases like full-file matches),
because dedupe needs to create content-driven extent boundaries to work
on btrfs.

Extent splitting in-place is not possible on btrfs, so extent boundary
changes necessarily involve data copies.  Reference counting is done
by extent in btrfs, so it is only possible to free complete extents.
You have to replace the whole extent with references to data from
somewhere else, creating data copies as required to do so where no
duplicate copy of the data is available for reflink.

Note the phrase "on btrfs" appears often here...other filesystems manage
to solve these problems without special effort.  Again, if you're looking
for important btrfs things to work on, maybe start with in-place extent
splitting.

On XFS you can split extents in place and reference counting is by
block, so you can do alternating defrag and dedupe passes.  It's still
suboptimal (you still waste iops to defrag data blocks that are
immediately eliminated by the following dedupe), but it's orders of
magnitude better than btrfs.

> > As extents get bigger, the seeking and overhead to read them gets smaller.
> > I'd want to defrag many consecutive 4K extents, but I wouldn't bother
> > touching 256K extents unless they were in high-traffic files, nor would I
> > bother combining only 2 or 3 4K extents together (there would be around
> > 400K of metadata IO overhead to do so--likely more than what is saved
> > unless the file is very frequently read sequentially).  The incremental
> > gains are inversely proportional to size, while the defrag cost is
> > directly proportional to size.
> 
> "the defrag cost is directly proportional to size" - this is wrong. The
> defrag cost is proportional to file size, not to extent size.

defrag cost is proportional to target extent size (and also reflink
count on btrfs, as btrfs's reflinks are relatively heavy IO-wise).
There are other cost components in extent split and merge operations,
but they are negligible compared to the cost of relocating extent data
blocks or the constant cost of relocating an extent at all.

File size is irrelevant.  We don't need to load all the metadata for
a VM image file into RAM--a 384MB sliding window over the data is more
than enough.  We don't even care which files an extent belongs to--its
logically adjacent neighbors are just neighboring items in the subvol
metadata trees, and at EOF we just reach a branch tip on the extent
adjacency graph.

btrfs tree search can filter out old extents that were processed earlier
and not modified since.  A whole filesystem can be scanned for new
metadata changes in a few milliseconds.  Extents that require defrag
can be identified from the metadata records.  All of these operations
are fast, you can do a few hundred from a single metadata page read.

Once you start copying data blocks from one part of the disk to another,
the copying easily takes 95% of the total IO time.  Since you don't need
to copy extents when they are above the target extent size, they don't
contribute to the defrag cost.  Only the extents below the target extent
size add to the cost, and the number and cost of those are proportional
to the target extent size (on random input, bigger target size = more
extents to defrag, and the average extent is bigger).
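
An illustrative toy model of that cost claim (made-up extent sizes, Python):

def defrag_copy_cost(extent_sizes, target):
    """Bytes that must be copied: only extents below the target move."""
    return sum(s for s in extent_sizes if s < target)

extents = [4 << 10, 16 << 10, 64 << 10, 256 << 10, 1 << 20, 128 << 20]
for target in (64 << 10, 192 << 10, 1 << 20):
    print(target >> 10, "KiB target ->", defrag_copy_cost(extents, target), "bytes")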

> Before a file is defragmented, the defrag should split its extents so that
> each one is sufficiently small, let's say 32 MB at most. That fixes the
> issue. This was mentioned in my answer to Austin S. Hemmelgarn.
> 
> Then, as the final stage of the defrag, the extents should be merged into
> bigger ones of desired size.

You can try that idea on other filesystems.  It doesn't work on btrfs.
There is no in-place extent split or merge operator, and it's non-trivial
to implement one that doesn't ruin basic posix performance, or make
disk format changes so significant that the resulting filesystem can't
be called btrfs any more.

> > Also, quite a few heavily-fragmented files only ever get read once,
> > or are read only during low-cost IO times (e.g. log files during
> > maintenance windows).  For those, a defrag is pure wasted iops.
> 
> You don't know that because you can't predict the future. Therefore, defrag
> is never a waste because the future is unknown.

Sure we do.  If your DBA knows their job, they keep an up-to-date list of
table files that are write-once-read-never (where defrag is pointless)
and another list of files that get hit with full table scans all day long
(where defrag may not be enough, and you have to resort to database table
"cluster" commands).  Sometimes past data doesn't predict future events,
but not nearly often enough to justify the IO cost of blindly defragging
everything.

On machines where write iops are at 35% of capacity, trying to run defrag
will just break the machine (2x more write cost = machine does not keep
up with workload any more and gets fired).  It won't matter what you
want to do in the future if you can't physically do defrag in the present.

Quite often there is sufficient bandwidth for _some_ defrag, and if it's
done correctly, defrag can make more bandwidth available.  A sorted list
of important areas on the filesystem to defrag is a useful tool.

> > That would depend on where those extents are, how big the references
> > are, etc.  In some cases a million references are fine, in other cases
> > even 20 is too many.  After doing the analysis for each extent in a
> > filesystem, you'll probably get an average around some number, but
> > setting the number first is putting the cart before the horse.
> 
> This is a tiny detail not worthy of consideration at this stage of planning.
> It can be solved.

I thought so too, years ago.  Now I know better.  These "tiny details" are
central to doing performant layout optimization *on btrfs*, and if you're
not planning around them (or planning to fix the underlying problems
in btrfs first) then the end result is going to suck.  Or you'll plan
your way off of btrfs and onto a different filesystem, which I suppose
is also a valid way to solve the problem.

> Actually, many of the problems that you wrote about so far in this thread
> are not problems in my imagined implementation of defrag, which solves
> them all. 

That's because your imagined implementation of defrag runs on an equally
imaginary filesystem.  Maybe you'd like to try it out on btrfs instead?

> The problems you wrote about are mostly problems of this
> implementation/library of yours.

The library is basically a workaround for an extent-oriented application
to use file-oriented kernel API.  The library doesn't deal with core
btrfs issues; those are handled in the application.

The problem with the current kernel API starts after we determine that
we need to do something to extent E1.  The kernel API deals only with
open FDs and offsets, but E1 is a virtual block address, so we have to:

	- search backrefs for some file F1 that contains one or more
	blocks of E1

	- determine the offset O1 of the relevant parts of E1 within F1

	- find the name of F1, open it, and get a fd

	- pass (fd, O1) to the relevant kernel ioctl

	- the kernel ioctl looks up F1 and O1 to (hopefully) find E1

	- the kernel does whatever we wanted done to E1

	- if we want to do something to all the references to E1, or if F1
	does not refer to all of E1, repeat the above for each reference
	until all of E1 is covered.

The library does the above API translation.  Ideally the kernel would
just take E1 directly from an ioctl argument, eliminating all but the
"do whatever we wanted done on E1 and/or its refs" steps.  With the
right kernel API, we'd never need to know about O1 or F1 or any of the
extent refs unless we wanted to for some reason (e.g. to pretty-print
an error message with a filename instead of a raw block address).
All the information we care about to generate the commands we issue to
the kernel is in the btrees, which can be read most efficiently directly
from the filesystem.
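
Roughly, the translation loop looks like this (Python sketch;
logical_to_backrefs(), path_of() and extent_ioctl() are stand-ins for the
backref/path lookups and the file-based ioctl, not real kernel or library
interfaces):

import os

def act_on_extent(e1_vaddr, e1_len, logical_to_backrefs, path_of, extent_ioctl):
    covered = 0
    # Each backref names a file F1 (root, inode) and the offset O1 where
    # part of E1 appears in it.
    for root, inode, offset, length in logical_to_backrefs(e1_vaddr, e1_len):
        path = path_of(root, inode)           # find a name for F1
        fd = os.open(path, os.O_RDONLY)
        try:
            extent_ioctl(fd, offset, length)  # kernel maps (fd, O1) back to E1
        finally:
            os.close(fd)
        covered += length
        if covered >= e1_len:                 # repeat until all of E1 is covered
            break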

The rest of the problems I've described are btrfs limitations and
workarounds you will encounter when your imaginary defrag meets the cold,
hard, concrete reality of btrfs.  The library doesn't do very much about
these issues--they're baked deep into btrfs, and can't be solved by just
changing a top-level API.  If the top-level code tries to ignore the
btrfs low-level implementation details, then the resulting performance
will be catastrophically bad (i.e. you'll need to contact a btrfs expert
to recover the filesystem).

> So yes, you can do things that way as in your library, but that is inferior
> to real defrag.

Again, you're using the word "defrag", and now other words like "inferior"
or "real", in strange ways...

> Now, that split and merge just have to be moved into kernel.
>  - I would keep merge and split as separate operations.
>  - If a split cannot be performed due to problems you mention, then it
> should just return and do nothing. Same with merge.

Split and merge can be emulated on btrfs using data copies, but the costs
of the emulated implementations are not equivalent to the non-copying
implementations.  It is unwise to design higher-level code based on the
non-emulated cost model and then run it on an emulated implementation.

> > Add an operation to replace references to data at extent A with
> > references to data at extent B, and an operation to query extent reference
> > structure efficiently, and you have all the ingredients of an integrated
> > dedupe/defrag/garbage collection tool for btrfs (analogous to the XFS
> > "fsr" process).
> 
> Obviously, some very usefull code. That is good, but perhaps it would be
> better for that code to serve as an example of how it can be done.
> In my imagined defrag, this updating-of-references happens as part of
> flushing the "pending operations buffer", so it will have to be rewritten
> such that it fits into that framework.
> 
> The problem of your defrag is that it is not holistic enough. It has a view
> of only small parts of the filesystem, so it can never be as good as a real
> defrag, which also doesn't unshare extents.

The problem with a "holistic" defrag is that there is nothing such a
defrag could achieve that would be worth the cost of running even a
perfect one.  It would take 2.4 days at full wire speed on one modern
datacenter drive just to move its data even with zero metadata IO.
16TB datacenter drives come in arrays of 3.  A holistic defrag would
take weeks, minimum.

I'm explicitly not writing a defrag that considers a whole filesystem.
That's the very first optimization:  find the 10% of the work that
solves 99% of the problem, and ignore the rest due to the exponentially
increasing costs and diminishing returns.  I'm adding capabilities
to an existing collection of physical layout optimization tools to
defragment unacceptably fragmented files as soon as they appear, just
like it currently removes duplicate blocks as soon as they appear.
The framework scans entire filesystems in a single, incremental pass
taking a few milliseconds.  It tracks write activity as it occurs,
examines new metadata structures soon after they appear (hopefully while
they are still cached in host RAM), and leaves parts of the filesystem
that don't need correction alone.

> Another person said that it is complicated to trace backreferences. So now
> you are saying that it is not.

There's an ioctl, you call it and get a list of backrefs.  The in-kernel
version has a callback function that is passed each backref as
an argument.  This can involve a lot of seeking reads, but it's not
complicated to use.  The list of backrefs is _expanded_ though (it's
in the associative-array form, not the more compact tree form the data
has on disk).  If you really want that last 1% of performance, you need
to read the metadata blocks and interpret the backref tree yourself
(or add hooks to the kernel implementation).

If you know the lower bound on the number of backrefs before enumerating
all of them, you can do math and make sane tradeoffs before touching
an extent.  So if you've seen 100 backrefs and you know that the current
subvol has 20 snapshots, you know there's at least 2000 backrefs, and
you can skip processing the current extent.
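
That tradeoff as a tiny function (illustrative only):

def should_skip(backrefs_seen, snapshots_of_subvol, max_refs=1000):
    """Skip the extent once the lower bound on refs exceeds the budget."""
    return backrefs_seen * snapshots_of_subvol > max_refs

print(should_skip(100, 20))   # True: at least 2000 refs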

> Anyway, such a structure must be available to defrag.
> So, just in case to avoid misunderstandings, this "extent-backrefs"
> associative array would be in-memory, it would cover all extents, the entire
> filesystem structure, and it would be kept in-sync with the filesystem
> during the defrag operation.

That would waste a lot of memory for no particular reason, especially
if you use an associative array instead of a DAG.  The conversion would
multiply out all the shared snapshot metadata pages.  (Yet another problem
you only have to solve on btrfs.)  e.g. my 156GB metadata filesystem,
fully expanded as an associative array, would be a PB or two of array
data assuming zero overhead for the array itself.

I did consider building exactly this a few years ago, basically pushing
the 156GB of metadata through a sort, then consuming all the data in
sequential order.  Then I did the math, and realized I don't own enough
disks to complete the sort operation.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  1:28                                               ` General Zed
@ 2019-09-14  4:28                                                 ` Zygo Blaxell
  2019-09-15 18:05                                                   ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-14  4:28 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Fri, Sep 13, 2019 at 09:28:49PM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> 
> > On Fri, Sep 13, 2019 at 05:25:20AM -0400, General Zed wrote:
> > > 
> > > Quoting General Zed <general-zed@zedlx.com>:
> > > 
> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > >
> > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > > >
> > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > > >
> > > > > > > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
> > > > > > > >
> > > > > > > > At worst, it just has to completely write-out "all
> > > > > > > metadata", all the way up
> > > > > > > > to the super. It needs to be done just once, because
> > > what's the point of
> > > > > > > > writing it 10 times over? Then, the super is updated as
> > > > > > > the final commit.
> > > > > > >
> > > > > > > This is kind of a silly discussion.  The biggest extent possible on
> > > > > > > btrfs is 128MB, and the incremental gains of forcing 128MB
> > > extents to
> > > > > > > be consecutive are negligible.  If you're defragging a 10GB
> > > file, you're
> > > > > > > just going to end up doing 80 separate defrag operations.
> > > > > >
> > > > > > Ok, then the max extent is 128 MB, that's fine. Someone here
> > > > > > previously said
> > > > > > that it is 2 GB, so he has disinformed me (in order to further
> > > his false
> > > > > > argument).
> > > > >
> > > > > If the 128MB limit is removed, you then hit the block group size limit,
> > > > > which is some number of GB from 1 to 10 depending on number of disks
> > > > > available and raid profile selection (the striping raid profiles cap
> > > > > block group sizes at 10 disks, and single/raid1 profiles always use 1GB
> > > > > block groups regardless of disk count).  So 2GB is _also_ a valid extent
> > > > > size limit, just not the first limit that is relevant for defrag.
> > > > >
> > > > > A lot of people get confused by 'filefrag -v' output, which coalesces
> > > > > physically adjacent but distinct extents.  So if you use that tool,
> > > > > it can _seem_ like there is a 2.5GB extent in a file, but it is really
> > > > > 20 distinct 128MB extents that start and end at adjacent addresses.
> > > > > You can see the true structure in 'btrfs ins dump-tree' output.
> > > > >
> > > > > That also brings up another reason why 10GB defrags are absurd on btrfs:
> > > > > extent addresses are virtual.  There's no guarantee that a pair
> > > of extents
> > > > > that meet at a block group boundary are physically adjacent, and after
> > > > > operations like RAID array reorganization or free space defragmentation,
> > > > > they are typically quite far apart physically.
> > > > >
> > > > > > I never said that I would force extents larger than 128 MB.
> > > > > >
> > > > > > If you are defragging a 10 GB file, you'll likely have to do it
> > > > > > in 10 steps,
> > > > > > because the defrag is usually allowed to only use a limited
> > > amount of disk
> > > > > > space while in operation. That has nothing to do with the extent size.
> > > > >
> > > > > Defrag is literally manipulating the extent size.  Fragments and extents
> > > > > are the same thing in btrfs.
> > > > >
> > > > > Currently a 10GB defragment will work in 80 steps, but doesn't
> > > necessarily
> > > > > commit metadata updates after each step, so more than 128MB of temporary
> > > > > space may be used (especially if your disks are fast and empty,
> > > > > and you start just after the end of the previous commit interval).
> > > > > There are some opportunities to coalesce metadata updates, occupying up
> > > > > to a (arbitrary) limit of 512MB of RAM (or when memory pressure forces
> > > > > a flush, whichever comes first), but exploiting those opportunities
> > > > > requires more space for uncommitted data.
> > > > >
> > > > > If the filesystem starts to get low on space during a defrag, it can
> > > > > inject commits to force metadata updates to happen more often, which
> > > > > reduces the amount of temporary space needed (we can't delete
> > > the original
> > > > > fragmented extents until their replacement extent is committed);
> > > however,
> > > > > if the filesystem is so low on space that you're worried about running
> > > > > out during a defrag, then you probably don't have big enough contiguous
> > > > > free areas to relocate data into anyway, i.e. the defrag is just
> > > going to
> > > > > push data from one fragmented location to a different fragmented
> > > location,
> > > > > or bail out with "sorry, can't defrag that."
> > > >
> > > > Nope.
> > > >
> > > > Each defrag "cycle" consists of two parts:
> > > >      1) move-out part
> > > >      2) move-in part
> > > >
> > > > The move-out part selects one contiguous area of the disk. Almost any
> > > > area will do, but some smart choices are better. It then moves-out all
> > > > data from that contiguous area into whatever holes there are left empty
> > > > on the disk. The biggest problem is actually updating the metadata,
> > > > since the updates are not localized.
> > > > Anyway, this part can even be skipped.
> > > >
> > > > The move-in part now populates the completely free contiguous area with
> > > > defragmented data.
> > > >
> > > > In the case that the move-out part needs to be skipped because the
> > > > defrag estimates that the update to metadata will be too big (like in
> > > > the pathological case of a disk with 156 GB of metadata), it can
> > > > successfully defrag by performing only the move-in part. In that case,
> > > > the move-in area is not free of data and "defragmented" data won't be
> > > > fully defragmented. Also, there should be at least 20% free disk space
> > > > in this case in order to avoid defrag turning pathological.
> > > >
> > > > But, these are all some pathological cases. They should be considered in
> > > > some other discussion.
> > > 
> > > I know how to do this pathological case. Figured it out!
> > > 
> > > Yeah, always ask General Zed, he knows the best!!!
> > > 
> > > The move-in phase is not a problem, because this phase generally affects a
> > > low number of files.
> > > 
> > > So, let's consider the move-out phase. The main concern here is that the
> > > move-out area may contain so many different files and fragments that the
> > > move-out forces a practically undoable metadata update.
> > > 
> > > So, the way to do it is to select files for move-out, one by one (or even
> > > more granular, by fragments of files), while keeping track of the size of
> > > the necessary metadata update. When the metadata update exceeds a certain
> > > amount (let's say 128 MB, an amount that can easily fit into RAM), the
> > > move-out is performed with only currently selected files (file fragments).
> > > (The move-out often doesn't affect a whole file since only a part of each
> > > file lies within the move-out area).
> > 
> > This move-out phase sounds like a reinvention of btrfs balance.  Balance
> > already does something similar, and python-btrfs gives you a script to
> > target block groups with high free space fragmentation for balancing.
> > It moves extents (and their references) away from their block group.
> > You get GB-sized (or multi-GB-sized) contiguous free space areas into
> > which you can then allocate big extents.
> 
> Perhaps btrfs balance needs to perform something similar, but I can assure
> you that a balance cannot replace the defrag.

Correct, balance is only half of the solution.

The balance is required for two things on btrfs:  "move-out" phase of
free space defragmentation, and to ensure at least one unallocated block
group exists on the filesystem in case metadata expansion is required.

A btrfs can operate without defrag for...well, forever, defrag is not
necessary at all.  I have dozens of multi-year-old btrfs filesystems of
assorted sizes that have never run defrag even once.

By contrast, running out of unallocated space is a significant problem
that should be corrected with the same urgency as RAID entering degraded
mode.  I generally recommend running 'btrfs balance start -dlimit=1' about
once per day to force one block group to always be empty.

Filesystems that don't maintain unallocated space can run into problems
if metadata runs out of space.  These problems can be inconvenient to
recover from.

> The point and the purpose of "move out" is to create a clean contiguous free
> space area, so that defragmented files can be written into it.

> 
> > > Now the defrag has to decide: whether to continue with another round of the
> > > move-out to get a cleaner move-in area (by repeating the same procedure
> > > above), or should it continue with a move-in into a partially dirty area. I
> > > can't tell you what's better right now, as this can be determined only by
> > > experiments.
> > > 
> > > Lastly, the move-in phase is performed (can be done whether the move-in area
> > > is dirty or completely clean). Again, the same trick can be used: files can
> > > be selected one by one until the calculated metadata update exceeds 128 MB.
> > > However, it is more likely that the size of move-in area will be exhausted
> > > before this happens.
> > > 
> > > This algorithm will work even if you have only 3% free disk space left.
> > 
> > I was thinking more like "you have less than 1GB free on a 1TB filesystem
> > and you want to defrag 128MB things", i.e. <0.1% free space.  If you don't
> > have all the metadata block group free space you need allocated already
> > by that point, you can run out of metadata space and the filesystem goes
> > read-only.  Happens quite often to people.  They don't like it very much.
> 
> The defrag should abort whenever it detects such adverse conditions as 0.1%
> free disk space. In fact, it should probably abort as soon as it detects
> less than 3% free disk space. This is normal and expected. If the user has a
> partition with less than 3% free disk space, he/she should not defrag it
> until he/she frees some space, perhaps by deleting unnecessary data or by
> moving out some data to other partitions.

3% of 45TB is 1.35TB...seems a little harsh.  Recall no extent can be
larger than 128MB, so we're talking about enough space for ten thousand
of defrag's worst-case output extents.  A limit based on absolute numbers
might make more sense, though the only way to really know what the limit is
on any given filesystem is to try to reach it.
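
To make the comparison concrete (same numbers as above, nothing new):

/* Percentage-based vs absolute free-space floor on a large filesystem. */
#include <stdio.h>

int main(void)
{
        const double fs_bytes = 45e12;                   /* 45 TB filesystem */
        const double floor = 0.03 * fs_bytes;            /* "3% free" rule   */
        const double max_extent = 128.0 * 1024 * 1024;   /* 128MB extent cap */

        printf("3%% of 45TB = %.2f TB\n", floor / 1e12);
        printf("= room for about %.0f worst-case 128MB output extents\n",
               floor / max_extent);
        return 0;
}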

> This is not autodefrag. The defrag operation is an on-demand operation. It
> has certain requirements in order to function.
> 
> > > This algorithm will also work if you have metadata of huge size, but in that
> > > case it is better to have much more free disk space (20%) to avoid
> > > significantly slowing down the defrag operation.
> > > 
> > > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  1:50                                             ` General Zed
@ 2019-09-14  4:42                                               ` Zygo Blaxell
  2019-09-14  4:53                                                 ` Zygo Blaxell
  2019-09-15 17:54                                                 ` General Zed
  0 siblings, 2 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-14  4:42 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Btrfs BTRFS

On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> 
> > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
> > > 
> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > 
> > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > >
> > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > >
> > > > > > Don't forget you have to write new checksum and free space tree pages.
> > > > > > In the worst case, you'll need about 1GB of new metadata pages
> > > for each
> > > > > > 128MB you defrag (though you get to delete 99.5% of them immediately
> > > > > > after).
> > > > >
> > > > > Yes, here we are debating some worst-case scenario which is actually
> > > > > impossible in practice due to various reasons.
> > > >
> > > > No, it's quite possible.  A log file written slowly on an active
> > > > filesystem above a few TB will do that accidentally.  Every now and then
> > > > I hit that case.  It can take several hours to do a logrotate on spinning
> > > > arrays because of all the metadata fetches and updates associated with
> > > > worst-case file delete.  Long enough to watch the delete happen, and
> > > > even follow along in the source code.
> > > >
> > > > I guess if I did a proactive defrag every few hours, it might take less
> > > > time to do the logrotate, but that would mean spreading out all the
> > > > seeky IO load during the day instead of getting it all done at night.
> > > > Logrotate does the same job as defrag in this case (replacing a file in
> > > > thousands of fragments spread across the disk with a few large fragments
> > > > close together), except logrotate gets better compression.
> > > >
> > > > To be more accurate, the example I gave above is the worst case you
> > > > can expect from normal user workloads.  If I throw in some reflinks
> > > > and snapshots, I can make it arbitrarily worse, until the entire disk
> > > > is consumed by the metadata update of a single extent defrag.
> > > >
> > > 
> > > I can't believe I am considering this case.
> > > 
> > > So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents
> > > randomly over the entire disk. We have 512 GB free RAM and 2% free disk
> > > space. The file needs to be defragmented.
> > > 
> > > In order to do that, defrag needs to be able to copy-move multiple extents
> > > in one batch, and update the metadata.
> > > 
> > > The metadata has a total of at least 256 million entries, each of some size,
> > > but each one should hold at least a pointer to the extent (8 bytes) and a
> > > checksum (8 bytes): In reality, it could be that there is a lot of other
> > > data there per entry.
> > 
> > It's about 48KB per 4K extent, plus a few hundred bytes on average for each
> > reference.
> 
> Sorry, could you be more clear there? A file fragment/extent that holds
> file data can be any
> size up to 128 MB. What metadata is there per every file fragment/extent?
> 
> Because "48 KB per 4 K extent" ... cannot decode what you mean.

An extent has 3 associated records in btrfs, not including its references.
The first two exist while the extent exists, the third appears after it
is removed.

	- extent tree:  location, size of extent, pointers to backref trees.
	Length is around 60 bytes plus the size of the backref pointer list.

	- csum tree:  location, 1 or more 4-byte csums packed in an array.
	Length of item is number of extent data blocks * 4 bytes plus a
	168-bit header (ish...csums from adjacent extents may be packed
	using a shared header)

	- free space tree:  location, size of free space.  This appears
	when the extent is deleted.  It may be merged with adjacent
	records.  Length is maybe 20 bytes?

Each page contains a few hundred items, so if there are a few hundred
unrelated extents between extents in the log file, each log file extent
gets its own metadata page in each tree.
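
To put a number on that (same figures as above: 16KB metadata pages,
three trees, and the pessimal case where none of the pages are shared
between the extents being removed):

/* Worst-case metadata touched when removing one isolated 4K extent:
 * one 16KB page in each of the extent, csum and free space trees,
 * assuming no page sharing with neighbouring extents.
 */
#include <stdio.h>

int main(void)
{
        const double page = 16.0 * 1024;   /* metadata node size        */
        const double data = 4.0 * 1024;    /* one small log-file extent */
        const int trees = 3;               /* extent, csum, free space  */
        const double touched = trees * page;

        printf("metadata touched per extent: %.0f KB (%.0fx the data size)\n",
               touched / 1024, touched / data);
        return 0;
}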

> Another question is: what is the average size of metadata extents?

Metadata extents are all 16K.

> > > The metadata is organized as a b-tree. Therefore, nearby nodes should
> > > contain data of consecutive file extents.
> > 
> > It's 48KB per item.
> 
> What's the "item"?

Items are the objects stored in the trees.  So one extent item, one csum
item, and one free space tree item, all tied to the 4K extent from the
log file.

> > As you remove the original data extents, you will
> > be touching a 16KB page in three trees for each extent that is removed:
> > Free space tree, csum tree, and extent tree.  This happens after the
> > merged extent is created.  It is part of the cleanup operation that
> > gets rid of the original 4K extents.
> 
> Ok, but how big are free space tree and csum tree?

At least 12GB in the worst-case example.

> Also, when moving a file to defragment it, there should still be some
> locality even in free space tree.

It is only guaranteed in the defrag result, because the defrag result is
the only thing that is necessarily physically contiguous.

> And the csum tree, it should be ordered similar to free space tree, right?

They are all ordered by extent physical address (same physical blocks,
same metadata item key).

> > Because the file was written very slowly on a big filesystem, the extents
> > are scattered pessimally all over the virtual address space, not packed
> > close together.  If there are a few hundred extent allocations between
> > each log extent, then they will all occupy separate metadata pages.
> 
> Ok, now you are talking about your pathological case. Let's consider it.
> 
> Note that there is very little that can be done in this case that you are
> describing. In order to defrag such a file, either the defrag will take many
> small steps and therefore it will be slow (because each step needs to
> perform an update to the metadata), or the defrag can do it in one big step
> and use a huge amount of RAM.
> 
> So, the best thing to be done in this situation is to allow the user to
> specify the amount of RAM that defrag is allowed to use, so that the user
> decides which of the two (slow defrag or lots of RAM) he wants.
> 
> There is no way around it. There is no better defrag than the one that has
> ALL information at hand, that one will be the fastest and the best defrag.
> 
> > When it is time to remove them, each of these pages must be updated.
> > This can be hit in a number of places in btrfs, including overwrite
> > and delete.
> > 
> > There's also 60ish bytes per extent in any subvol trees the file
> > actually appears in, but you do get locality in that one (the key is
> > inode and offset, so nothing can get between them and space them apart).
> > That's 12GB and change (you'll probably completely empty most of the
> > updated subvol metadata pages, so we can expect maybe 5 pages to remain
> > including root and interior nodes).  I haven't been unlucky enough to
> > get a "natural" 12GB, but I got over 1GB a few times recently.
> 
> The thing that I figured out (and I have already written it down in another
> post) is that the defrag can CHOOSE AT WILL how large an update to metadata it
> wants to perform (within the limit of available RAM). The defrag can select,
> by itself, the most efficient way to proceed while still honoring the
> user-supplied limit on RAM.

Yeah, it can update half the reflinks and pause for a commit, or similar.
If there's a power failure then there will be a duplicate extent with some
of the references to one copy and some to the other, but this is probably
rare enough not to matter.

> > Reflinks can be used to multiply that 12GB arbitrarily--you only get
> > locality if the reflinks are consecutive in (inode, offset) space,
> > so if the reflinks are scattered across subvols or files, they won't
> > share pages.
> 
> OK.
> 
> Yes, given a sufficiently pathological case, the defrag will take forever.
> There is nothing unexpected there. I agree on that point. The defrag always
> functions within certain prerequisites.
> 
> > > The trick, in this case, is to select one part of "ultralog" which is
> > > localized in the metadata, and defragment it. Repeating this step will
> > > ultimately defragment the entire file.
> > > 
> > > So, the defrag selects some part of metadata which is entirely a descendant
> > > of some b-tree node not far from the bottom of b-tree. It selects it such
> > > that the required update to the metadata is less than, let's say, 64 MB, and
> > > simultaneously the affected "ultralog" file fragments total less than 512 MB
> > > (therefore, less than 128 thousand metadata leaf entries, each pointing to a
> > > 4 KB fragment). Then it finds all the file extents pointed to by that part
> > > of metadata. They are consecutive (as file fragments), because we have
> > > selected such part of metadata. Now the defrag can safely copy-move those
> > > fragments to a new area and update the metadata.
> > > 
> > > In order to quickly select that small part of metadata, the defrag needs a
> > > metadata cache that can hold somewhat more than 128 thousand localized
> > > metadata leaf entries. That fits into 128 MB RAM definitely.
> > > 
> > > Of course, there are many other small issues there, but this outlines the
> > > general procedure.
> > > 
> > > Problem solved?
> 
> > Problem missed completely.  The forward reference updates were the only
> > easy part.
> 
> Oh, I'll reply in another mail, this one is getting too tiring.
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  4:42                                               ` Zygo Blaxell
@ 2019-09-14  4:53                                                 ` Zygo Blaxell
  2019-09-15 17:54                                                 ` General Zed
  1 sibling, 0 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-14  4:53 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Btrfs BTRFS

On Sat, Sep 14, 2019 at 12:42:19AM -0400, Zygo Blaxell wrote:
> On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
> > 
> > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > 
> > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
> > > > 
> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > 
> > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > > >
> > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > > >
> > > > > > > Don't forget you have to write new checksum and free space tree pages.
> > > > > > > In the worst case, you'll need about 1GB of new metadata pages
> > > > for each
> > > > > > > 128MB you defrag (though you get to delete 99.5% of them immediately
> > > > > > > after).
> > > > > >
> > > > > > Yes, here we are debating some worst-case scenario which is actually
> > > > > > impossible in practice due to various reasons.
> > > > >
> > > > > No, it's quite possible.  A log file written slowly on an active
> > > > > filesystem above a few TB will do that accidentally.  Every now and then
> > > > > I hit that case.  It can take several hours to do a logrotate on spinning
> > > > > arrays because of all the metadata fetches and updates associated with
> > > > > worst-case file delete.  Long enough to watch the delete happen, and
> > > > > even follow along in the source code.
> > > > >
> > > > > I guess if I did a proactive defrag every few hours, it might take less
> > > > > time to do the logrotate, but that would mean spreading out all the
> > > > > seeky IO load during the day instead of getting it all done at night.
> > > > > Logrotate does the same job as defrag in this case (replacing a file in
> > > > > thousands of fragments spread across the disk with a few large fragments
> > > > > close together), except logrotate gets better compression.
> > > > >
> > > > > To be more accurate, the example I gave above is the worst case you
> > > > > can expect from normal user workloads.  If I throw in some reflinks
> > > > > and snapshots, I can make it arbitrarily worse, until the entire disk
> > > > > is consumed by the metadata update of a single extent defrag.
> > > > >
> > > > 
> > > > I can't believe I am considering this case.
> > > > 
> > > > So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents
> > > > randomly over the entire disk. We have 512 GB free RAM and 2% free disk
> > > > space. The file needs to be defragmented.
> > > > 
> > > > In order to do that, defrag needs to be able to copy-move multiple extents
> > > > in one batch, and update the metadata.
> > > > 
> > > > The metadata has a total of at least 256 million entries, each of some size,
> > > > but each one should hold at least a pointer to the extent (8 bytes) and a
> > > > checksum (8 bytes): In reality, it could be that there is a lot of other
> > > > data there per entry.
> > > 
> > > It's about 48KB per 4K extent, plus a few hundred bytes on average for each
> > > reference.
> > 
> > Sorry, could you be more clear there? A file fragment/extent that holds
> > file data can be any
> > size up to 128 MB. What metadata is there per every file fragment/extent?
> > 
> > Because "48 KB per 4 K extent" ... cannot decode what you mean.
> 
> An extent has 3 associated records in btrfs, not including its references.
> The first two exist while the extent exists, the third appears after it
> is removed.
> 
> 	- extent tree:  location, size of extent, pointers to backref trees.
> 	Length is around 60 bytes plus the size of the backref pointer list.
> 
> 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
> 	Length of item is number of extent data blocks * 4 bytes plus a
> 	168-bit header (ish...csums from adjacent extents may be packed
> 	using a shared header)
> 
> 	- free space tree:  location, size of free space.  This appears
> 	when the extent is deleted.  It may be merged with adjacent
> 	records.  Length is maybe 20 bytes?
> 
> Each page contains a few hundred items, so if there are a few hundred
> unrelated extents between extents in the log file, each log file extent
> gets its own metadata page in each tree.
> 
> > Another question is: what is the average size of metadata extents?
> 
> Metadata extents are all 16K.
> 
> > > > The metadata is organized as a b-tree. Therefore, nearby nodes should
> > > > contain data of consecutive file extents.
> > > 
> > > It's 48KB per item.
> > 
> > What's the "item"?
> 
> Items are the objects stored in the trees.  So one extent item, one csum
> item, and one free space tree item, all tied to the 4K extent from the
> log file.
> 
> > > As you remove the original data extents, you will
> > > be touching a 16KB page in three trees for each extent that is removed:
> > > Free space tree, csum tree, and extent tree.  This happens after the
> > > merged extent is created.  It is part of the cleanup operation that
> > > gets rid of the original 4K extents.
> > 
> > Ok, but how big are free space tree and csum tree?
> 
> At least 12GB in the worst-case example.

The filesystem where I was hitting the 1GB-metadata issue has 79GB
of metadata.  An average of 500-1000 new extents are created on the
filesystem between log file page writes, maybe up to 5000 during peak
activity periods.

> > Also, when moving a file to defragment it, there should still be some
> > locality even in free space tree.
> 
> It is only guaranteed in the defrag result, because the defrag result is
> the only thing that is necessarily physically contiguous.
> 
> > And the csum tree, it should be ordered similar to free space tree, right?
> 
> They are all ordered by extent physical address (same physical blocks,
> same metadata item key).
> 
> > > Because the file was written very slowly on a big filesystem, the extents
> > > are scattered pessimally all over the virtual address space, not packed
> > > close together.  If there are a few hundred extent allocations between
> > > each log extent, then they will all occupy separate metadata pages.
> > 
> > Ok, now you are talking about your pathological case. Let's consider it.
> > 
> > Note that there is very little that can be done in this case that you are
> > describing. In order to defrag such a file, either the defrag will take many
> > small steps and therefore it will be slow (because each step needs to
> > perform an update to the metadata), or the defrag can do it in one big step
> > and use a huge amount of RAM.
> > 
> > So, the best thing to be done in this situation is to allow the user to
> > specify the amount of RAM that defrag is allowed to use, so that the user
> > decides which of the two (slow defrag or lots of RAM) he wants.
> > 
> > There is no way around it. There is no better defrag than the one that has
> > ALL information at hand, that one will be the fastest and the best defrag.
> > 
> > > When it is time to remove them, each of these pages must be updated.
> > > This can be hit in a number of places in btrfs, including overwrite
> > > and delete.
> > > 
> > > There's also 60ish bytes per extent in any subvol trees the file
> > > actually appears in, but you do get locality in that one (the key is
> > > inode and offset, so nothing can get between them and space them apart).
> > > That's 12GB and change (you'll probably completely empty most of the
> > > updated subvol metadata pages, so we can expect maybe 5 pages to remain
> > > including root and interior nodes).  I haven't been unlucky enough to
> > > get a "natural" 12GB, but I got over 1GB a few times recently.
> > 
> > The thing that I figured out (and I have already written it down in another
> > post) is that the defrag can CHOOSE AT WILL how large an update to metadata it
> > wants to perform (within the limit of available RAM). The defrag can select,
> > by itself, the most efficient way to proceed while still honoring the
> > user-supplied limit on RAM.
> 
> Yeah, it can update half the reflinks and pause for a commit, or similar.
> If there's a power failure then there will be a duplicate extent with some
> of the references to one copy and some to the other, but this is probably
> rare enough not to matter.
> 
> > > Reflinks can be used to multiply that 12GB arbitrarily--you only get
> > > locality if the reflinks are consecutive in (inode, offset) space,
> > > so if the reflinks are scattered across subvols or files, they won't
> > > share pages.
> > 
> > OK.
> > 
> > Yes, given a sufficiently pathological case, the defrag will take forever.
> > There is nothing unexpected there. I agree on that point. The defrag always
> > functions within certain prerequisites.
> > 
> > > > The trick, in this case, is to select one part of "ultralog" which is
> > > > localized in the metadata, and defragment it. Repeating this step will
> > > > ultimately defragment the entire file.
> > > > 
> > > > So, the defrag selects some part of metadata which is entirely a descendant
> > > > of some b-tree node not far from the bottom of b-tree. It selects it such
> > > > that the required update to the metadata is less than, let's say, 64 MB, and
> > > > simultaneously the affected "ultralog" file fragments total less than 512 MB
> > > > (therefore, less than 128 thousand metadata leaf entries, each pointing to a
> > > > 4 KB fragment). Then it finds all the file extents pointed to by that part
> > > > of metadata. They are consecutive (as file fragments), because we have
> > > > selected such part of metadata. Now the defrag can safely copy-move those
> > > > fragments to a new area and update the metadata.
> > > > 
> > > > In order to quickly select that small part of metadata, the defrag needs a
> > > > metadata cache that can hold somewhat more than 128 thousand localized
> > > > metadata leaf entries. That fits into 128 MB RAM definitely.
> > > > 
> > > > Of course, there are many other small issues there, but this outlines the
> > > > general procedure.
> > > > 
> > > > Problem solved?
> > 
> > > Problem missed completely.  The forward reference updates were the only
> > > easy part.
> > 
> > Oh, I'll reply in another mail, this one is getting too tiring.
> > 
> > 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 19:40                                     ` General Zed
@ 2019-09-14 15:10                                       ` Jukka Larja
  0 siblings, 0 replies; 111+ messages in thread
From: Jukka Larja @ 2019-09-14 15:10 UTC (permalink / raw)
  To: General Zed; +Cc: linux-btrfs

General Zed wrote on 13.9.2019 at 22.40:

> I think 
> everyone here can see that I am versed in programming, and very proficient 
> in solving the high-level issues.

No. I certainly can't. Doesn't matter how good a programmer you are. 
No-one's good enough to get anywhere near my code base with the sort of 
attitude you have.

-- 
      ...Elämälle vierasta toimintaa...
     Jukka Larja, Roskakori@aarghimedes.fi

<saylan> I just set up port forwards to defense.gov
<saylan> anyone scanning me now will be scanning/attacking the DoD :D
<renderbod> O.o
<bolt> that's... not exactly how port forwarding works
<saylan> ?
- Quote Database, http://www.bash.org/?954232 -

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-13 11:04                                     ` Austin S. Hemmelgarn
  2019-09-13 20:43                                       ` Zygo Blaxell
@ 2019-09-14 18:29                                       ` Chris Murphy
  2019-09-14 23:39                                         ` Zygo Blaxell
  1 sibling, 1 reply; 111+ messages in thread
From: Chris Murphy @ 2019-09-14 18:29 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Zygo Blaxell, General Zed, Chris Murphy, Btrfs BTRFS

On Fri, Sep 13, 2019 at 5:04 AM Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
>
> Do you have a source for this claim of a 128MB max extent size?  Because
> everything I've seen indicates the max extent size is a full data chunk
> (so 1GB for the common case, potentially up to about 5GB for really big
> filesystems)

Yeah a block group can be a kind of "super extent". I think the
EXTENT_DATA maxes out at 128M but they are often contiguous, for
example

    item 308 key (5741459 EXTENT_DATA 0) itemoff 39032 itemsize 53
        generation 241638 type 1 (regular)
        extent data disk byte 193851400192 nr 134217728
        extent data offset 0 nr 134217728 ram 134217728
        extent compression 0 (none)
    item 309 key (5741459 EXTENT_DATA 134217728) itemoff 38979 itemsize 53
        generation 241638 type 1 (regular)
        extent data disk byte 193985617920 nr 134217728
        extent data offset 0 nr 134217728 ram 134217728
        extent compression 0 (none)
    item 310 key (5741459 EXTENT_DATA 268435456) itemoff 38926 itemsize 53
        generation 241638 type 1 (regular)
        extent data disk byte 194119835648 nr 134217728
        extent data offset 0 nr 134217728 ram 134217728
        extent compression 0 (none)

Where FIEMAP has a different view (via filefrag -v)

 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  131071:   47327002..  47458073: 131072:
   1:   131072..  294911:   47518701..  47682540: 163840:   47458074:
   2:   294912..  360447:   50279681..  50345216:  65536:   47682541:
   3:   360448..  499871:   50377984..  50517407: 139424:   50345217: last,eof
Fedora-Workstation-Live-x86_64-31_Beta-1.1.iso: 4 extents found

Those extents are all bigger than 128M. But they're each made up of
contiguous EXTENT_DATA items.

Also, the EXTENT_DATA size goes to a 128K max for any compressed
files, so you get an explosive number of EXTENT_DATA items on
compressed file systems, and thus metadata to rewrite.
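
To get a feel for the item counts involved (my arithmetic, just
applying the two limits mentioned above to an example 10 GiB file):

/* Minimum number of EXTENT_DATA items for a fully-written example file
 * at the 128MB (uncompressed) and 128KB (compressed) extent size limits.
 */
#include <stdio.h>

int main(void)
{
        const double file = 10.0 * 1024 * 1024 * 1024;   /* 10 GiB example */
        const double max_plain = 128.0 * 1024 * 1024;
        const double max_compr = 128.0 * 1024;

        printf("uncompressed: %6.0f items minimum\n", file / max_plain);
        printf("compressed:   %6.0f items minimum\n", file / max_compr);
        return 0;
}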

I wonder if, instead of a rewrite of defrag, there could be
improvements to the allocator to write bigger extents. I guess the
problem really comes from file appends? Smarter often means slower, but
perhaps it could be a variation on autodefrag?


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14 18:29                                       ` Chris Murphy
@ 2019-09-14 23:39                                         ` Zygo Blaxell
  0 siblings, 0 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-14 23:39 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, General Zed, Btrfs BTRFS

On Sat, Sep 14, 2019 at 12:29:09PM -0600, Chris Murphy wrote:
> On Fri, Sep 13, 2019 at 5:04 AM Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> >
> > Do you have a source for this claim of a 128MB max extent size?  Because
> > everything I've seen indicates the max extent size is a full data chunk
> > (so 1GB for the common case, potentially up to about 5GB for really big
> > filesystems)
> 
> Yeah a block group can be a kind of "super extent". I think the
> EXTENT_DATA maxes out at 128M but they are often contiguous, for
> example
> 
>     item 308 key (5741459 EXTENT_DATA 0) itemoff 39032 itemsize 53
>         generation 241638 type 1 (regular)
>         extent data disk byte 193851400192 nr 134217728
>         extent data offset 0 nr 134217728 ram 134217728
>         extent compression 0 (none)
>     item 309 key (5741459 EXTENT_DATA 134217728) itemoff 38979 itemsize 53
>         generation 241638 type 1 (regular)
>         extent data disk byte 193985617920 nr 134217728
>         extent data offset 0 nr 134217728 ram 134217728
>         extent compression 0 (none)
>     item 310 key (5741459 EXTENT_DATA 268435456) itemoff 38926 itemsize 53
>         generation 241638 type 1 (regular)
>         extent data disk byte 194119835648 nr 134217728
>         extent data offset 0 nr 134217728 ram 134217728
>         extent compression 0 (none)
> 
> Where FIEMAP has a different view (via filefrag -v)
> 
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..  131071:   47327002..  47458073: 131072:
>    1:   131072..  294911:   47518701..  47682540: 163840:   47458074:
>    2:   294912..  360447:   50279681..  50345216:  65536:   47682541:
>    3:   360448..  499871:   50377984..  50517407: 139424:   50345217: last,eof
> Fedora-Workstation-Live-x86_64-31_Beta-1.1.iso: 4 extents found
> 
> Those extents are all bigger than 128M. But they're each made up of
> contiguous EXTENT_DATA items.
> 
> Also, the EXTENT_DATA size goes to a 128K max for any compressed
> files, so you get an explosive number of EXTENT_DATA items on
> compressed file systems, and thus metadata to rewrite.

The compressed extents tend to be physically contiguous as well, so
quantitatively they aren't much of a problem.  There's more space used in
metadata, but that is compensated by less space in data.  In the subvol
trees, logically contiguous extents are always logically contiguous by
key order, so they are packed densely in subvol metadata pages.

In extent, csum, and free space trees, when there is physical
contiguity--or just proximity--the extent's items are packed into the
same metadata pages, keeping their costs down.  That's true of all
small extents, not just compressed ones.  Note that contiguity isn't
necessary for metadata space efficiency--the extents just have to be
close together, they don't need to be seamless, or in order (at least
not for this reason).

Writes that are separated in _time_ are a different problem, and
potentially much worse than the compression case.  If you have a file that
consists of lots of extents that were written with significant allocations
to other files between them, that file becomes a metadata monster that
can create massive commit latencies when it is deleted or modified.
If you unpack tarballs or build sources or rsync backup trees or really
any two or more writing tasks at the same time on a big btrfs filesystem,
you can run into cases where the metadata:data ratio goes above 1.0 during
updates _and_ the metadata is randomly distributed physically.  Commits
after a big delete run for hours.

> I wonder if instead of a rewrite of defragmenting, if there could be
> improvements to the allocator to write bigger extents. I guess the
> problem really comes from file appends? Smarter often means slower but
> perhaps it could be a variation on autodefrag?

Physically dispersed files can be fixed by defrag, but directory trees
are a little different.  The current defrag doesn't look at the physical
distances between files, only the extents within a single file, so it
doesn't help when you have a big fragmented directory tree of many small
not-fragmented files.  IOW defrag helps with 'rm -f' performance but not
'rm -rf' performance.

Other filesystems have allocator heuristics that reserve space near
growing files, or try to pre-divide the free space to spread out
files belonging to different directories or created by two processes.
This is an attempt to fix the problem before it occurs, and sometimes
it works; however, the heuristics have to match the reality or it just
makes things worse, and extra complexity breeds bugs, e.g. the recent
fix for a bug where the allocator tried to give every thread its own
block group--i.e. 20 threads writing 4K each could hit ENOSPC if there
was less than 20GB of unallocated space.

I think the best approach may be to attack the problem quantitatively
with an autodefrag agent:  keep the write path fast and simple, but
detect areas where problems are occurring--i.e. where the ratio of extent
metadata locality to physical locality is low--and clean them up with
some minimal data relocation.  Note that's somewhat different from what
the current kernel autodefrag does.

In absolute terms autodefrag is worse--ideally we'd just put the data
in the right place from the start, not write it in the wrong place
then spend more iops fixing it later--but not all iops have equal cost.
In some cases there is an opportunity to trade cheap iops at one time
for expensive iops at a different time, and a userspace agent can invest
more time, memory, and code complexity on that trade than the kernel.

Some back-of-the-envelope math says we don't need to do very much
post-processing work to deal with the very worst cases:  keep extent
sizes over a few hundred KB, and keep small files not more than about
5-10 metadata items away from their logical neighbors, and we avoid the
worst-case 12.0 metadata-to-data ratios during updates.  Compared to
those, other inefficiencies are trivial.
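
One way to see why "a few hundred KB" is enough (my numbers, reusing
the pessimal three-trees-times-16KB-per-extent model from earlier in
the thread, with no page sharing between neighbouring extents):

/* Worst-case metadata:data ratio during an update, as a function of
 * extent size, assuming each extent dirties one 16KB page in each of
 * three trees and shares no pages with its neighbours.
 */
#include <stdio.h>

int main(void)
{
        const double page = 16.0 * 1024;
        const double sizes[] = { 4096, 64 * 1024, 256 * 1024,
                                 1024 * 1024, 128.0 * 1024 * 1024 };
        const int n = sizeof(sizes) / sizeof(sizes[0]);

        for (int i = 0; i < n; i++)
                printf("%10.0f-byte extents: metadata:data ratio %.3f\n",
                       sizes[i], 3.0 * page / sizes[i]);
        return 0;
}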

> 
> -- 
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  4:42                                               ` Zygo Blaxell
  2019-09-14  4:53                                                 ` Zygo Blaxell
@ 2019-09-15 17:54                                                 ` General Zed
  2019-09-16 22:51                                                   ` Zygo Blaxell
  1 sibling, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-15 17:54 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>> > >
>> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > >
>> > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > > > >
>> > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > > > >
>> > > > > > Don't forget you have to write new checksum and free  
>> space tree pages.
>> > > > > > In the worst case, you'll need about 1GB of new metadata pages
>> > > for each
>> > > > > > 128MB you defrag (though you get to delete 99.5% of them  
>> immediately
>> > > > > > after).
>> > > > >
>> > > > > Yes, here we are debating some worst-case scenario which  
>> is actually
>> > > > > impossible in practice due to various reasons.
>> > > >
>> > > > No, it's quite possible.  A log file written slowly on an active
>> > > > filesystem above a few TB will do that accidentally.  Every  
>> now and then
>> > > > I hit that case.  It can take several hours to do a logrotate  
>> on spinning
>> > > > arrays because of all the metadata fetches and updates associated with
>> > > > worst-case file delete.  Long enough to watch the delete happen, and
>> > > > even follow along in the source code.
>> > > >
>> > > > I guess if I did a proactive defrag every few hours, it might  
>> take less
>> > > > time to do the logrotate, but that would mean spreading out all the
>> > > > seeky IO load during the day instead of getting it all done at night.
>> > > > Logrotate does the same job as defrag in this case (replacing  
>> a file in
>> > > > thousands of fragments spread across the disk with a few  
>> large fragments
>> > > > close together), except logrotate gets better compression.
>> > > >
>> > > > To be more accurate, the example I gave above is the worst case you
>> > > > can expect from normal user workloads.  If I throw in some reflinks
>> > > > and snapshots, I can make it arbitrarily worse, until the entire disk
>> > > > is consumed by the metadata update of a single extent defrag.
>> > > >
>> > >
>> > > I can't believe I am considering this case.
>> > >
>> > > So, we have a 1TB log file "ultralog" split into 256 million 4  
>> KB extents
>> > > randomly over the entire disk. We have 512 GB free RAM and 2% free disk
>> > > space. The file needs to be defragmented.
>> > >
>> > > In order to do that, defrag needs to be able to copy-move  
>> multiple extents
>> > > in one batch, and update the metadata.
>> > >
>> > > The metadata has a total of at least 256 million entries, each  
>> of some size,
>> > > but each one should hold at least a pointer to the extent (8  
>> bytes) and a
>> > > checksum (8 bytes): In reality, it could be that there is a lot of other
>> > > data there per entry.
>> >
>> > It's about 48KB per 4K extent, plus a few hundred bytes on  
>> average for each
>> > reference.
>>
>> Sorry, could you be more clear there? A file fragment/extent that holds
>> file data can be any
>> size up to 128 MB. What metadata is there per every file fragment/extent?
>>
>> Because "48 KB per 4 K extent" ... cannot decode what you mean.
>
> An extent has 3 associated records in btrfs, not including its references.
> The first two exist while the extent exists, the third appears after it
> is removed.
>
> 	- extent tree:  location, size of extent, pointers to backref trees.
> 	Length is around 60 bytes plus the size of the backref pointer list.

Wait.. and where are the reflinks? Backrefs are there for going up the  
tree, but where are reflinks for going down the tree?

So, you are saying that backrefs are already in the extent tree (or  
reachable from it). I didn't know that, that information makes my  
defrag much simpler to implement and describe. Someone in this thread  
has previously misled me to believe that backref information is not  
easily available.

>
> 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
> 	Length of item is number of extent data blocks * 4 bytes plus a
> 	168-bit header (ish...csums from adjacent extents may be packed
> 	using a shared header)
>
> 	- free space tree:  location, size of free space.  This appears
> 	when the extent is deleted.  It may be merged with adjacent
> 	records.  Length is maybe 20 bytes?
>
> Each page contains a few hundred items, so if there are a few hundred
> unrelated extents between extents in the log file, each log file extent
> gets its own metadata page in each tree.

As far as I can understand it, the extents in the extent tree are  
indexed (keyed) by inode&offset. Therefore, no matter how many  
unrelated extents there are between (physical locations of data)  
extents in the log file, the log file extent tree entries will  
(generally speaking) be localized, because multiple extent entries  
(extent items) are bunched together in one 16 KB metadata extent node.

>> Another question is: what is the average size of metadata extents?
>
> Metadata extents are all 16K.
>
>> > > The metadata is organized as a b-tree. Therefore, nearby nodes should
>> > > contain data of consecutive file extents.
>> >
>> > It's 48KB per item.
>>
>> What's the "item"?
>
> Items are the objects stored in the trees.  So one extent item, one csum
> item, and one free space tree item, all tied to the 4K extent from the
> log file.
>
>> > As you remove the original data extents, you will
>> > be touching a 16KB page in three trees for each extent that is removed:
>> > Free space tree, csum tree, and extent tree.  This happens after the
>> > merged extent is created.  It is part of the cleanup operation that
>> > gets rid of the original 4K extents.
>>
>> Ok, but how big are free space tree and csum tree?
>
> At least 12GB in the worst-case example.

The "worst case example" is when all file data extents are 4 KB in  
size, with a 4 KB hole between every two extents. Such an example doesn't  
need to be considered because it is irrelevant.

But, if you were to consider it, you would quickly figure out that a  
good defrag solution would succeed in defragging this abomination of a  
filesystem; it would just take some extra time to do it.

>> Also, when moving a file to defragment it, there should still be some
>> locality even in free space tree.
>
> It is only guaranteed in the defrag result, because the defrag result is
> the only thing that is necessarily physically contiguous.

In order not to lose sight:
This argument is related to "how big a metadata update there needs to  
be" for some pathological cases. The worry is that the metadata update  
is going to be larger than the file data update. The word "update"  
here does NOT refer to in-place operations.

And my answer is: we digressed so badly that we lost sight of the  
above original question.

And the answer to the original question is: there is no problem. Why?  
Because there is no better solution. Either you don't do defrag, or  
you have a nasty metadata update in some pathological cases. There is  
no magic wand there, no shortcut.

The good thing is that those pathological cases are irrelevant,  
because, if there were too many of them, btrfs wouldn't be able to  
function at all.

I mean, any file write operation could also modify all three btrfs  
trees. So what? It just works.
Even more, if free space is fragmented, a file append can turn into a  
nasty update to the free-space tree. So, there you go, the same  
problem can be found in everyday operation of btrfs.

>> And the csum tree, it should be ordered similar to free space tree, right?
>
> They are all ordered by extent physical address (same physical blocks,
> same metadata item key).
>
>> > Because the file was written very slowly on a big filesystem, the extents
>> > are scattered pessimally all over the virtual address space, not packed
>> > close together.  If there are a few hundred extent allocations between
>> > each log extent, then they will all occupy separate metadata pages.
>>
>> Ok, now you are talking about your pathological case. Let's consider it.
>>
>> Note that there is very little that can be done in this case that you are
>> describing. In order to defrag such a file, either the defrag will take many
>> small steps and therefore it will be slow (because each step needs to
>> perform an update to the metadata), or the defrag can do it in one big step
>> and use a huge amount of RAM.
>>
>> So, the best thing to be done in this situation is to allow the user to
>> specify the amount of RAM that defrag is allowed to use, so that the user
>> decides which of the two (slow defrag or lots of RAM) he wants.
>>
>> There is no way around it. There is no better defrag than the one that has
>> ALL information at hand, that one will be the fastest and the best defrag.
>>
>> > When it is time to remove them, each of these pages must be updated.
>> > This can be hit in a number of places in btrfs, including overwrite
>> > and delete.
>> >
>> > There's also 60ish bytes per extent in any subvol trees the file
>> > actually appears in, but you do get locality in that one (the key is
>> > inode and offset, so nothing can get between them and space them apart).
>> > That's 12GB and change (you'll probably completely empty most of the
>> > updated subvol metadata pages, so we can expect maybe 5 pages to remain
>> > including root and interior nodes).  I haven't been unlucky enough to
>> > get a "natural" 12GB, but I got over 1GB a few times recently.
>>
>> The thing that I figured out (and I have already written it down in another
>> post) is that the defrag can CHOOSE AT WILL how large an update to metadata it
>> wants to perform (within the limit of available RAM). The defrag can select,
>> by itself, the most efficient way to proceed while still honoring the
>> user-supplied limit on RAM.
>
> Yeah, it can update half the reflinks and pause for a commit, or similar.
> If there's a power failure then there will be a duplicate extent with some
> of the references to one copy and some to the other, but this is probably
> rare enough not to matter.

Exactly! That's the same thing I was thinking.

>> > Reflinks can be used to multiply that 12GB arbitrarily--you only get
>> > locality if the reflinks are consecutive in (inode, offset) space,
>> > so if the reflinks are scattered across subvols or files, they won't
>> > share pages.
>>
>> OK.
>>
>> Yes, given a sufficiently pathological case, the defrag will take forever.
>> There is nothing unexpected there. I agree on that point. The defrag always
>> functions within certain prerequisites.
>>
>> > > The trick, in this case, is to select one part of "ultralog" which is
>> > > localized in the metadata, and defragment it. Repeating this step will
>> > > ultimately defragment the entire file.
>> > >
>> > > So, the defrag selects some part of metadata which is entirely  
>> a descendant
>> > > of some b-tree node not far from the bottom of b-tree. It  
>> selects it such
>> > > that the required update to the metadata is less than, let's  
>> say, 64 MB, and
>> > > simultaneously the affected "ultralog" file fragments total  
>> less than 512 MB
>> > > (therefore, less than 128 thousand metadata leaf entries, each  
>> pointing to a
>> > > 4 KB fragment). Then it finds all the file extents pointed to  
>> by that part
>> > > of metadata. They are consecutive (as file fragments), because we have
>> > > selected such part of metadata. Now the defrag can safely  
>> copy-move those
>> > > fragments to a new area and update the metadata.
>> > >
>> > > In order to quickly select that small part of metadata, the  
>> defrag needs a
>> > > metadata cache that can hold somewhat more than 128 thousand localized
>> > > metadata leaf entries. That fits into 128 MB RAM definitely.
>> > >
>> > > Of course, there are many other small issues there, but this  
>> outlines the
>> > > general procedure.
>> > >
>> > > Problem solved?
>>
>> > Problem missed completely.  The forward reference updates were the only
>> > easy part.
>>
>> Oh, I'll reply in another mail, this one is getting too tiring.
>>
>>




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  4:28                                                 ` Zygo Blaxell
@ 2019-09-15 18:05                                                   ` General Zed
  2019-09-16 23:05                                                     ` Zygo Blaxell
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-15 18:05 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Fri, Sep 13, 2019 at 09:28:49PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Fri, Sep 13, 2019 at 05:25:20AM -0400, General Zed wrote:
>> > >
>> > > Quoting General Zed <general-zed@zedlx.com>:
>> > >
>> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > > >
>> > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > > > > >
>> > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > > > > >
>> > > > > > > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>> > > > > > > >
>> > > > > > > > At worst, it just has to completely write-out "all
>> > > > > > > metadata", all the way up
>> > > > > > > > to the super. It needs to be done just once, because
>> > > what's the point of
>> > > > > > > > writing it 10 times over? Then, the super is updated as
>> > > > > > > the final commit.
>> > > > > > >
>> > > > > > > This is kind of a silly discussion.  The biggest extent  
>> possible on
>> > > > > > > btrfs is 128MB, and the incremental gains of forcing 128MB
>> > > extents to
>> > > > > > > be consecutive are negligible.  If you're defragging a 10GB
>> > > file, you're
>> > > > > > > just going to end up doing 80 separate defrag operations.
>> > > > > >
>> > > > > > Ok, then the max extent is 128 MB, that's fine. Someone here
>> > > > > > previously said
>> > > > > > that it is 2 GB, so he has disinformed me (in order to further
>> > > his false
>> > > > > > argument).
>> > > > >
>> > > > > If the 128MB limit is removed, you then hit the block group  
>> size limit,
>> > > > > which is some number of GB from 1 to 10 depending on number of disks
>> > > > > available and raid profile selection (the striping raid profiles cap
>> > > > > block group sizes at 10 disks, and single/raid1 profiles  
>> always use 1GB
>> > > > > block groups regardless of disk count).  So 2GB is _also_ a  
>> valid extent
>> > > > > size limit, just not the first limit that is relevant for defrag.
>> > > > >
>> > > > > A lot of people get confused by 'filefrag -v' output, which  
>> coalesces
>> > > > > physically adjacent but distinct extents.  So if you use that tool,
>> > > > > it can _seem_ like there is a 2.5GB extent in a file, but  
>> it is really
>> > > > > 20 distinct 128MB extents that start and end at adjacent addresses.
>> > > > > You can see the true structure in 'btrfs ins dump-tree' output.
>> > > > >
>> > > > > That also brings up another reason why 10GB defrags are  
>> absurd on btrfs:
>> > > > > extent addresses are virtual.  There's no guarantee that a pair
>> > > of extents
>> > > > > that meet at a block group boundary are physically  
>> adjacent, and after
>> > > > > operations like RAID array reorganization or free space  
>> defragmentation,
>> > > > > they are typically quite far apart physically.
>> > > > >
>> > > > > > I didn't ever said that I would force extents larger than 128 MB.
>> > > > > >
>> > > > > > If you are defragging a 10 GB file, you'll likely have to do it
>> > > > > > in 10 steps,
>> > > > > > because the defrag is usually allowed to only use a limited
>> > > amount of disk
>> > > > > > space while in operation. That has nothing to do with the  
>> extent size.
>> > > > >
>> > > > > Defrag is literally manipulating the extent size.   
>> Fragments and extents
>> > > > > are the same thing in btrfs.
>> > > > >
>> > > > > Currently a 10GB defragment will work in 80 steps, but doesn't
>> > > necessarily
>> > > > > commit metadata updates after each step, so more than 128MB  
>> of temporary
>> > > > > space may be used (especially if your disks are fast and empty,
>> > > > > and you start just after the end of the previous commit interval).
>> > > > > There are some opportunities to coalsce metadata updates,  
>> occupying up
>> > > > > to a (arbitrary) limit of 512MB of RAM (or when memory  
>> pressure forces
>> > > > > a flush, whichever comes first), but exploiting those opportunities
>> > > > > requires more space for uncommitted data.
>> > > > >
>> > > > > If the filesystem starts to get low on space during a defrag, it can
>> > > > > inject commits to force metadata updates to happen more often, which
>> > > > > reduces the amount of temporary space needed (we can't delete
>> > > the original
>> > > > > fragmented extents until their replacement extent is committed);
>> > > however,
>> > > > > if the filesystem is so low on space that you're worried  
>> about running
>> > > > > out during a defrag, then you probably don't have big  
>> enough contiguous
>> > > > > free areas to relocate data into anyway, i.e. the defrag is just
>> > > going to
>> > > > > push data from one fragmented location to a different fragmented
>> > > location,
>> > > > > or bail out with "sorry, can't defrag that."
>> > > >
>> > > > Nope.
>> > > >
>> > > > Each defrag "cycle" consists of two parts:
>> > > >      1) move-out part
>> > > >      2) move-in part
>> > > >
>> > > > The move-out part select one contiguous area of the disk. Almost any
>> > > > area will do, but some smart choices are better. It then moves-out all
>> > > > data from that contiguous area into whatever holes there are  
>> left empty
>> > > > on the disk. The biggest problem is actually updating the metadata,
>> > > > since the updates are not localized.
>> > > > Anyway, this part can even be skipped.
>> > > >
>> > > > The move-in part now populates the completely free contiguous  
>> area with
>> > > > defragmented data.
>> > > >
>> > > > In the case that the move-out part needs to be skipped because the
>> > > > defrag estimates that the update to metatada will be too big (like in
>> > > > the pathological case of a disk with 156 GB of metadata), it can
>> > > > sucessfully defrag by performing only the move-in part. In that case,
>> > > > the move-in area is not free of data and "defragmented" data won't be
>> > > > fully defragmented. Also, there should be at least 20% free disk space
>> > > > in this case in order to avoid defrag turning pathological.
>> > > >
>> > > > But, these are all some pathological cases. They should be  
>> considered in
>> > > > some other discussion.
>> > >
>> > > I know how to do this pathological case. Figured it out!
>> > >
>> > > Yeah, always ask General Zed, he knows the best!!!
>> > >
>> > > The move-in phase is not a problem, because this phase  
>> generally affects a
>> > > low number of files.
>> > >
>> > > So, let's consider the move-out phase. The main concern here is that the
>> > > move-out area may contain so many different files and fragments that the
>> > > move-out forces a practically undoable metadata update.
>> > >
>> > > So, the way to do it is to select files for move-out, one by  
>> one (or even
>> > > more granular, by fragments of files), while keeping track of  
>> the size of
>> > > the necessary metadata update. When the metadata update exceeds  
>> a certain
>> > > amount (let's say 128 MB, an amount that can easily fit into RAM), the
>> > > move-out is performed with only currently selected files (file  
>> fragments).
>> > > (The move-out often doesn't affect a whole file since only a  
>> part of each
>> > > file lies within the move-out area).
>> >
>> > This move-out phase sounds like a reinvention of btrfs balance.  Balance
>> > already does something similar, and python-btrfs gives you a script to
>> > target block groups with high free space fragmentation for balancing.
>> > It moves extents (and their references) away from their block group.
>> > You get GB-sized (or multi-GB-sized) contiguous free space areas into
>> > which you can then allocate big extents.
>>
>> Perhaps btrfs balance needs to perform something similar, but I can assure
>> you that a balance cannot replace the defrag.
>
> Correct, balance is only half of the solution.
>
> The balance is required for two things on btrfs:  "move-out" phase of
> free space defragmentation, and to ensure at least one unallocated block
> group exists on the filesystem in case metadata expansion is required.
>
> A btrfs can operate without defrag for...well, forever, defrag is not
> necessary at all.  I have dozens of multi-year-old btrfs filesystems of
> assorted sizes that have never run defrag even once.
>
> By contrast, running out of unallocated space is a significant problem
> that should be corrected with the same urgency as RAID entering degraded
> mode.  I generally recommend running 'btrfs balance start -dlimit=1' about
> once per day to force one block group to always be empty.
>
> Filesystems that don't maintain unallocated space can run into problems
> if metadata runs out of space.  These problems can be inconvenient to
> recover from.
>
>> The point and the purpose of "move out" is to create a clean contiguous free
>> space area, so that defragmented files can be written into it.
>
>>
>> > > Now the defrag has to decide: whether to continue with another  
>> round of the
>> > > move-out to get a cleaner move-in area (by repeating the same procedure
>> > > above), or should it continue with a move-in into a partialy  
>> dirty area. I
>> > > can't tell you what's better right now, as this can be  
>> determined only by
>> > > experiments.
>> > >
>> > > Lastly, the move-in phase is performed (can be done whether the  
>> move-in area
>> > > is dirty or completely clean). Again, the same trick can be  
>> used: files can
>> > > be selected one by one until the calculated metadata update  
>> exceeds 128 MB.
>> > > However, it is more likely that the size of move-in area will  
>> be exhausted
>> > > before this happens.
>> > >
>> > > This algorithm will work even if you have only 3% free disk space left.
>> >
>> > I was thinking more like "you have less than 1GB free on a 1TB filesystem
>> > and you want to defrag 128MB things", i.e. <0.1% free space.  If you don't
>> > have all the metadata block group free space you need allocated already
>> > by that point, you can run out of metadata space and the filesystem goes
>> > read-only.  Happens quite often to people.  They don't like it very much.
>>
>> The defrag should abort whenever it detects such adverse conditions as 0.1%
>> free disk space. In fact, it should probably abort as soon as it detects
>> less than 3% free disk space. This is normal and expected. If the user has a
>> partition with less than 3% free disk space, he/she should not defrag it
>> until he/she frees some space, perhaps by deleting unnecessary data or by
>> moving out some data to other partitions.
>
> 3% of 45TB is 1.35TB...seems a little harsh.  Recall no extent can be
> larger than 128MB, so we're talking about enough space for ten thousand
> of defrag's worst-case output extents.  A limit based on absolute numbers
> might make more sense, though the only way to really know what the limit is
> on any given filesystem is to try to reach it.

Nah.

The free space minimum limit must, unfortunately, be a percentage
rather than an absolute number. There is no better way. The problem is
that, in order for defrag to work, it has to (partially) consolidate
some of the free space, in order to produce a contiguous free area
which will be the destination for defrag data.

In order to be able to produce this contiguous free space area, it is
of utmost importance that there is sufficient free space left on the
partition. Otherwise, this free space consolidation operation will
take too much time (too much disk I/O). There is no good way around it
in the common cases of free space fragmentation.

If you reduce the free space minimum limit below 3%, you are likely to  
spend 2x more I/O in consolidating free space than what is needed to  
actually defrag the data. I mean, the defrag will still work, but I  
think that the slowdown is unacceptable.

I mean, the user should just free some space! A filesystem should not
be left with less than 10% free space; that's simply bad management on
the user's part, and the user should accept the consequences.
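
To make the limit concrete, here is a minimal sketch in Python of the
kind of guard I have in mind. The byte counts are assumed inputs (from
whatever statvfs-like query the tool uses), and the thresholds are just
my proposed defaults, not anything that exists today:

    # Minimal sketch of a free-space guard for defrag. The byte counts are
    # assumed inputs (e.g. from a statvfs-like query); thresholds are proposals.

    MIN_FREE_FRACTION = 0.03   # abort below 3% free space
    WARN_FREE_FRACTION = 0.10  # warn the user below 10% free space

    def check_free_space(total_bytes: int, free_bytes: int) -> str:
        """Decide whether defrag should proceed on this filesystem."""
        fraction = free_bytes / total_bytes
        if fraction < MIN_FREE_FRACTION:
            return "abort"   # free space consolidation would cost too much I/O
        if fraction < WARN_FREE_FRACTION:
            return "warn"    # proceed, but expect extra consolidation work
        return "ok"

    # Example: a 45 TB filesystem with 1 TB free is below the 3% limit.
    print(check_free_space(45 * 10**12, 1 * 10**12))   # -> "abort"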



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-14  4:12                           ` Zygo Blaxell
@ 2019-09-16 11:42                             ` General Zed
  2019-09-17  0:49                               ` Zygo Blaxell
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-16 11:42 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Thu, Sep 12, 2019 at 05:23:21PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Wed, Sep 11, 2019 at 07:21:31PM -0400, webmaster@zedlx.com wrote:
>> > >
>> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> [...etc...]
>> > > > On Wed, Sep 11, 2019 at 01:20:53PM -0400, webmaster@zedlx.com wrote:
>> > It's the default for GNU coreutils, and for 'mv' across subvols there
>> > is currently no option to turn reflink copies off.  Maybe for 'cp'
>> > you still have to explicitly request reflink, but that will presumably
>> > change at some point as more filesystems get the CLONE_RANGE ioctl and
>> > more users expect it to just work by default.
>>
>> Yes, thank you for posting another batch of arguments that support the use
>> of my vision of defrag instead of the current one.
>>
>> The defrag that I'm proposing will preserve all those reflinks that were
>> painstakingly created by the user. Therefore, I take that you agree with me
>> on the utmost importance of implementing this new defrag that I'm proposing.
>
> I do not agree that improving the current defrag is of utmost importance,
> or indeed of any importance whatsoever.  The current defrag API is a
> clumsy, unscalable hack that cannot play well with other filesystem layout
> optimization tools no matter what you do to its internal implementation
> details.  It's better to start over with a better design, and spend only
> the minimal amount of effort required to keep the old one building until
> its replacement(s) is (are) proven in use and ready for deployment.
>
> I'm adding extent-merging support to an existing tool that already
> performs several other filesystem layout optimizations.  The goal is to
> detect degenerate extent layout on filesystems as it appears, and repair
> it before it becomes a more severe performance problem, without wasting
> resources on parts of the filesystem that do not require intervention.

Oh, I get it. Since the current defrag isn't particularly good, you
are going to produce a solution which mitigates the fragmentation
problem in some cases (but not all of them). Well, that's a good quick
fix, but not a true solution.

> Your defrag ideas are interesting, but you should spend a lot more
> time learning the btrfs fundamentals before continuing.  Right now
> you do not understand what btrfs is capable of doing easily, and what
> requires such significant rework in btrfs to implement that the result
> cannot be considered the same filesystem.  This is impairing the quality
> of your design proposals and reducing the value of your contribution
> significantly.

Ok, that was a shot at me; and I admit, guilty as charged. I barely  
have a clue about btrfs.
Now it's my turn to shoot. Apparently, the people who are implementing
the btrfs defrag, or at least the ones that responded to my post, seem
to have no clue about how on-demand defrag solutions typically work. I
had to explain the usual tricks involved in defragmentation, and it
was like talking to complete rookies. None of you even considered a
full-featured defrag solution; all you are doing is partial solutions.

And you all got lost in implementation details. How many times have I
been told here that some operation cannot be performed, only for the
opposite to turn out to be true? You have all sunk into some strange
state of mind where every possible excuse is being made in order not
to start working on a better, holistic defrag solution.

And you even misunderstood me when I said "holistic defrag": you
thought I was talking about a full defrag. No. A full defrag is a
defrag performed on all the data. A holistic defrag can be performed
on only some of the data, but it is holistic in the sense that it uses
complete information about the filesystem, not just a partial view of
it. A holistic defrag is better than a partial defrag: it is faster,
produces better results, and can defrag a wider spectrum of cases.
Why? Because a holistic defrag takes everything into account.

So I think you should all inform yourselves a little better about the
various defrag algorithms and solutions that exist. Apparently, you
have all lost sight of the big picture. You can't see the wood for the
trees.

>> I suggest that btrfs should first try to determine whether it can split an
>> extent in-place, or not. If it can't do that, then it should create new
>> extents to split the old one.
>
> btrfs cannot split extents in place, so it must always create new
> extents by copying data blocks.  It's a hugely annoying and non-trivial
> limitation that makes me consider starting over with some other filesystem
> quite often.

Actually, this has no repercussions for the defrag. The defrag will
always copy the data to a new place. So, if btrfs can't split
in-place, that is just fine.

> If you are looking for important btrfs work, consider solving that
> problem first.  It would dramatically improve GC (in the sense that
> it would eliminate the need to perform a separate GC step at all) and
> dedupe performance on btrfs as well as help defrag and other extent
> layout optimizers.

There is no problem there.

>> Therefore, the defrag can free unused parts of any extent, and then the
>> extent can be split is necessary. In fact, both these operations can be done
>> simultaneously.
>
> Sure, but I only call one of these operations "defrag" (the extent merge
> operation).  The other operations increase the total number of fragments
> in the filesystem, so "defrag" is not an appropriate name for them.
> An appropriate name would be something like "enfrag" or "refrag" or
> "split".  In some cases the "defrag" can be performed by doing a "dedupe"
> operation with a single unfragmented identical source extent replacing
> several fragmented destination extents...what do you call that?

Well, no. Perhaps the word "defrag" can have a wider and a narrower
sense. In the narrower sense, "defrag" means what you just wrote. In
that sense, the word "defrag" means practically the same as "merge",
so why not just use the word "merge" to remove any ambiguity. The
"merge" is the only operation that decreases the number of fragments
(besides "delete"). Perhaps you meant move&merge. But, commonly, the
word "defrag" is used in the wider sense, which is not the one you
described.

In a wider sense, the defrag involves the preparation, analysis, free  
space consolidation, multiple phases, splitting and merging, and final  
passes.

Try looking on Wikipedia for "defrag".
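
To show what I mean by the wider sense, here is a toy sketch of the
free space consolidation (move-out) phase followed by moving a file
into the cleared area (move-in). The "disk" is just a Python list of
slots; every name here is illustrative, not btrfs code:

    # Toy sketch of the wider-sense defrag cycle: a "disk" is a list of slots,
    # each holding a file id or None for free space. Illustrative only.

    def consolidate_free_space(disk, start, length):
        """Move-out phase: empty the slots [start, start+length) by pushing
        their contents into free slots elsewhere on the disk."""
        for i in range(start, start + length):
            if disk[i] is not None:
                j = next(k for k, v in enumerate(disk)
                         if v is None and not (start <= k < start + length))
                disk[j], disk[i] = disk[i], None

    def move_in(disk, start, file_id):
        """Move-in phase: gather all fragments of file_id into the cleared area."""
        fragments = [i for i, v in enumerate(disk) if v == file_id]
        for offset, i in enumerate(fragments):
            disk[i] = None
            disk[start + offset] = file_id

    disk = ['A', None, 'B', 'A', None, 'B', 'A', None, None, None]
    consolidate_free_space(disk, 0, 3)   # clear a contiguous destination area
    move_in(disk, 0, 'A')                # 'A' is now contiguous at the front
    print(disk)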

>> > Dedupe on btrfs also requires the ability to split and merge extents;
>> > otherwise, we can't dedupe an extent that contains a combination of
>> > unique and duplicate data.  If we try to just move references around
>> > without splitting extents into all-duplicate and all-unique extents,
>> > the duplicate blocks become unreachable, but are not deallocated.  If we
>> > only split extents, fragmentation overhead gets bad.  Before creating
>> > thousands of references to an extent, it is worthwhile to merge it with
>> > as many of its neighbors as possible, ideally by picking the biggest
>> > existing garbage-free extents available so we don't have to do defrag.
>> > As we examine each extent in the filesystem, it may be best to send
>> > to defrag, dedupe, or garbage collection--sometimes more than one of
>> > those.
>>
>> This is sovled simply by always running defrag before dedupe.
>
> Defrag and dedupe in separate passes is nonsense on btrfs.

Defrag can be run without dedupe.

Now, how to organize dedupe? I haven't thought about it yet. I'll
leave it to you, but it seems to me that defrag should be involved
there. And my defrag solution would help there very, very much.

> Defrag burns a lot of iops on defrag moving extent data around to create
> new size-driven extent boundaries.  These will have to be immediately
> moved again by dedupe (except in special cases like full-file matches),
> because dedupe needs to create content-driven extent boundaries to work
> on btrfs.

Defrag can be run without dedupe.

Dedupe probably requires some kind of defrag to produce a good result   
(a result without heavy fragmentation).

> Extent splitting in-place is not possible on btrfs, so extent boundary
> changes necessarily involve data copies.  Reference counting is done
> by extent in btrfs, so it is only possible to free complete extents.

Great, there is reference counting in btrfs. That helps. Good design.

> You have to replace the whole extent with references to data from
> somewhere else, creating data copies as required to do so where no
> duplicate copy of the data is available for reflink.
>
> Note the phrase "on btrfs" appears often here...other filesystems manage
> to solve these problems without special effort.  Again, if you're looking
> for important btrfs things to work on, maybe start with in-place extent
> splitting.

I think that I'll start with a "software design document for on-demand
defrag which preserves the sharing structure". I have figured out that
you don't have one yet. And how can you even start working on a defrag
without a software design document?

So I volunteer to write it. Apparently, I'm already halfway done.

> On XFS you can split extents in place and reference counting is by
> block, so you can do alternating defrag and dedupe passes.  It's still
> suboptimal (you still waste iops to defrag data blocks that are
> immediately eliminated by the following dedupe), but it's orders of
> magnitude better than btrfs.

I'll reply to the rest of this marathon of a post in another reply
(when I find the time to read it), because right now I'm writing the
software design document.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-15 17:54                                                 ` General Zed
@ 2019-09-16 22:51                                                   ` Zygo Blaxell
  2019-09-17  1:03                                                     ` General Zed
  2019-09-17  3:10                                                     ` General Zed
  0 siblings, 2 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-16 22:51 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Btrfs BTRFS

On Sun, Sep 15, 2019 at 01:54:07PM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> 
> > On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
> > > 
> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > 
> > > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
> > > > >
> > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > >
> > > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > > > >
> > > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > > > >
> > > > > > > > Don't forget you have to write new checksum and free space
> > > tree pages.
> > > > > > > > In the worst case, you'll need about 1GB of new metadata pages
> > > > > for each
> > > > > > > > 128MB you defrag (though you get to delete 99.5% of them
> > > immediately
> > > > > > > > after).
> > > > > > >
> > > > > > > Yes, here we are debating some worst-case scenaraio which is
> > > actually
> > > > > > > imposible in practice due to various reasons.
> > > > > >
> > > > > > No, it's quite possible.  A log file written slowly on an active
> > > > > > filesystem above a few TB will do that accidentally.  Every
> > > now and then
> > > > > > I hit that case.  It can take several hours to do a logrotate
> > > on spinning
> > > > > > arrays because of all the metadata fetches and updates associated with
> > > > > > worst-case file delete.  Long enough to watch the delete happen, and
> > > > > > even follow along in the source code.
> > > > > >
> > > > > > I guess if I did a proactive defrag every few hours, it might
> > > take less
> > > > > > time to do the logrotate, but that would mean spreading out all the
> > > > > > seeky IO load during the day instead of getting it all done at night.
> > > > > > Logrotate does the same job as defrag in this case (replacing
> > > a file in
> > > > > > thousands of fragments spread across the disk with a few large
> > > fragments
> > > > > > close together), except logrotate gets better compression.
> > > > > >
> > > > > > To be more accurate, the example I gave above is the worst case you
> > > > > > can expect from normal user workloads.  If I throw in some reflinks
> > > > > > and snapshots, I can make it arbitrarily worse, until the entire disk
> > > > > > is consumed by the metadata update of a single extent defrag.
> > > > > >
> > > > >
> > > > > I can't believe I am considering this case.
> > > > >
> > > > > So, we have a 1TB log file "ultralog" split into 256 million 4
> > > KB extents
> > > > > randomly over the entire disk. We have 512 GB free RAM and 2% free disk
> > > > > space. The file needs to be defragmented.
> > > > >
> > > > > In order to do that, defrag needs to be able to copy-move
> > > multiple extents
> > > > > in one batch, and update the metadata.
> > > > >
> > > > > The metadata has a total of at least 256 million entries, each
> > > of some size,
> > > > > but each one should hold at least a pointer to the extent (8
> > > bytes) and a
> > > > > checksum (8 bytes): In reality, it could be that there is a lot of other
> > > > > data there per entry.
> > > >
> > > > It's about 48KB per 4K extent, plus a few hundred bytes on average
> > > for each
> > > > reference.
> > > 
> > > Sorry, could you be more clear there? An file fragment/extent that holds
> > > file data can be any
> > > size up to 128 MB. What metadata is there per every file fragment/extent?
> > > 
> > > Because "48 KB per 4 K extent" ... cannot decode what you mean.
> > 
> > An extent has 3 associated records in btrfs, not including its references.
> > The first two exist while the extent exists, the third appears after it
> > is removed.
> > 
> > 	- extent tree:  location, size of extent, pointers to backref trees.
> > 	Length is around 60 bytes plus the size of the backref pointer list.
> 
> Wait.. and where are the reflinks? Backrefs are there for going up the tree,
> but where are reflinks for going down the tree?

Reflinks are the forward references--there is no other kind of forward
reference in btrfs (contrast with other filesystems which use one data
structure for single references and another for multiple references).

There are two distinct objects with similar names:  extent data items,
and extent ref items.

A file consists of an inode item followed by extent ref items (aka
reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
pages can be shared with other subvol trees to make snapshots.

Extent data items are stored in a single tree (with other trees using
the same keys) that just lists which parts of the filesystem are occupied,
how long they are, and what data/metadata they contain.  Each extent
item contains a list of references to one of four kinds of object that
refers to the extent item (aka backrefs).  The free space tree is the
inverse of the extent data tree.

Each extent ref item is a reference to an extent data item, but it
also contains all the information required to access the data.  For
normal read operations the extent data tree can be ignored (though
you still need to do a lookup in the csum tree to verify csums).
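
Very roughly, in pseudo-structure form (these are simplified Python
stand-ins to show the relationship, not the actual on-disk item
layouts):

    # Simplified stand-ins for the two kinds of item; not real btrfs structures.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class ExtentDataItem:            # lives in the extent tree, keyed by vaddr
        vaddr: int                   # virtual address of the extent
        length: int                  # size in bytes
        backrefs: List[Tuple[int, int]] = field(default_factory=list)
                                     # e.g. (root id, metadata block) pairs

    @dataclass
    class ExtentRefItem:             # lives in a subvol tree, keyed by (inode, offset)
        inode: int
        offset: int                  # logical offset within the file
        extent_vaddr: int            # where the data actually is
        extent_offset: int           # start of the referenced range in the extent
        num_bytes: int               # length of the referenced range

    # A normal read only needs the ref item (plus a csum tree lookup to verify
    # the data); the extent tree is consulted for allocation, balance, etc.
    ref = ExtentRefItem(inode=257, offset=0, extent_vaddr=1 << 30,
                        extent_offset=0, num_bytes=128 * 1024)
    data = ExtentDataItem(vaddr=1 << 30, length=128 * 1024, backrefs=[(5, 123456)])
    print(ref.extent_vaddr == data.vaddr)   # True: the reflink points at the extent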

> So, you are saying that backrefs are already in the extent tree (or
> reachable from it). I didn't know that, that information makes my defrag
> much simpler to implement and describe. Someone in this thread has
> previously mislead me to believe that backref information is not easily
> available.

The backref isn't a precise location--it just tells you which metadata
blocks are holding at least one reference to the extent.  Some CPU
and linear searching has to be done to resolve that fully to an (inode,
offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
go faster, because you don't need to update the extent tree again when
you do some operations on the forward ref side, even though they add or
remove references.  e.g. creating a snapshot does not change the backrefs
list on individual extents--it creates two roots sharing a subset of the
subvol trees' branches.
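
In other words, resolution looks something like this sketch, with toy
tuples standing in for the real items:

    # Sketch of backref resolution: the backref names a metadata block, and a
    # linear scan of that block's items finds the (inode, offset) pairs that
    # actually reference the extent. Data layout here is illustrative only.

    def resolve_backref(metadata_block, extent_vaddr):
        """metadata_block: list of (inode, offset, extent_vaddr) ref items."""
        return [(inode, offset)
                for (inode, offset, vaddr) in metadata_block
                if vaddr == extent_vaddr]

    block = [
        (257, 0,      0x40000000),   # inode 257, file offset 0
        (257, 131072, 0x40020000),
        (258, 0,      0x40000000),   # a reflink copy in another file
    ]
    print(resolve_backref(block, 0x40000000))  # -> [(257, 0), (258, 0)]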

> > 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
> > 	Length of item is number of extent data blocks * 4 bytes plus a
> > 	168-bit header (ish...csums from adjacent extents may be packed
> > 	using a shared header)
> > 
> > 	- free space tree:  location, size of free space.  This appears
> > 	when the extent is deleted.  It may be merged with adjacent
> > 	records.  Length is maybe 20 bytes?
> > 
> > Each page contains a few hundred items, so if there are a few hundred
> > unrelated extents between extents in the log file, each log file extent
> > gets its own metadata page in each tree.
> 
> As far as I can understand it, the extents in the extent tree are indexed
> (keyed) by inode&offset. Therefore, no matter how many unrelated extents
> there are between (physical locations of data) extents in the log file, the
> log file extent tree entries will (generally speaking) be localized, because
> multiple extent entries (extent items) are bunched tohgether in one 16 KB
> metadata extent node.

No, extents in the extent tree are indexed by virtual address (roughly the
same as physical address over small scales, let's leave the device tree
out of it for now).  The subvol trees are organized the way you are
thinking of.

> > > Another question is: what is the average size of metadata extents?
> > 
> > Metadata extents are all 16K.
> > 
> > > > > The metadata is organized as a b-tree. Therefore, nearby nodes should
> > > > > contain data of consecutive file extents.
> > > >
> > > > It's 48KB per item.
> > > 
> > > What's the "item"?
> > 
> > Items are the objects stored in the trees.  So one extent item, one csum
> > item, and one free space tree item, all tied to the 4K extent from the
> > log file.
> > 
> > > > As you remove the original data extents, you will
> > > > be touching a 16KB page in three trees for each extent that is removed:
> > > > Free space tree, csum tree, and extent tree.  This happens after the
> > > > merged extent is created.  It is part of the cleanup operation that
> > > > gets rid of the original 4K extents.
> > > 
> > > Ok, but how big are free space tree and csum tree?
> > 
> > At least 12GB in the worst-case example.
> 
> The "worst case example" is when all file data extents are 4 KB in size,
> with a 4 KB hole between each two extents. Such example doesn't need to be
> considered because it is irrelevant.

This is wrong, but it is consistent with the misunderstanding above.

> But, if you were to consider it, you would quickly figure out that a good
> defrag solution would succeed in defragging this abomination of a
> filesystem, it is just that it would take some extra time to do it.
> 
> > > Also, when moving a file to defragment it, there should still be some
> > > locality even in free space tree.
> > 
> > It is only guaranteed in the defrag result, because the defrag result is
> > the only thing that is necessarily physically contiguous.
> 
> In order not to lose sight:
> This argument is related to "how big metadata update there needs to be" for
> some pathological cases. The worry is that the metadata update is going to
> be larger than the file data update. The word "update" here does NOT refer
> to in-place operations.
> 
> Any my answer is: we digressed so badly that we lost sight of the above
> original question.
> 
> And the answer to the original question is: there is no problem. Why?
> Because there is no better solution. Either you don't do defrag, or you have
> a nasty metadata update in some pathological cases. There is no magic wand
> there, no shortcut.
> 
> The good thing is that those pathological cases are irrelevent, because, if
> there was too much of them, btrfs wouldn't be able to function at all.

Yes.  There is a rapidly diminishing returns curve, where most of the
filesystem cannot be made more than a few percent more efficient; however,
there's a few percent of a typical filesystem that ends up being orders
of magnitude worse, to the point where it causes noticeable problems
at scale.  Find those parts of the filesystem and apply a quantitatively
justified remediation.

> I mean, any file write operation could also modify all three btrfs trees. So
> what? It just works.
> Even more, if free space is fragmented, a file append can turn into a nasty
> update to the free-space tree. So, there you go, the same problem can be
> found in everyday operation of btrfs.

File append just uses the next sufficiently-sized entry in the free
space tree.  It only goes pathological when it's time to insert a lot
of non-consecutive nodes at once (e.g. when deleting a big file in many
small pieces).
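
Roughly a first-fit walk over the free space entries, sketched with
made-up data structures:

    # Sketch of the append case: take the next free space entry that is big
    # enough, shrink it, and return the allocated range. Toy structures only.

    def allocate_append(free_space, need):
        """free_space: list of [start, length] entries sorted by start."""
        for entry in free_space:
            start, length = entry
            if length >= need:
                entry[0] += need          # shrink the entry in place
                entry[1] -= need
                return (start, need)
        raise RuntimeError("no sufficiently-sized free space entry")

    free_space = [[1000, 64], [5000, 4096], [9000, 1024]]
    print(allocate_append(free_space, 512))   # -> (5000, 512)
    print(free_space)                         # middle entry shrunk to [5512, 3584]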

> > > And the csum tree, it should be ordered similar to free space tree, right?
> > 
> > They are all ordered by extent physical address (same physical blocks,
> > same metadata item key).
> > 
> > > > Because the file was written very slowly on a big filesystem, the extents
> > > > are scattered pessimally all over the virtual address space, not packed
> > > > close together.  If there are a few hundred extent allocations between
> > > > each log extent, then they will all occupy separate metadata pages.
> > > 
> > > Ok, now you are talking about your pathological case. Let's consider it.
> > > 
> > > Note that there is very little that can be in this case that you are
> > > describing. In order to defrag such a file, either the defrag will take many
> > > small steps and therefore it will be slow (because each step needs to
> > > perform an update to the metadata), or the defrag can do it in one big step
> > > and use a huge amount of RAM.
> > > 
> > > So, the best thing to be done in this situation is to allow the user to
> > > specify the amount of RAM that defrag is allowed to use, so that the user
> > > decides which of the two (slow defrag or lots of RAM) he wants.
> > > 
> > > There is no way around it. There is no better defrag than the one that has
> > > ALL information at hand, that one will be the fastest and the best defrag.
> > > 
> > > > When it is time to remove them, each of these pages must be updated.
> > > > This can be hit in a number of places in btrfs, including overwrite
> > > > and delete.
> > > >
> > > > There's also 60ish bytes per extent in any subvol trees the file
> > > > actually appears in, but you do get locality in that one (the key is
> > > > inode and offset, so nothing can get between them and space them apart).
> > > > That's 12GB and change (you'll probably completely empty most of the
> > > > updated subvol metadata pages, so we can expect maybe 5 pages to remain
> > > > including root and interior nodes).  I haven't been unlucky enough to
> > > > get a "natural" 12GB, but I got over 1GB a few times recently.
> > > 
> > > The thing that I figured out (and I have already written it down in another
> > > post) is that the defrag can CHOOSE AT WILL how large update to metadata it
> > > wants to perform (within the limit of available RAM). The defrag can select,
> > > by itself, the most efficient way to proceed while still honoring the
> > > user-supplied limit on RAM.
> > 
> > Yeah, it can update half the reflinks and pause for a commit, or similar.
> > If there's a power failure then there will be a duplicate extent with some
> > of the references to one copy and some to the other, but this is probably
> > rare enough not to matter.
> 
> Exactly! That's the same that I was thinking.
> 
> > > > Reflinks can be used to multiply that 12GB arbitrarily--you only get
> > > > locality if the reflinks are consecutive in (inode, offset) space,
> > > > so if the reflinks are scattered across subvols or files, they won't
> > > > share pages.
> > > 
> > > OK.
> > > 
> > > Yes, given a sufficiently pathological case, the defrag will take forever.
> > > There is nothing unexpected there. I agree on that point. The defrag always
> > > functions within certain prerequisites.
> > > 
> > > > > The trick, in this case, is to select one part of "ultralog" which is
> > > > > localized in the metadata, and defragment it. Repeating this step will
> > > > > ultimately defragment the entire file.
> > > > >
> > > > > So, the defrag selects some part of metadata which is entirely a
> > > descendant
> > > > > of some b-tree node not far from the bottom of b-tree. It
> > > selects it such
> > > > > that the required update to the metadata is less than, let's
> > > say, 64 MB, and
> > > > > simultaneously the affected "ultralog" file fragments total less
> > > han 512 MB
> > > > > (therefore, less than 128 thousand metadata leaf entries, each
> > > pointing to a
> > > > > 4 KB fragment). Then it finds all the file extents pointed to by
> > > that part
> > > > > of metadata. They are consecutive (as file fragments), because we have
> > > > > selected such part of metadata. Now the defrag can safely
> > > copy-move those
> > > > > fragments to a new area and update the metadata.
> > > > >
> > > > > In order to quickly select that small part of metadata, the
> > > defrag needs a
> > > > > metatdata cache that can hold somewhat more than 128 thousand localized
> > > > > metadata leaf entries. That fits into 128 MB RAM definitely.
> > > > >
> > > > > Of course, there are many other small issues there, but this
> > > outlines the
> > > > > general procedure.
> > > > >
> > > > > Problem solved?
> > > 
> > > > Problem missed completely.  The forward reference updates were the only
> > > > easy part.
> > > 
> > > Oh, I'll reply in another mail, this one is getting too tireing.
> > > 
> > > 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-15 18:05                                                   ` General Zed
@ 2019-09-16 23:05                                                     ` Zygo Blaxell
  0 siblings, 0 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-16 23:05 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Sun, Sep 15, 2019 at 02:05:47PM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > 3% of 45TB is 1.35TB...seems a little harsh.  Recall no extent can be
> > larger than 128MB, so we're talking about enough space for ten thousand
> > of defrag's worst-case output extents.  A limit based on absolute numbers
> > might make more sense, though the only way to really know what the limit is
> > on any given filesystem is to try to reach it.
> 
> Nah.
> 
> The free space minimum limit must, unfortunately, be based on absolute
> percentages. There is no better way. The problem is that, in order for
> defrag to work, it has to (partially) consolidate some of the free space, in
> order to produce a contiguous free area which will be the destination for
> defrag data.

One quirk of btrfs is that it has two levels of allocation:  it
divides disks into multi-GB block groups, then allocates extents in
the block groups.  Any unallocated space on the disks ("unallocated"
meaning "not allocated to a block group") is contiguous, so as long
as there is unallocated space, there are guaranteed to be contiguous
areas at least 8 times the maximum extent size to defrag into.  So 3%
free space on a big disk ("big" meaning "relative to the maximum extent
size") can mean a lot of contiguous space left, more than enough room
to defrag while moving each extent exactly once.

Not necessarily, of course:  if you fill all the way to 100%, there's no
unallocated space any more, and if you then delete 3% of it at random,
you have a severe fragmentation problem (97% of all the block groups are
occupied) and no space to fix it (no unallocated block groups available).
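
As a back-of-envelope illustration (made-up numbers, assuming the
single/raid1 case of 1 GiB block groups):

    # Back-of-envelope check: as long as some space is still unallocated (not
    # assigned to any block group), defrag has a contiguous region at least one
    # block group long to write into. Numbers are made up for illustration.

    GiB = 1024 ** 3
    MAX_EXTENT = 128 * 1024 * 1024        # 128 MB maximum data extent

    def contiguous_room(unallocated_bytes, block_group_size=1 * GiB):
        """Worst-case (128 MB) defrag output extents that fit in one fresh
        block group, assuming one can still be allocated from the
        unallocated pool."""
        if unallocated_bytes < block_group_size:
            return 0                      # no new block group: no guarantee at all
        return block_group_size // MAX_EXTENT

    # A "3% free" 45 TB filesystem with ~1.35 TB still unallocated:
    print(contiguous_room(int(1.35e12)))  # -> 8 extents per fresh 1 GiB block group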

> In order to be able to produce this contiguous free space area, it is of
> utmost importance that there is sufficient free space left on the partition.
> Otherwise, this free space consolidation operation will take too much time
> (too much disk I/O). There is no good way around it the common cases of free
> space fragmentation.
> 
> If you reduce the free space minimum limit below 3%, you are likely to spend
> 2x more I/O in consolidating free space than what is needed to actually
> defrag the data. I mean, the defrag will still work, but I think that the
> slowdown is unacceptable.
> 
> I mean, the user should just free some space! The filesystems should not be
> left with less than 10% free space, that's simply bad management from the
> user's part, and the user should accept the consequences.

Well, yes, the performance of the allocator drops exponentially once
you go past 90% usage of the allocated block groups (there's no
optimization like a free-space btree with lengths as keys).

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-16 11:42                             ` General Zed
@ 2019-09-17  0:49                               ` Zygo Blaxell
  2019-09-17  2:30                                 ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-17  0:49 UTC (permalink / raw)
  To: General Zed; +Cc: linux-btrfs

On Mon, Sep 16, 2019 at 07:42:51AM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > Your defrag ideas are interesting, but you should spend a lot more
> > time learning the btrfs fundamentals before continuing.  Right now
> > you do not understand what btrfs is capable of doing easily, and what
> > requires such significant rework in btrfs to implement that the result
> > cannot be considered the same filesystem.  This is impairing the quality
> > of your design proposals and reducing the value of your contribution
> > significantly.
> 
> Ok, that was a shot at me; and I admit, guilty as charged. I barely have a
> clue about btrfs.
> Now it's my turn to shoot. Apparently, the people which are implementing the
> btrfs defrag, or at least the ones that responded to my post, seem to have
> no clue about how on-demand defrag solutions typically work. I had to
> explain the usual tricks involved in the defragmentation, and it was like
> talking to complete rookies. None of you even considered a full-featured
> defrag solution, all that you are doing are some partial solutions.

Take a look at btrfs RAID5/6 some time, if you want to see rookie mistakes...

> And, you all got lost in implementation details. How many times have I been
> told here that some operation cannot be performed, and then it turned out
> the opposite. You have all sunk into some strange state of mind where every
> possible excuse is being made in order not to start working on a better,
> hollistic defrag solution.
> 
> And you even misunderstood me when I said "hollistic defrag", you thought I
> was talking about a full defrag. No. A full defrag is a defrag performed on
> all the data. A holistic defrag can be performed on only some data, but it
> is hollistic in the sense that it uses whole information about a filesystem,
> not just a partial view of it. A holistic defrag is better than a partial
> defrag: it is faster and produces better results, and it can defrag a wider
> spectrum of cases. Why? Because a holistic defrag takes everything into
> account.

What I'm looking for is a quantitative approach.  Sort the filesystem
regions by how bad they are (in terms of measurable negative outcomes
like poor read performance, pathological metadata updates, and future
allocation performance), then apply mitigation in increasing order of
cost-benefit ratio (or at least filter by cost-benefit ratio if you can't
sort without reading the whole filesystem) until a minimum threshold
is reached, then stop.  This lets the mitigation scale according to
the available maintenance window, i.e. if you have 5% of a day for
maintenance, you attack the worst 5% of the filesystem, then stop.
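
The shape of that loop, with scoring and mitigation left as
placeholders for whatever measurements turn out to be trustworthy:

    # Sketch of the quantitative approach: score regions, then mitigate the
    # worst ones until the maintenance budget runs out. Scoring and mitigation
    # are placeholders, not existing tools.

    def maintain(regions, budget, threshold):
        """regions: list of (badness_score, estimated_cost, fix_function)."""
        # Work on the worst regions first, cheapest fixes first among equals.
        for badness, cost, fix in sorted(regions, key=lambda r: (-r[0], r[1])):
            if badness < threshold or budget <= 0:
                break                      # diminishing returns: stop here
            if cost <= budget:
                fix()
                budget -= cost
        return budget

    done = []
    regions = [
        (9.0, 3, lambda: done.append("worst block group")),
        (2.0, 1, lambda: done.append("mildly fragmented file")),
        (0.5, 1, lambda: done.append("barely fragmented file")),
    ]
    print(maintain(regions, budget=4, threshold=1.0), done)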

In that respect I think we might be coming toward the same point, but
from different directions:  you seem to think the problem is easy to
solve at scale, and I think that's impossible so I start from designs
that make forward progress with a fixed allocation of resources.

> So I think you should all inform yourself a little better about various
> defrag algorithms and solutions that exist. Apparently, you all lost the
> sight of the big picture. You can't see the wood from the trees.

I can see the woods, but any solution that starts with "enumerate all
the trees" will be met with extreme skepticism, unless it can do that
enumeration incrementally.

> Well, no. Perhaps the word "defrag" can have a wider and narrower sense. So
> in a narrower sense, "defrag" means what you just wrote. In that sense, the
> word "defrag" means practically the same as "merge", so why not just use the
> word "merge" to remove any ambiguities. The "merge" is the only operation
> that decreases the number of fragments (besides "delete"). Perhaps you meant
> move&merge. But, commonly, the word "defrag" is used in a wider sense, which
> is not the one you described.

This is fairly common on btrfs:  the btrfs words don't mean the same as
other words, causing confusion.  How many copies are there in a btrfs
4-disk raid1 array?

> > > > Dedupe on btrfs also requires the ability to split and merge extents;
> > > > otherwise, we can't dedupe an extent that contains a combination of
> > > > unique and duplicate data.  If we try to just move references around
> > > > without splitting extents into all-duplicate and all-unique extents,
> > > > the duplicate blocks become unreachable, but are not deallocated.  If we
> > > > only split extents, fragmentation overhead gets bad.  Before creating
> > > > thousands of references to an extent, it is worthwhile to merge it with
> > > > as many of its neighbors as possible, ideally by picking the biggest
> > > > existing garbage-free extents available so we don't have to do defrag.
> > > > As we examine each extent in the filesystem, it may be best to send
> > > > to defrag, dedupe, or garbage collection--sometimes more than one of
> > > > those.
> > > 
> > > This is sovled simply by always running defrag before dedupe.
> > 
> > Defrag and dedupe in separate passes is nonsense on btrfs.
> 
> Defrag can be run without dedupe.

Yes, but if you're planning to run both on the same filesystem, they
had better be aware of each other.

> Now, how to organize dedupe? I didn't think about it yet. I'll leave it to
> you, but it seems to me that defrag should be involved there. And, my defrag
> solution would help there very, very much.

I can't see defrag in isolation as anything but counterproductive to
dedupe (and vice versa).

A critical feature of the dedupe is to do extent splits along duplicate
content boundaries, so that you're not limited to deduping only
whole-extent matches.  This is especially necessary on btrfs because
you can't split an extent in place--if you find a partial match,
you have to find a new home for the unique data, which means you
get a lot of little fragments that are inevitably distant from their
logically adjacent neighbors which themselves were recently replaced
with a physically distant identical extent.

Sometimes both copies of the data suck (both have many fragments
or uncollected garbage), and at that point you want to do some
preprocessing--copy the data to make the extent you want, then use
dedupe to replace both bad extents with your new good one.  That's an
opportunistic extent merge and it needs some defrag logic to do proper
cost estimation.

If you have to copy 64MB of unique data to dedupe a 512K match, the extent
split cost is far higher than if you have a 2MB extent with 512K match.
So there should be sysadmin-tunable parameters that specify how much
to spend on diminishing returns:  maybe you don't deduplicate anything
that saves less than 1% of the required copy bytes, because you have
lots of duplicates in the filesystem and you are willing to spend 1% of
your disk space to not be running dedupe all day.  Similarly you don't
defragment (or move for any reason) extents unless that move gives you
significantly better read performance or consolidates diffuse allocations
across metadata pages, because there are millions of extents to choose
from and it's not necessary to optimize them all.

On the other hand, if you find you _must_ move the 64MB of data for
other reasons (e.g. to consolidate free space) then you do want to do
the dedupe because it will make the extent move slightly faster (63.5MB
of data + reflink instead of 64MB copy).  So you definitely want one
program looking at both things.
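
The kind of filter I mean, as a sketch (the 1% knob and the field
names are illustrative, not an existing interface):

    # Sketch of the cost/benefit filter: skip a dedupe candidate when the bytes
    # saved are a tiny fraction of the bytes that must be copied to split the
    # extent. The 1% threshold is an illustrative sysadmin-tunable knob.

    def worth_deduping(extent_bytes, duplicate_bytes, min_ratio=0.01,
                       must_move_anyway=False):
        copy_bytes = extent_bytes - duplicate_bytes   # unique data needing a new home
        if must_move_anyway:
            return True          # the copy is already paid for; dedupe is nearly free
        if copy_bytes == 0:
            return True          # whole-extent match: no split cost at all
        return duplicate_bytes / copy_bytes >= min_ratio

    MB = 1024 * 1024
    print(worth_deduping(64 * MB, 512 * 1024))    # False: 512K saved, ~63.5MB copied
    print(worth_deduping(2 * MB, 512 * 1024))     # True:  512K saved, ~1.5MB copied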

Maybe there's a way to plug opportunistic dedupe into a defrag algorithm
the same way there's a way to plug opportunistic defrag into a dedupe
algorithm.  I don't know, I'm coming at this from the dedupe side.
If the solution looks anything like "run both separately" then I'm
not interested.

> > Extent splitting in-place is not possible on btrfs, so extent boundary
> > changes necessarily involve data copies.  Reference counting is done
> > by extent in btrfs, so it is only possible to free complete extents.
> 
> Great, there is reference counting in btrfs. That helps. Good design.

Well, I say "reference counting" because I'm simplifying for an audience
that does not yet all know the low-level details.  The counter, such as
it is, gives values "zero" or "more than zero."  You never know exactly
how many references there are without doing the work to enumerate them.
The "is extent unique" function in btrfs runs the enumeration loop until
the second reference is found or the supply of references is exhausted,
whichever comes first.  It's a tradeoff to make snapshots fast.

When a reference is created to a new extent, it refers to the entire
extent.  References can refer to parts of extents (the reference has an
offset and length field), so when an extent is partially overwritten, the
extent is not modified.  Only the reference is modified, to make it refer
to a subset of the extent (references in other snapshots are not changed,
and the extent data itself is immutable).  This makes POSIX fast, but it
creates some headaches related to garbage collection, dedupe, defrag, etc.
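
A simplified illustration of that (not the real on-disk format): a
partial overwrite narrows the live reference while the extent and the
snapshot's reference stay untouched:

    # Simplified illustration of a partial overwrite: the immutable extent is
    # untouched, and only the reference in the current subvol is narrowed (the
    # snapshot's reference still covers the whole extent). Not real btrfs code.
    from dataclasses import dataclass

    @dataclass
    class Ref:
        extent_vaddr: int     # which extent the data lives in
        extent_offset: int    # start of the referenced range inside the extent
        num_bytes: int        # length of the referenced range

    old_extent = {"vaddr": 0x40000000, "length": 128 * 1024}   # immutable data
    snapshot_ref = Ref(0x40000000, 0, 128 * 1024)              # untouched snapshot
    live_ref = Ref(0x40000000, 0, 128 * 1024)

    # Overwrite the first 16 KB of the file in the live subvol: a new extent is
    # written elsewhere, and the live reference is trimmed to the surviving tail.
    new_extent = {"vaddr": 0x50000000, "length": 16 * 1024}
    live_ref = Ref(old_extent["vaddr"], 16 * 1024, 112 * 1024)
    new_ref = Ref(new_extent["vaddr"], 0, 16 * 1024)

    # The 16 KB at the head of old_extent is now unreachable from the live
    # subvol but still allocated (the snapshot still references the whole extent).
    print(snapshot_ref.num_bytes, live_ref.num_bytes, new_ref.num_bytes)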

> > You have to replace the whole extent with references to data from
> > somewhere else, creating data copies as required to do so where no
> > duplicate copy of the data is available for reflink.
> > 
> > Note the phrase "on btrfs" appears often here...other filesystems manage
> > to solve these problems without special effort.  Again, if you're looking
> > for important btrfs things to work on, maybe start with in-place extent
> > splitting.
> 
> I think that I'll start with "software design document for on-demand defrag
> which preserves sharing structure". I have figure out that you don't have it
> yet. And, how can you even start working on a defrag without a software
> design document?
> 
> So I volunteer to write it. Apparently, I'm already half way done.
> 
> > On XFS you can split extents in place and reference counting is by
> > block, so you can do alternating defrag and dedupe passes.  It's still
> > suboptimal (you still waste iops to defrag data blocks that are
> > immediately eliminated by the following dedupe), but it's orders of
> > magnitude better than btrfs.
> 
> I'll reply to the rest of this marathonic post in another reply (when I find
> the time to read it). Because I'm writing the software design document.
> 
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-16 22:51                                                   ` Zygo Blaxell
@ 2019-09-17  1:03                                                     ` General Zed
  2019-09-17  1:34                                                       ` General Zed
                                                                         ` (2 more replies)
  2019-09-17  3:10                                                     ` General Zed
  1 sibling, 3 replies; 111+ messages in thread
From: General Zed @ 2019-09-17  1:03 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Sun, Sep 15, 2019 at 01:54:07PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
>> > >
>> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > >
>> > > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>> > > > >
>> > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > > > >
>> > > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > > > > > >
>> > > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > > > > > >
>> > > > > > > > Don't forget you have to write new checksum and free space
>> > > tree pages.
>> > > > > > > > In the worst case, you'll need about 1GB of new metadata pages
>> > > > > for each
>> > > > > > > > 128MB you defrag (though you get to delete 99.5% of them
>> > > immediately
>> > > > > > > > after).
>> > > > > > >
>> > > > > > > Yes, here we are debating some worst-case scenaraio which is
>> > > actually
>> > > > > > > imposible in practice due to various reasons.
>> > > > > >
>> > > > > > No, it's quite possible.  A log file written slowly on an active
>> > > > > > filesystem above a few TB will do that accidentally.  Every
>> > > now and then
>> > > > > > I hit that case.  It can take several hours to do a logrotate
>> > > on spinning
>> > > > > > arrays because of all the metadata fetches and updates  
>> associated with
>> > > > > > worst-case file delete.  Long enough to watch the delete  
>> happen, and
>> > > > > > even follow along in the source code.
>> > > > > >
>> > > > > > I guess if I did a proactive defrag every few hours, it might
>> > > take less
>> > > > > > time to do the logrotate, but that would mean spreading  
>> out all the
>> > > > > > seeky IO load during the day instead of getting it all  
>> done at night.
>> > > > > > Logrotate does the same job as defrag in this case (replacing
>> > > a file in
>> > > > > > thousands of fragments spread across the disk with a few large
>> > > fragments
>> > > > > > close together), except logrotate gets better compression.
>> > > > > >
>> > > > > > To be more accurate, the example I gave above is the  
>> worst case you
>> > > > > > can expect from normal user workloads.  If I throw in  
>> some reflinks
>> > > > > > and snapshots, I can make it arbitrarily worse, until the  
>> entire disk
>> > > > > > is consumed by the metadata update of a single extent defrag.
>> > > > > >
>> > > > >
>> > > > > I can't believe I am considering this case.
>> > > > >
>> > > > > So, we have a 1TB log file "ultralog" split into 256 million 4
>> > > KB extents
>> > > > > randomly over the entire disk. We have 512 GB free RAM and  
>> 2% free disk
>> > > > > space. The file needs to be defragmented.
>> > > > >
>> > > > > In order to do that, defrag needs to be able to copy-move
>> > > multiple extents
>> > > > > in one batch, and update the metadata.
>> > > > >
>> > > > > The metadata has a total of at least 256 million entries, each
>> > > of some size,
>> > > > > but each one should hold at least a pointer to the extent (8
>> > > bytes) and a
>> > > > > checksum (8 bytes): In reality, it could be that there is a  
>> lot of other
>> > > > > data there per entry.
>> > > >
>> > > > It's about 48KB per 4K extent, plus a few hundred bytes on average
>> > > for each
>> > > > reference.
>> > >
>> > > Sorry, could you be more clear there? An file fragment/extent that holds
>> > > file data can be any
>> > > size up to 128 MB. What metadata is there per every file  
>> fragment/extent?
>> > >
>> > > Because "48 KB per 4 K extent" ... cannot decode what you mean.
>> >
>> > An extent has 3 associated records in btrfs, not including its references.
>> > The first two exist while the extent exists, the third appears after it
>> > is removed.
>> >
>> > 	- extent tree:  location, size of extent, pointers to backref trees.
>> > 	Length is around 60 bytes plus the size of the backref pointer list.
>>
>> Wait.. and where are the reflinks? Backrefs are there for going up the tree,
>> but where are reflinks for going down the tree?
>
> Reflinks are the forward references--there is no other kind of forward
> reference in btrfs (contrast with other filesystems which use one data
> structure for single references and another for multiple references).
>
> There are two distinct objects with similar names:  extent data items,
> and extent ref items.
>
> A file consists of an inode item followed by extent ref items (aka
> reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
> pages can be shared with other subvol trees to make snapshots.

Ok, so a reflink contains a virtual address. Did I get that right?

All extent ref items are reflinks which contain a 4 KB aligned address  
because the extents have that same alignment. Did I get that right?

Virtual addresses are 8-bytes in size?

I hope that virtual addresses are not wasteful of address space (that
is, that many top bits in an 8-byte virtual address are all zero).

> Extent data items are stored in a single tree (with other trees using
> the same keys) that just lists which parts of the filesystem are occupied,
> how long they are, and what data/metadata they contain.  Each extent
> item contains a list of references to one of four kinds of object that
> refers to the extent item (aka backrefs).  The free space tree is the
> inverse of the extent data tree.

Ok, so there is an "extent tree" keyed by virtual addresses. Items  
there contain extent data.

But, how are nodes in this extent tree addressed (how do you travel  
from the parent to the child)? I guess by full virtual address, i.e.  
by a reflink, but this reflink can point within-extent, meaning its  
address is not 4 KB aligned.

Or, an alternative explanation:
each whole metadata extent is a single node. Each node is often  
half-full to allow for various tree operations to be performed. Due to  
there being many items per each node, there is additional CPU  
processing effort required when updating a node.

> Each extent ref item is a reference to an extent data item, but it
> also contains all the information required to access the data.  For
> normal read operations the extent data tree can be ignored (though
> you still need to do a lookup in the csum tree to verify csums.

So, for normal reads, the information in subvol tree is sufficient.

>> So, you are saying that backrefs are already in the extent tree (or
>> reachable from it). I didn't know that, that information makes my defrag
>> much simpler to implement and describe. Someone in this thread has
>> previously mislead me to believe that backref information is not easily
>> available.
>
> The backref isn't a precise location--it just tells you which metadata
> blocks are holding at least one reference to the extent.  Some CPU
> and linear searching has to be done to resolve that fully to an (inode,
> offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
> go faster, because you don't need to update the extent tree again when
> you do some operations on the forward ref side, even though they add or
> remove references.  e.g. creating a snapshot does not change the backrefs
> list on individual extents--it creates two roots sharing a subset of the
> subvol trees' branches.

This reads like a major fu**** to me.

I don't get it. If a backref doesn't point to an exact item, then the CPU
has to scan the entire 16 KB metadata extent to find the matching
reflink. However, this would imply that all the items in a metadata  
extent are always valid (not stale from older versions of metadata).  
This then implies that, when an item of a metadata extent is updated,  
all the parents of all the items in the same extent have to be  
updated. Now, that would be such a waste, wouldn't it? Especially if  
the metadata extent is allowed to contain stale items.

An alternative explanation: all the b-trees have 16 KB nodes, where  
each node matches a metadata extent. Therefore, the entire node has a  
single parent in a particular tree.

This means all virtual addresses are always 4 K aligned, furthermore,  
all virtual addresses that point to metadata extents are 16 K aligned.

16 KB is pretty big for a tree node. I wonder why this size was
selected vs. 4 KB nodes? But, it doesn't matter.
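
To make the "linear search inside a metadata block" step concrete, here is a
minimal sketch of resolving a backref to (inode, offset) pairs, in Python over
a toy in-memory model (Key, Item and LeafNode are invented for illustration,
not btrfs's real structures; only the item type constant is taken from the
on-disk format):

    from dataclasses import dataclass
    from typing import List, Tuple

    EXTENT_DATA = 108   # item type of file extent items in the on-disk format

    @dataclass
    class Key:
        objectid: int   # inode number, for EXTENT_DATA items
        type: int
        offset: int     # file offset, for EXTENT_DATA items

    @dataclass
    class Item:
        key: Key
        disk_bytenr: int   # virtual address of the data extent this item points to

    @dataclass
    class LeafNode:
        items: List[Item]  # a 16 KB leaf holds up to a few hundred items

    def resolve_backref(leaf: LeafNode, extent_vaddr: int) -> List[Tuple[int, int]]:
        """The backref only names the leaf; scan it linearly for refs to the extent."""
        return [(it.key.objectid, it.key.offset)
                for it in leaf.items
                if it.key.type == EXTENT_DATA and it.disk_bytenr == extent_vaddr]

    leaf = LeafNode(items=[Item(Key(257, EXTENT_DATA, 0), disk_bytenr=0x40000000),
                           Item(Key(258, EXTENT_DATA, 4096), disk_bytenr=0x40000000)])
    print(resolve_backref(leaf, 0x40000000))   # [(257, 0), (258, 4096)]

That per-leaf scan is the CPU cost being described; nothing about it requires
the other items in the leaf to be stale.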

>> > 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
>> > 	Length of item is number of extent data blocks * 4 bytes plus a
>> > 	168-bit header (ish...csums from adjacent extents may be packed
>> > 	using a shared header)
>> >
>> > 	- free space tree:  location, size of free space.  This appears
>> > 	when the extent is deleted.  It may be merged with adjacent
>> > 	records.  Length is maybe 20 bytes?
>> >
>> > Each page contains a few hundred items, so if there are a few hundred
>> > unrelated extents between extents in the log file, each log file extent
>> > gets its own metadata page in each tree.
>>
>> As far as I can understand it, the extents in the extent tree are indexed
>> (keyed) by inode&offset. Therefore, no matter how many unrelated extents
>> there are between (physical locations of data) extents in the log file, the
>> log file extent tree entries will (generally speaking) be localized, because
>> multiple extent entries (extent items) are bunched tohgether in one 16 KB
>> metadata extent node.
>
> No, extents in the extent tree are indexed by virtual address (roughly the
> same as physical address over small scales, let's leave the device tree
> out of it for now).  The subvol trees are organized the way you are
> thinking of.

So, I guess that the virtual-to-physical address translation tables  
are always loaded in memory and that this translation is very fast?  
And the translation in the opposite direction, too.

Anyway, thanks for explaining this all to me, makes it all much more clear.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17  1:03                                                     ` General Zed
@ 2019-09-17  1:34                                                       ` General Zed
  2019-09-17  1:44                                                       ` Chris Murphy
  2019-09-17  4:19                                                       ` Zygo Blaxell
  2 siblings, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-17  1:34 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS


Quoting General Zed <general-zed@zedlx.com>:

> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>
>> On Sun, Sep 15, 2019 at 01:54:07PM -0400, General Zed wrote:
>>>
>>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>
>>>> On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
>>>> >
>>>> > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>> >
>>>> > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>>>> > > >
>>>> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>> > > >
>>>> > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>>> > > > > >
>>>> > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>> > > > > >
>>>> > > > > > > Don't forget you have to write new checksum and free space
>>>> > tree pages.
>>>> > > > > > > In the worst case, you'll need about 1GB of new metadata pages
>>>> > > > for each
>>>> > > > > > > 128MB you defrag (though you get to delete 99.5% of them
>>>> > immediately
>>>> > > > > > > after).
>>>> > > > > >
>>>> > > > > > Yes, here we are debating some worst-case scenaraio which is
>>>> > actually
>>>> > > > > > imposible in practice due to various reasons.
>>>> > > > >
>>>> > > > > No, it's quite possible.  A log file written slowly on an active
>>>> > > > > filesystem above a few TB will do that accidentally.  Every
>>>> > now and then
>>>> > > > > I hit that case.  It can take several hours to do a logrotate
>>>> > on spinning
>>>> > > > > arrays because of all the metadata fetches and updates  
>>>> associated with
>>>> > > > > worst-case file delete.  Long enough to watch the delete  
>>>> happen, and
>>>> > > > > even follow along in the source code.
>>>> > > > >
>>>> > > > > I guess if I did a proactive defrag every few hours, it might
>>>> > take less
>>>> > > > > time to do the logrotate, but that would mean spreading  
>>>> out all the
>>>> > > > > seeky IO load during the day instead of getting it all  
>>>> done at night.
>>>> > > > > Logrotate does the same job as defrag in this case (replacing
>>>> > a file in
>>>> > > > > thousands of fragments spread across the disk with a few large
>>>> > fragments
>>>> > > > > close together), except logrotate gets better compression.
>>>> > > > >
>>>> > > > > To be more accurate, the example I gave above is the  
>>>> worst case you
>>>> > > > > can expect from normal user workloads.  If I throw in  
>>>> some reflinks
>>>> > > > > and snapshots, I can make it arbitrarily worse, until the  
>>>> entire disk
>>>> > > > > is consumed by the metadata update of a single extent defrag.
>>>> > > > >
>>>> > > >
>>>> > > > I can't believe I am considering this case.
>>>> > > >
>>>> > > > So, we have a 1TB log file "ultralog" split into 256 million 4
>>>> > KB extents
>>>> > > > randomly over the entire disk. We have 512 GB free RAM and  
>>>> 2% free disk
>>>> > > > space. The file needs to be defragmented.
>>>> > > >
>>>> > > > In order to do that, defrag needs to be able to copy-move
>>>> > multiple extents
>>>> > > > in one batch, and update the metadata.
>>>> > > >
>>>> > > > The metadata has a total of at least 256 million entries, each
>>>> > of some size,
>>>> > > > but each one should hold at least a pointer to the extent (8
>>>> > bytes) and a
>>>> > > > checksum (8 bytes): In reality, it could be that there is a  
>>>> lot of other
>>>> > > > data there per entry.
>>>> > >
>>>> > > It's about 48KB per 4K extent, plus a few hundred bytes on average
>>>> > for each
>>>> > > reference.
>>>> >
>>>> > Sorry, could you be more clear there? An file fragment/extent that holds
>>>> > file data can be any
>>>> > size up to 128 MB. What metadata is there per every file  
>>>> fragment/extent?
>>>> >
>>>> > Because "48 KB per 4 K extent" ... cannot decode what you mean.
>>>>
>>>> An extent has 3 associated records in btrfs, not including its references.
>>>> The first two exist while the extent exists, the third appears after it
>>>> is removed.
>>>>
>>>> 	- extent tree:  location, size of extent, pointers to backref trees.
>>>> 	Length is around 60 bytes plus the size of the backref pointer list.
>>>
>>> Wait.. and where are the reflinks? Backrefs are there for going up  
>>> the tree,
>>> but where are reflinks for going down the tree?
>>
>> Reflinks are the forward references--there is no other kind of forward
>> reference in btrfs (contrast with other filesystems which use one data
>> structure for single references and another for multiple references).
>>
>> There are two distinct objects with similar names:  extent data items,
>> and extent ref items.
>>
>> A file consists of an inode item followed by extent ref items (aka
>> reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
>> pages can be shared with other subvol trees to make snapshots.
>
> Ok, so a reflink contains a virtual address. Did I get that right?
>
> All extent ref items are reflinks which contain a 4 KB aligned  
> address because the extents have that same alignment. Did I get that  
> right?
>
> Virtual addresses are 8-bytes in size?
>
> I hope that virtual addresses are not wasteful of address space  
> (that is, many top bits in an 8 bit virtual address are all zero).
>
>> Extent data items are stored in a single tree (with other trees using
>> the same keys) that just lists which parts of the filesystem are occupied,
>> how long they are, and what data/metadata they contain.  Each extent
>> item contains a list of references to one of four kinds of object that
>> refers to the extent item (aka backrefs).  The free space tree is the
>> inverse of the extent data tree.
>
> Ok, so there is an "extent tree" keyed by virtual addresses. Items  
> there contain extent data.
>
> But, how are nodes in this extent tree addressed (how do you travel  
> from the parent to the child)? I guess by full virtual address, i.e.  
> by a reflink, but this reflink can point within-extent, meaning its  
> address is not 4 KB aligned.
>
> Or, an alternative explanation:
> each whole metadata extent is a single node. Each node is often  
> half-full to allow for various tree operations to be performed. Due  
> to there being many items per each node, there is additional CPU  
> processing effort required when updating a node.
>
>> Each extent ref item is a reference to an extent data item, but it
>> also contains all the information required to access the data.  For
>> normal read operations the extent data tree can be ignored (though
>> you still need to do a lookup in the csum tree to verify csums.
>
> So, for normal reads, the information in subvol tree is sufficient.
>
>>> So, you are saying that backrefs are already in the extent tree (or
>>> reachable from it). I didn't know that, that information makes my defrag
>>> much simpler to implement and describe. Someone in this thread has
>>> previously mislead me to believe that backref information is not easily
>>> available.
>>
>> The backref isn't a precise location--it just tells you which metadata
>> blocks are holding at least one reference to the extent.  Some CPU
>> and linear searching has to be done to resolve that fully to an (inode,
>> offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
>> go faster, because you don't need to update the extent tree again when
>> you do some operations on the forward ref side, even though they add or
>> remove references.  e.g. creating a snapshot does not change the backrefs
>> list on individual extents--it creates two roots sharing a subset of the
>> subvol trees' branches.
>
> This reads like a mayor fu**** to me.
>
> I don't get it. If a backref doesn't point to an exact item, than  
> CPU has to scan the entire 16 KB metadata extent to find the  
> matching reflink. However, this would imply that all the items in a  
> metadata extent are always valid (not stale from older versions of  
> metadata). This then implies that, when an item of a metadata extent  
> is updated, all the parents of all the items in the same extent have  
> to be updated. Now, that would be such a waste, wouldn't it?  
> Especially if the metadata extent is allowed to contain stale items.
>
> An alternative explanation: all the b-trees have 16 KB nodes, where  
> each node matches a metadata extent. Therefore, the entire node has  
> a single parent in a particular tree.
>
> This means all virtual addresses are always 4 K aligned,  
> furthermore, all virtual addresses that point to metadata extents  
> are 16 K aligned.
>
> 16 KB is a pretty big for a tree node. I wonder why was this size  
> selected vs. 4 KB nodes? But, it doesn't matter.
>
>>>> 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
>>>> 	Length of item is number of extent data blocks * 4 bytes plus a
>>>> 	168-bit header (ish...csums from adjacent extents may be packed
>>>> 	using a shared header)
>>>>
>>>> 	- free space tree:  location, size of free space.  This appears
>>>> 	when the extent is deleted.  It may be merged with adjacent
>>>> 	records.  Length is maybe 20 bytes?
>>>>
>>>> Each page contains a few hundred items, so if there are a few hundred
>>>> unrelated extents between extents in the log file, each log file extent
>>>> gets its own metadata page in each tree.
>>>
>>> As far as I can understand it, the extents in the extent tree are indexed
>>> (keyed) by inode&offset. Therefore, no matter how many unrelated extents
>>> there are between (physical locations of data) extents in the log file, the
>>> log file extent tree entries will (generally speaking) be  
>>> localized, because
>>> multiple extent entries (extent items) are bunched tohgether in one 16 KB
>>> metadata extent node.
>>
>> No, extents in the extent tree are indexed by virtual address (roughly the
>> same as physical address over small scales, let's leave the device tree
>> out of it for now).  The subvol trees are organized the way you are
>> thinking of.
>
> So, I guess that the virtual-to-physical address translation tables  
> are always loaded in memory and that this translation is very fast?  
> And the translation in the opposite direction, too.
>
> Anyway, thanks for explaining this all to me, makes it all much more clear.

Taking into account the explanation that b-tree nodes are sized 16 KB,  
this is what my imagined defrag would do:

This defrag makes batched updates to metadata. In most common cases, a
batched update (with a single commit) will have the following
properties:

- it moves many file data extents from one place to another. However,
the update is mostly localized in space (for common fragmentation
cases), meaning that only a low number of fragments per update will be
significantly scattered. Therefore, updates to the extent tree are
localized.

- the free space tree - mostly localized updates, for the same reason as above

- the checksum tree - the same as above

- the subvol tree - a (potentially big) number of files is affected
per update. In most cases of fragmentation, the dispersion of
fragments will not be high. However, due to the number of files, updating
this tree looks like the most performance-troubling part of all. The
problem is that while there is only a small dispersion per file,
when this gets multiplied by the number of files, it can get bad. Some
dispersion in this tree is to be expected. The update size can
effectively be limited by the size of changes to the subvol tree (in
complex cases).

- the expected dispersion in the subvol tree can get better after
multiple defrags, if the defrag is made to approximately order the
files on disk according to their position in the filesystem directory
tree. So, the files of the same directory should be grouped together
in disk sectors in order to speed up future defrags.

So, as each b-tree node is 16 KB, but I guess most nodes are only
half-full, the average node payload is about 8 KB. If the defrag
reserves 128 MB for a metadata update computation, it could update
about 16000 (16 thousand) metadata nodes per commit. It would be
interesting to try to predict how much file data, on average, is
referred to by such a 16000-node commit.
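
The rough arithmetic behind that node count, as a Python sketch (the
half-full assumption is the same guess as above; nothing here is measured):

    MiB, KiB = 1024 * 1024, 1024

    budget = 128 * MiB            # RAM reserved for one batched metadata update
    node_size = 16 * KiB          # btrfs nodesize
    avg_payload = node_size // 2  # assume nodes are roughly half-full

    nodes_per_commit = budget // avg_payload
    print(nodes_per_commit)       # 16384 -- the "16000 nodes" figure above

    # How much file data that covers depends on how those nodes split between
    # the extent, csum, free-space and subvol trees and on average extent size;
    # the "well over 500 MB" figure below is a guess, not derivable from this.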

Ok, let's do some pure guessing. For a home user, with a 256 GB root or
home partition, I would guess the average will be well over 500 MB of
file data per commit. It depends on how long it has been since the
last defrag. If it is a weekly or a daily defrag, then I guess well
over 1 GB of file data per commit.

And for a server, there are so many different kinds of servers, so  
I'll just skip the guessing.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17  1:03                                                     ` General Zed
  2019-09-17  1:34                                                       ` General Zed
@ 2019-09-17  1:44                                                       ` Chris Murphy
  2019-09-17  4:55                                                         ` Zygo Blaxell
  2019-09-17  4:19                                                       ` Zygo Blaxell
  2 siblings, 1 reply; 111+ messages in thread
From: Chris Murphy @ 2019-09-17  1:44 UTC (permalink / raw)
  To: General Zed; +Cc: Zygo Blaxell, Chris Murphy, Btrfs BTRFS

On Mon, Sep 16, 2019 at 7:03 PM General Zed <general-zed@zedlx.com> wrote:
>
> Ok, so a reflink contains a virtual address. Did I get that right?
>
> All extent ref items are reflinks which contain a 4 KB aligned address
> because the extents have that same alignment. Did I get that right?
>
> Virtual addresses are 8-bytes in size?
>
> I hope that virtual addresses are not wasteful of address space (that
> is, many top bits in an 8 bit virtual address are all zero).

All addresses in Btrfs are in a linear address space. This is all a
lot easier for everyone once you've familiarized yourself with a few
things:

https://btrfs.wiki.kernel.org/index.php/On-disk_Format

You probably don't need to know the literal on-disk format at a sector
level. There is a human-readable form available with 'btrfs
inspect-internal dump-tree'. Create a Btrfs file system, and dump the
tree so you can see what it looks like totally empty. Mount it. Copy
over a file. Unmount it. Dump tree. You don't actually have to unmount
it, but doing so keeps Btrfs's regular commit intervals from moving
things around underneath you. Make a directory. Dump tree. Add a file
to the directory. Dump tree. Move the file. Dump tree.

You'll see the relationship between the superblock, and all the trees.
You'll see nodes and leaves, and figure out the difference between
them.

Make a reflink. Dump tree. Make a snapshot. Dump tree.
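
A scripted version of that exercise might look like the following Python
sketch. It drives the standard btrfs-progs commands, needs root and
btrfs-progs installed, and the image and mount point paths are just examples:

    import os
    import subprocess

    IMG, MNT = "/tmp/btrfs-play.img", "/tmp/btrfs-play"

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def dump_tree(label):
        print("=== dump-tree:", label, "===")
        run("btrfs", "inspect-internal", "dump-tree", IMG)

    # 1 GiB sparse image, fresh filesystem, baseline dump.
    with open(IMG, "wb") as f:
        f.truncate(1 << 30)
    run("mkfs.btrfs", "-f", IMG)
    dump_tree("empty filesystem")

    os.makedirs(MNT, exist_ok=True)
    run("mount", "-o", "loop", IMG, MNT)
    run("cp", "/etc/hostname", f"{MNT}/file1")
    run("umount", MNT)                      # unmount so the dump is stable
    dump_tree("after copying one file")

    run("mount", "-o", "loop", IMG, MNT)
    run("cp", "--reflink=always", f"{MNT}/file1", f"{MNT}/file1.reflink")
    run("btrfs", "subvolume", "snapshot", MNT, f"{MNT}/snap1")
    run("umount", MNT)
    dump_tree("after reflink and snapshot")

Each dump is large; redirect the output of each step to a file and diff the
files to see exactly which trees changed.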


> > Extent data items are stored in a single tree (with other trees using
> > the same keys) that just lists which parts of the filesystem are occupied,
> > how long they are, and what data/metadata they contain.  Each extent
> > item contains a list of references to one of four kinds of object that
> > refers to the extent item (aka backrefs).  The free space tree is the
> > inverse of the extent data tree.
>
> Ok, so there is an "extent tree" keyed by virtual addresses. Items
> there contain extent data.
>
> But, how are nodes in this extent tree addressed (how do you travel
> from the parent to the child)? I guess by full virtual address, i.e.
> by a reflink, but this reflink can point within-extent, meaning its
> address is not 4 KB aligned.
>
> Or, an alternative explanation:
> each whole metadata extent is a single node. Each node is often
> half-full to allow for various tree operations to be performed. Due to
> there being many items per each node, there is additional CPU
> processing effort required when updating a node.

Reflinks are like a file-based snapshot: they're a file with its own
inode and the other metadata you expect for a file, but it points to the
same extents as another file. Offhand I'm not sure, other than age, if
there's any difference between the structure of the original file and
a reflink of that file. Find out. Make a reflink, dump tree. Delete
the original file. Dump tree.




>
> > Each extent ref item is a reference to an extent data item, but it
> > also contains all the information required to access the data.  For
> > normal read operations the extent data tree can be ignored (though
> > you still need to do a lookup in the csum tree to verify csums.
>
> So, for normal reads, the information in subvol tree is sufficient.

A subvolume is a file tree. A snapshot is a prepopulated subvolume.
It's interesting to snapshot a subvolume. Dump tree. Modify just one
thing in the snapshot. Dump tree.



>
> >> So, you are saying that backrefs are already in the extent tree (or
> >> reachable from it). I didn't know that, that information makes my defrag
> >> much simpler to implement and describe. Someone in this thread has
> >> previously mislead me to believe that backref information is not easily
> >> available.
> >
> > The backref isn't a precise location--it just tells you which metadata
> > blocks are holding at least one reference to the extent.  Some CPU
> > and linear searching has to be done to resolve that fully to an (inode,
> > offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
> > go faster, because you don't need to update the extent tree again when
> > you do some operations on the forward ref side, even though they add or
> > remove references.  e.g. creating a snapshot does not change the backrefs
> > list on individual extents--it creates two roots sharing a subset of the
> > subvol trees' branches.
>
> This reads like a mayor fu**** to me.
>
> I don't get it. If a backref doesn't point to an exact item, than CPU
> has to scan the entire 16 KB metadata extent to find the matching
> reflink. However, this would imply that all the items in a metadata
> extent are always valid (not stale from older versions of metadata).
> This then implies that, when an item of a metadata extent is updated,
> all the parents of all the items in the same extent have to be
> updated. Now, that would be such a waste, wouldn't it? Especially if
> the metadata extent is allowed to contain stale items.
>
> An alternative explanation: all the b-trees have 16 KB nodes, where
> each node matches a metadata extent. Therefore, the entire node has a
> single parent in a particular tree.
>
> This means all virtual addresses are always 4 K aligned, furthermore,
> all virtual addresses that point to metadata extents are 16 K aligned.
>
> 16 KB is a pretty big for a tree node. I wonder why was this size
> selected vs. 4 KB nodes? But, it doesn't matter.

4KiB used to be the default; 16KiB was benchmarked, found to be faster,
and became the new default. Nodes can optionally be 32K or 64K on x86.

Btrfs filesystem sector size must match arch pagesize. And nodesize
can't be smaller than filesystem sector size. And leaf size must be
the same as nodesize.
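
Those constraints, written out as a small Python check (a sketch for a
typical 4 KiB-page x86-64 box; the set of allowed node sizes reflects the
sizes mentioned above, not an exhaustive table):

    import os

    VALID_NODE_SIZES = {4096, 8192, 16384, 32768, 65536}   # 16K is the current default

    def check_geometry(sectorsize: int, nodesize: int, leafsize: int) -> None:
        pagesize = os.sysconf("SC_PAGESIZE")
        assert sectorsize == pagesize, "sectorsize must match the arch page size"
        assert nodesize >= sectorsize, "nodesize can't be smaller than sectorsize"
        assert leafsize == nodesize, "leafsize must equal nodesize"
        assert nodesize in VALID_NODE_SIZES, "nodesize must be a power of two, 4K-64K"

    check_geometry(4096, 16384, 16384)   # the common x86-64 default geometry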

> So, I guess that the virtual-to-physical address translation tables
> are always loaded in memory and that this translation is very fast?
> And the translation in the opposite direction, too.

That's the job of the chunk tree and device tree. And that's how the
multiple-device support magic happens: files and extent
information don't have to deal with where the data physically is;
that's the job of other trees.
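
A minimal sketch of that translation (Python; a toy sorted chunk map standing
in for the chunk tree, with a RAID1-style mirrored chunk -- striped profiles
are more involved):

    import bisect
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Chunk:
        logical: int                     # start of the block group in virtual address space
        length: int                      # typically 1 GiB
        stripes: List[Tuple[int, int]]   # (devid, physical offset) pairs, e.g. 2 for RAID1

    def logical_to_physical(chunks: List[Chunk], addr: int) -> List[Tuple[int, int]]:
        """Map a virtual address to (devid, physical) pairs via the chunk map.
        `chunks` must be sorted by .logical (the chunk tree is keyed that way)."""
        starts = [c.logical for c in chunks]           # rebuilt per call; fine for a sketch
        i = bisect.bisect_right(starts, addr) - 1
        if i < 0 or addr >= chunks[i].logical + chunks[i].length:
            raise ValueError("address not mapped by any chunk")
        off = addr - chunks[i].logical
        return [(devid, phys + off) for devid, phys in chunks[i].stripes]

    # Example: one 1 GiB RAID1 chunk mirrored on devices 1 and 2.
    cm = [Chunk(logical=1 << 30, length=1 << 30, stripes=[(1, 0), (2, 0)])]
    print(logical_to_physical(cm, (1 << 30) + 4096))   # [(1, 4096), (2, 4096)]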

Add device. Dump tree. Do a balance convert to change to raid1 for
data and metadata. Dump tree.

It's sometimes also useful to dump the super which is in a sense the
top most part of the tree.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17  0:49                               ` Zygo Blaxell
@ 2019-09-17  2:30                                 ` General Zed
  2019-09-17  5:30                                   ` Zygo Blaxell
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-17  2:30 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Mon, Sep 16, 2019 at 07:42:51AM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > Your defrag ideas are interesting, but you should spend a lot more
>> > time learning the btrfs fundamentals before continuing.  Right now
>> > you do not understand what btrfs is capable of doing easily, and what
>> > requires such significant rework in btrfs to implement that the result
>> > cannot be considered the same filesystem.  This is impairing the quality
>> > of your design proposals and reducing the value of your contribution
>> > significantly.
>>
>> Ok, that was a shot at me; and I admit, guilty as charged. I barely have a
>> clue about btrfs.
>> Now it's my turn to shoot. Apparently, the people which are implementing the
>> btrfs defrag, or at least the ones that responded to my post, seem to have
>> no clue about how on-demand defrag solutions typically work. I had to
>> explain the usual tricks involved in the defragmentation, and it was like
>> talking to complete rookies. None of you even considered a full-featured
>> defrag solution, all that you are doing are some partial solutions.
>
> Take a look at btrfs RAID5/6 some time, if you want to see rookie mistakes...
>
>> And, you all got lost in implementation details. How many times have I been
>> told here that some operation cannot be performed, and then it turned out
>> the opposite. You have all sunk into some strange state of mind where every
>> possible excuse is being made in order not to start working on a better,
>> hollistic defrag solution.
>>
>> And you even misunderstood me when I said "hollistic defrag", you thought I
>> was talking about a full defrag. No. A full defrag is a defrag performed on
>> all the data. A holistic defrag can be performed on only some data, but it
>> is hollistic in the sense that it uses whole information about a filesystem,
>> not just a partial view of it. A holistic defrag is better than a partial
>> defrag: it is faster and produces better results, and it can defrag a wider
>> spectrum of cases. Why? Because a holistic defrag takes everything into
>> account.
>
> What I'm looking for is a quantitative approach.  Sort the filesystem
> regions by how bad they are (in terms of measurable negative outcomes
> like poor read performance, pathological metadata updates, and future
> allocation performance), then apply mitigation in increasing order of
> cost-benefit ratio (or at least filter by cost-benefit ratio if you can't
> sort without reading the whole filesystem) until a minimum threshold
> is reached, then stop.  This lets the mitigation scale according to
> the available maintenance window, i.e. if you have 5% of a day for
> maintenance, you attack the worst 5% of the filesystem, then stop.

Any good defrag solution should be able to prioritize the most
fragmented parts of the filesystem. That's a given.
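
Zygo's cost-benefit ordering above can be sketched directly (Python; Region,
its scoring fields and the thresholds are invented placeholders for whatever
metrics a real defrag would compute):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Region:
        name: str
        benefit: float   # e.g. estimated read-latency / metadata-update savings
        cost: float      # e.g. bytes that must be copied to fix it

    def plan_maintenance(regions: List[Region], io_budget: float,
                         min_ratio: float = 0.1) -> List[Region]:
        """Best-ratio-first ordering: sort by benefit/cost and stop when the
        maintenance window (io_budget) or the minimum ratio threshold is hit."""
        todo, spent = [], 0.0
        for r in sorted(regions, key=lambda r: r.benefit / r.cost, reverse=True):
            if r.benefit / r.cost < min_ratio or spent + r.cost > io_budget:
                break
            todo.append(r)
            spent += r.cost
        return todo

    work = plan_maintenance(
        [Region("log dir", benefit=50, cost=10),
         Region("vm image", benefit=5, cost=100),
         Region("mail spool", benefit=30, cost=20)],
        io_budget=40)
    print([r.name for r in work])   # ['log dir', 'mail spool']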

> In that respect I think we might be coming toward the same point, but
> from different directions:  you seem to think the problem is easy to
> solve at scale,

Yes!

> and I think that's impossible so I start from designs
> that make forward progress with a fixed allocation of resources.

Well, that's not useless, but it's kind of meh. Waste of time. Solve  
the problem like a real man! Shoot with thermonuclear weapons only!

>> So I think you should all inform yourself a little better about various
>> defrag algorithms and solutions that exist. Apparently, you all lost the
>> sight of the big picture. You can't see the wood from the trees.
>
> I can see the woods, but any solution that starts with "enumerate all
> the trees" will be met with extreme skepticism, unless it can do that
> enumeration incrementally.

I think that I'm close to a solution that only needs to scan the
free-space tree in its entirety at start. All other trees can be only
partially scanned. I mean, at start. As the defrag progresses, it will
go through all the trees (except in the case of defragging only a part of
the partition). If a partition is to be only partially defragged, then
the trees do not need to be read in their entirety. Only the free space tree
needs to be read in its entirety at start (and the virtual-physical address
translation trees, which are small, I guess).

>> Well, no. Perhaps the word "defrag" can have a wider and narrower sense. So
>> in a narrower sense, "defrag" means what you just wrote. In that sense, the
>> word "defrag" means practically the same as "merge", so why not just use the
>> word "merge" to remove any ambiguities. The "merge" is the only operation
>> that decreases the number of fragments (besides "delete"). Perhaps you meant
>> move&merge. But, commonly, the word "defrag" is used in a wider sense, which
>> is not the one you described.
>
> This is fairly common on btrfs:  the btrfs words don't mean the same as
> other words, causing confusion.  How many copies are there in a btrfs
> 4-disk raid1 array?

2 copies of everything, except the superblock which has 2-6 copies.
>
>> > > > Dedupe on btrfs also requires the ability to split and merge extents;
>> > > > otherwise, we can't dedupe an extent that contains a combination of
>> > > > unique and duplicate data.  If we try to just move references around
>> > > > without splitting extents into all-duplicate and all-unique extents,
>> > > > the duplicate blocks become unreachable, but are not  
>> deallocated.  If we
>> > > > only split extents, fragmentation overhead gets bad.  Before creating
>> > > > thousands of references to an extent, it is worthwhile to  
>> merge it with
>> > > > as many of its neighbors as possible, ideally by picking the biggest
>> > > > existing garbage-free extents available so we don't have to do defrag.
>> > > > As we examine each extent in the filesystem, it may be best to send
>> > > > to defrag, dedupe, or garbage collection--sometimes more than one of
>> > > > those.
>> > >
>> > > This is sovled simply by always running defrag before dedupe.
>> >
>> > Defrag and dedupe in separate passes is nonsense on btrfs.
>>
>> Defrag can be run without dedupe.
>
> Yes, but if you're planning to run both on the same filesystem, they
> had better be aware of each other.

On-demand defrag doesn't need to be aware of on-demand dedupe. Or,  
only in the sense that dedupe should be shut down while defrag is  
running.

Perhaps you were referring to an on-the-fly dedupe. In that case, yes.

>> Now, how to organize dedupe? I didn't think about it yet. I'll leave it to
>> you, but it seems to me that defrag should be involved there. And, my defrag
>> solution would help there very, very much.
>
> I can't see defrag in isolation as anything but counterproductive to
> dedupe (and vice versa).

Share-preserving defrag can't be harmful to dedupe.

> A critical feature of the dedupe is to do extent splits along duplicate
> content boundaries, so that you're not limited to deduping only
> whole-extent matches.  This is especially necessary on btrfs because
> you can't split an extent in place--if you find a partial match,
> you have to find a new home for the unique data, which means you
> get a lot of little fragments that are inevitably distant from their
> logically adjacent neighbors which themselves were recently replaced
> with a physically distant identical extent.
>
> Sometimes both copies of the data suck (both have many fragments
> or uncollected garbage), and at that point you want to do some
> preprocessing--copy the data to make the extent you want, then use
> dedupe to replace both bad extents with your new good one.  That's an
> opportunistic extent merge and it needs some defrag logic to do proper
> cost estimation.
>
> If you have to copy 64MB of unique data to dedupe a 512K match, the extent
> split cost is far higher than if you have a 2MB extent with 512K match.
> So there should be sysadmin-tunable parameters that specify how much
> to spend on diminishing returns:  maybe you don't deduplicate anything
> that saves less than 1% of the required copy bytes, because you have
> lots of duplicates in the filesystem and you are willing to spend 1% of
> your disk space to not be running dedupe all day.  Similarly you don't
> defragment (or move for any reason) extents unless that move gives you
> significantly better read performance or consolidate diffuse allocations
> across metadata pages, because there are millions of extents to choose
> from and it's not necessary to optimize them all.
>
> On the other hand, if you find you _must_ move the 64MB of data for
> other reasons (e.g. to consolidate free space) then you do want to do
> the dedupe because it will make the extent move slightly faster (63.5MB
> of data + reflink instead of 64MB copy).  So you definitely want one
> program looking at both things.
>
> Maybe there's a way to plug opportunistic dedupe into a defrag algorithm
> the same way there's a way to plug opportunistic defrag into a dedupe
> algorithm.  I don't know, I'm coming at this from the dedupe side.
> If the solution looks anything like "run both separately" then I'm
> not interested.

I would suggest one of the two following simple solutions:
    a) the on-demand defrag should be run BEFORE AND AFTER the  
on-demand dedupe.
or b) the on-demand defrag should be run BEFORE the on-demand dedupe,  
and on-demand dedupe uses defrag functionality to defrag while dedupe  
is in progress.

So I guess you were thinking about solution b) all the time when
you said that dedupe and defrag need to be related.

>> > Extent splitting in-place is not possible on btrfs, so extent boundary
>> > changes necessarily involve data copies.  Reference counting is done
>> > by extent in btrfs, so it is only possible to free complete extents.
>>
>> Great, there is reference counting in btrfs. That helps. Good design.
>
> Well, I say "reference counting" because I'm simplifying for an audience
> that does not yet all know the low-level details.  The counter, such as
> it is, gives values "zero" or "more than zero."  You never know exactly
> how many references there are without doing the work to enumerate them.
> The "is extent unique" function in btrfs runs the enumeration loop until
> the second reference is found or the supply of references is exhausted,
> whichever comes first.  It's a tradeoff to make snapshots fast.

Well, that's a disappointment.

> When a reference is created to a new extent, it refers to the entire
> extent.  References can refer to parts of extents (the reference has an
> offset and length field), so when an extent is partially overwritten, the
> extent is not modified.  Only the reference is modified, to make it refer
> to a subset of the extent (references in other snapshots are not changed,
> and the extent data itself is immutable).  This makes POSIX fast, but it
> creates some headaches related to garbage collection, dedupe, defrag, etc.

Ok, got it. Thanks.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-16 22:51                                                   ` Zygo Blaxell
  2019-09-17  1:03                                                     ` General Zed
@ 2019-09-17  3:10                                                     ` General Zed
  2019-09-17  4:05                                                       ` General Zed
  1 sibling, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-17  3:10 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Sun, Sep 15, 2019 at 01:54:07PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
>> > >
>> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > >
>> > > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>> > > > >
>> > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > > > >
>> > > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > > > > > >
>> > > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

I have a few questions for you.

1. Does the btrfs driver already keep a cache of all the mentioned
trees (extent tree, subvol tree, free space tree, csum tree) in RAM?

   I guess the answer is YES. This makes defrag easier to implement.

2. Is the size of those caches dynamically configurable?

   I guess the answer is YES. This makes defrag easier to implement.

3. Can the cache size of the csum tree be increased to approx. 128 MB,
for example?

   I guess the answer is YES. This makes defrag easier to implement.

4. This is a part of the subvol tree format that I don't understand. You said:

"A file consists of an inode item followed by extent ref items (aka
reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
pages can be shared with other subvol trees to make snapshots."

Those "extent ref items" must also be forming a tree. But how are  
nodes of this tree addressed? The inode&offset is the key, but how do  
you get an address of a child node?

Do nodes in the subvol tree equal the metadata extents? Or is this  
tree special in that regard? Because, it seems a bit ridiculous to  
have (for example) a 16 KB metadata extent for each directory or a 16  
K metadata extent in the subvol tree for a file that is only a few  
bytes in length.



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17  3:10                                                     ` General Zed
@ 2019-09-17  4:05                                                       ` General Zed
  0 siblings, 0 replies; 111+ messages in thread
From: General Zed @ 2019-09-17  4:05 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS


Quoting General Zed <general-zed@zedlx.com>:

> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>
>> On Sun, Sep 15, 2019 at 01:54:07PM -0400, General Zed wrote:
>>>
>>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>
>>>> On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
>>>> >
>>>> > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>> >
>>>> > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>>>> > > >
>>>> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>> > > >
>>>> > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>>> > > > > >
>>>> > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>
> I have a few questions for you.
>
> 1. Does the btrfs driver already keep a cache off all the mentioned  
> trees (entent tree, subvol tree, free space tree, csum tree) in RAM?
>
>   I guess the answer is YES. This makes defrag easier to implement.
>
> 2. Is the size of those caches dynamically configurable?
>
>   I guess the answer is YES. This makes defrag easier to implement.
>
> 3. Can the cache size of csum tree be increased to approx. 128 MB. ,  
> for example?
>
>   I guess the answer is YES. This makes defrag easier to implement.
>
> 4. This is a part of format of subvol tree that I don't understand. You said:
>
> "A file consists of an inode item followed by extent ref items (aka
> reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
> pages can be shared with other subvol trees to make snapshots."
>
> Those "extent ref items" must also be forming a tree. But how are  
> nodes of this tree addressed? The inode&offset is the key, but how  
> do you get an address of a child node?
>
> Do nodes in the subvol tree equal the metadata extents? Or is this  
> tree special in that regard? Because, it seems a bit ridiculous to  
> have (for example) a 16 KB metadata extent for each directory or a  
> 16 K metadata extent in the subvol tree for a file that is only a  
> few bytes in length.

Oh, I think I get the answer to 4.

There can be multiple directory items per metadata extent and multiple
inode items per metadata extent. But they all have the same parent.
So a file will just have an inode item, but this inode item will not
occupy the entire metadata extent. Also, there could be some smart
optimizations there so that the inode's reflinks-to-file-extents are
packed together with the inode header.

That means there are no stale entries in any metadata extents. There  
are only stale metadata extents i.e. unused metadata extents. But,  
they would be enumerated in the free extents tree, right? Did I get  
that right?



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17  1:03                                                     ` General Zed
  2019-09-17  1:34                                                       ` General Zed
  2019-09-17  1:44                                                       ` Chris Murphy
@ 2019-09-17  4:19                                                       ` Zygo Blaxell
  2 siblings, 0 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-17  4:19 UTC (permalink / raw)
  To: General Zed; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Sep 16, 2019 at 09:03:17PM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > Reflinks are the forward references--there is no other kind of forward
> > reference in btrfs (contrast with other filesystems which use one data
> > structure for single references and another for multiple references).
> > 
> > There are two distinct objects with similar names:  extent data items,
> > and extent ref items.
> > 
> > A file consists of an inode item followed by extent ref items (aka
> > reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
> > pages can be shared with other subvol trees to make snapshots.
> 
> Ok, so a reflink contains a virtual address. Did I get that right?

Virtual address of the extent (i.e. the beginning of the original contiguous
write) plus offset and length within the extent.  This allows an extent ref
to reference part of an extent.

> All extent ref items are reflinks which contain a 4 KB aligned address
> because the extents have that same alignment. Did I get that right?
> 
> Virtual addresses are 8-bytes in size?

The addresses are 8 bytes.  There's an offset and length in the ref which
IIRC are 4 bytes each (extents have several limits on their size, all below
2GB).  For compressed extents there's also a decompressed length.
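
As a toy model of those two kinds of item and their fields (Python; the field
names are simplified descriptions, not the on-disk member names):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ExtentDataItem:          # lives in the extent tree, keyed by virtual address
        vaddr: int                 # 8-byte virtual address of the original write
        length: int                # size of the whole extent (capped well below 2 GB)
        backrefs: List[int] = field(default_factory=list)  # metadata blocks holding refs

    @dataclass
    class ExtentRefItem:           # lives in a subvol tree, keyed by (inode, file offset)
        inode: int
        file_offset: int
        extent_vaddr: int          # which extent the data lives in
        extent_offset: int         # 4-byte offset into that extent
        num_bytes: int             # 4-byte length actually referenced

    def referenced_range(ref: ExtentRefItem) -> range:
        """Virtual-address byte range this reflink actually uses."""
        start = ref.extent_vaddr + ref.extent_offset
        return range(start, start + ref.num_bytes)

    ref = ExtentRefItem(inode=257, file_offset=0, extent_vaddr=0x40000000,
                        extent_offset=4096, num_bytes=8192)
    print(referenced_range(ref))   # range(1073745920, 1073754112)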

> I hope that virtual addresses are not wasteful of address space (that is,
> many top bits in an 8 bit virtual address are all zero).

Block groups that are deleted are never reissued the same virtual address,
though I know of no reason they couldn't be.

> > Extent data items are stored in a single tree (with other trees using
> > the same keys) that just lists which parts of the filesystem are occupied,
> > how long they are, and what data/metadata they contain.  Each extent
> > item contains a list of references to one of four kinds of object that
> > refers to the extent item (aka backrefs).  The free space tree is the
> > inverse of the extent data tree.
> 
> Ok, so there is an "extent tree" keyed by virtual addresses. Items there
> contain extent data.
> 
> But, how are nodes in this extent tree addressed (how do you travel from the
> parent to the child)? I guess by full virtual address, i.e. by a reflink,
> but this reflink can point within-extent, meaning its address is not 4 KB
> aligned.

Metadata items are also stored in extents.  There's a device tree to bootstrap.

End addresses of extent refs can be not 4K-aligned (i.e. the trailing
bytes at EOF).  All other addresses must be 4K-aligned.

> Or, an alternative explanation:
> each whole metadata extent is a single node. Each node is often half-full to
> allow for various tree operations to be performed. Due to there being many
> items per each node, there is additional CPU processing effort required when
> updating a node.
> 
> > Each extent ref item is a reference to an extent data item, but it
> > also contains all the information required to access the data.  For
> > normal read operations the extent data tree can be ignored (though
> > you still need to do a lookup in the csum tree to verify csums.
> 
> So, for normal reads, the information in subvol tree is sufficient.
> 
> > > So, you are saying that backrefs are already in the extent tree (or
> > > reachable from it). I didn't know that, that information makes my defrag
> > > much simpler to implement and describe. Someone in this thread has
> > > previously mislead me to believe that backref information is not easily
> > > available.
> > 
> > The backref isn't a precise location--it just tells you which metadata
> > blocks are holding at least one reference to the extent.  Some CPU
> > and linear searching has to be done to resolve that fully to an (inode,
> > offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
> > go faster, because you don't need to update the extent tree again when
> > you do some operations on the forward ref side, even though they add or
> > remove references.  e.g. creating a snapshot does not change the backrefs
> > list on individual extents--it creates two roots sharing a subset of the
> > subvol trees' branches.
> 
> This reads like a mayor fu**** to me.
> 
> I don't get it. If a backref doesn't point to an exact item, than CPU has to
> scan the entire 16 KB metadata extent to find the matching reflink. However,
> this would imply that all the items in a metadata extent are always valid
> (not stale from older versions of metadata). This then implies that, when an
> item of a metadata extent is updated, all the parents of all the items in
> the same extent have to be updated. Now, that would be such a waste,
> wouldn't it? Especially if the metadata extent is allowed to contain stale
> items.

All parents are updated in all trees that are updated.  It's the
wandering-trees filesystem.  The extent tree gets a _lot_ of updates,
because its items do change for every new subvol item that is created
or deleted.

But you're getting into the depths I'm not clear on.  There are 4 kinds of
parent node (or 4 branches in the case statement that follows backrefs).
Some of them (like snapshot leaf nodes) are very static, not changing
without creating new nodes all the way up to the top of a subvol tree.
Others refer fairly directly and specifically to a (subvol, inode, offset)
tuple.

> An alternative explanation: all the b-trees have 16 KB nodes, where each
> node matches a metadata extent. Therefore, the entire node has a single
> parent in a particular tree.
> 
> This means all virtual addresses are always 4 K aligned, furthermore, all
> virtual addresses that point to metadata extents are 16 K aligned.
> 
> 16 KB is a pretty big for a tree node. I wonder why was this size selected
> vs. 4 KB nodes? But, it doesn't matter.

Probably due to a performance benchmark someone ran around 2007-2009...?

It is useful for subvol trees where small files (2K or less after
compression) are stored inline.

> > > > 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
> > > > 	Length of item is number of extent data blocks * 4 bytes plus a
> > > > 	168-bit header (ish...csums from adjacent extents may be packed
> > > > 	using a shared header)
> > > >
> > > > 	- free space tree:  location, size of free space.  This appears
> > > > 	when the extent is deleted.  It may be merged with adjacent
> > > > 	records.  Length is maybe 20 bytes?
> > > >
> > > > Each page contains a few hundred items, so if there are a few hundred
> > > > unrelated extents between extents in the log file, each log file extent
> > > > gets its own metadata page in each tree.
> > > 
> > > As far as I can understand it, the extents in the extent tree are indexed
> > > (keyed) by inode&offset. Therefore, no matter how many unrelated extents
> > > there are between (physical locations of data) extents in the log file, the
> > > log file extent tree entries will (generally speaking) be localized, because
> > > multiple extent entries (extent items) are bunched tohgether in one 16 KB
> > > metadata extent node.
> > 
> > No, extents in the extent tree are indexed by virtual address (roughly the
> > same as physical address over small scales, let's leave the device tree
> > out of it for now).  The subvol trees are organized the way you are
> > thinking of.
> 
> So, I guess that the virtual-to-physical address translation tables are
> always loaded in memory and that this translation is very fast? And the
> translation in the opposite direction, too.

Block groups typically consist of 1GB chunks from each disk, so the
maximum number of block groups is typically total disk size / 1GB.
(It's possible to create smaller chunks if you do a lot of small
incrementing resizes but nobody does that).  So a 45TB filesystem might
have a 3.5MB translation btree.
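
The back-of-the-envelope behind that figure (Python; the ~80 bytes per chunk
item is an assumed average for key plus chunk item plus stripe records):

    TiB, GiB, MiB = 1 << 40, 1 << 30, 1 << 20

    fs_size = 45 * TiB
    block_group_size = 1 * GiB          # typical data chunk size
    bytes_per_chunk_item = 80           # assumed average per block group

    n_block_groups = fs_size // block_group_size              # 46080
    tree_bytes = n_block_groups * bytes_per_chunk_item
    print(n_block_groups, round(tree_bytes / MiB, 1), "MiB")   # 46080 3.5 MiB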

Translation in the other direction is rare--everything in btrfs uses the virtual
addresses.  IO error reports do the reverse transformation to turn it into a
(device id, sector) pair...that's the only use case I can think of.  Well,
I guess the block group allocator must deal with block addresses too.

> Anyway, thanks for explaining this all to me, makes it all much more clear.
> 
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17  1:44                                                       ` Chris Murphy
@ 2019-09-17  4:55                                                         ` Zygo Blaxell
  0 siblings, 0 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-17  4:55 UTC (permalink / raw)
  To: Chris Murphy; +Cc: General Zed, Btrfs BTRFS

On Mon, Sep 16, 2019 at 07:44:31PM -0600, Chris Murphy wrote:
> Reflinks are like a file based snapshot, they're a file with its own
> inode and other metadata you expect for a file, but points to the same
> extents as another file. Off hand I'm not sure other than age if
> there's any difference between the structure of the original file and
> a reflink of that file. Find out. Make a reflink, dump tree. Delete
> the original file. Dump tree.

Ehhhh...not really.

The clone-file ioctl is a wrapper around clone-file-range that fills
in 0 as the offset and the size of the file as the length when creating
reflinks.  clone-file-range copies all the extent reference items from
the src range to the dst range, replacing any extent reference items that
were present before.  It's O(n) in the number of source range extent refs,
and it doesn't have snapshot-style atomicity.

The offset for src and dst can both be non-zero, and src and dst inode
can be the same.  Src and dst cannot overlap, but they can be logically
adjacent, creating a logical-neighbor loop for physical extents.
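
For reference, calling clone-file-range from userspace looks roughly like this
(Python sketch; the ioctl number is derived from _IOW(0x94, 13, struct
file_clone_range) in linux/fs.h and the encoding shown is the x86-64 one --
verify against your own headers before relying on it):

    import fcntl
    import struct

    # struct file_clone_range { __s64 src_fd; __u64 src_offset;
    #                           __u64 src_length; __u64 dest_offset; }  (32 bytes)
    def _IOW(ioc_type: int, nr: int, size: int) -> int:
        return (1 << 30) | (size << 16) | (ioc_type << 8) | nr   # x86-64 _IOW encoding

    FICLONERANGE = _IOW(0x94, 13, 32)    # 0x4020940d

    def clone_range(src_path, dst_path, src_off, length, dst_off):
        """Share `length` bytes of src at src_off into dst at dst_off, no data copy.
        Offsets and length must be block-aligned; length == 0 means "to EOF",
        which is what the whole-file clone ioctl effectively passes."""
        with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
            arg = struct.pack("=qQQQ", src.fileno(), src_off, length, dst_off)
            fcntl.ioctl(dst.fileno(), FICLONERANGE, arg)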

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17  2:30                                 ` General Zed
@ 2019-09-17  5:30                                   ` Zygo Blaxell
  2019-09-17 10:07                                     ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-17  5:30 UTC (permalink / raw)
  To: General Zed; +Cc: linux-btrfs

On Mon, Sep 16, 2019 at 10:30:39PM -0400, General Zed wrote:
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > and I think that's impossible so I start from designs
> > that make forward progress with a fixed allocation of resources.
> 
> Well, that's not useless, but it's kind of meh. Waste of time. Solve the
> problem like a real man! Shoot with thermonuclear weapons only!

I have thermonuclear weapons:  the metadata trees in my filesystems.  ;)

> > > So I think you should all inform yourself a little better about various
> > > defrag algorithms and solutions that exist. Apparently, you all lost the
> > > sight of the big picture. You can't see the wood from the trees.
> > 
> > I can see the woods, but any solution that starts with "enumerate all
> > the trees" will be met with extreme skepticism, unless it can do that
> > enumeration incrementally.
> 
> I think that I'm close to a solution that only needs to scan the free-space
> tree in the entirety at start. All other trees can be only partially
> scanned. I mean, at start. As the defrag progresses, it will go through all
> the trees (except in case of defragging only a part of the partition). If a
> partition is to be only partially defragged, then the trees do not need to
> be red in entirety. Only the free space tree needs to be red in entirety at
> start (and the virtual-physical address translation trees, which are small,
> I guess).

I doubt that on a 50TB filesystem you need to read the whole tree...are
you going to globally optimize 50TB at once?  That will take a while.
Start with a 100GB sliding window, maybe.

> > This is fairly common on btrfs:  the btrfs words don't mean the same as
> > other words, causing confusion.  How many copies are there in a btrfs
> > 4-disk raid1 array?
> 
> 2 copies of everything, except the superblock which has 2-6 copies.

Good, you can enter the clubhouse.  A lot of new btrfs users are surprised
it's less than 4.

> > > > > This is sovled simply by always running defrag before dedupe.
> > > > Defrag and dedupe in separate passes is nonsense on btrfs.
> > > Defrag can be run without dedupe.
> > Yes, but if you're planning to run both on the same filesystem, they
> > had better be aware of each other.
> 
> On-demand defrag doesn't need to be aware of on-demand dedupe. Or, only in
> the sense that dedupe should be shut down while defrag is running.
> 
> Perhaps you were referring to an on-the-fly dedupe. In that case, yes.

My dedupe runs continuously (well, polling with incremental scan).
It doesn't shut down.

> > > Now, how to organize dedupe? I didn't think about it yet. I'll leave it to
> > > you, but it seems to me that defrag should be involved there. And, my defrag
> > > solution would help there very, very much.
> > 
> > I can't see defrag in isolation as anything but counterproductive to
> > dedupe (and vice versa).
> 
> Share-preserving defrag can't be harmful to dedupe.

Sure it can.  Dedupe needs to split extents by content, and btrfs only
supports that by copying.  If defrag is making new extents bigger before
dedupe gets to them, there is more work for dedupe when it needs to make
extents smaller again.
 
> I would suggest one of the two following simple solutions:
>    a) the on-demand defrag should be run BEFORE AND AFTER the on-demand
> dedupe.
> or b) the on-demand defrag should be run BEFORE the on-demand dedupe, and
> on-demand dedupe uses defrag functionality to defrag while dedupe is in
> progress.
> 
> So I guess you were thinking about the solution b) all the time when you
> said that dedupe and defrag need to be related.

Well, both would be running continuously in the same process, so
they would negotiate with each other as required.  Dedupe runs first
on new extents to create a plan for increasing extent sharing, then
defrag creates a plan for sufficient logical/physical contiguity of
those extents after dedupe has cut them into content-aligned pieces.
Extents that are entirely duplicate simply disappear and do not form
part of the defrag workload (at least until it is time to defragment
free space...).  Both plans are combined and optimized, then the final
data relocation command sequence is sent to the filesystem.
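
To make that flow concrete, here is a rough skeleton of the loop (every
type and function name below is invented for illustration; the real work
hides in the stubs):

/* Rough skeleton of the flow described above; every name is invented
 * for illustration, and the real work hides in the stubs. */
#include <stdio.h>

struct plan { int nr_ops; };   /* a list of planned extent operations */

static struct plan plan_dedupe(void)
{
	/* scan new extents, decide content-based splits and shares */
	struct plan p = { .nr_ops = 0 };
	return p;
}

static struct plan plan_defrag(const struct plan *after_dedupe)
{
	/* plan contiguity for what remains once duplicates disappear */
	struct plan p = { .nr_ops = after_dedupe->nr_ops };
	return p;
}

static void submit(const struct plan *dedupe, const struct plan *defrag)
{
	/* combine, optimize, and send one relocation sequence to the fs */
	printf("submitting %d ops\n", dedupe->nr_ops + defrag->nr_ops);
}

int main(void)
{
	struct plan d = plan_dedupe();      /* dedupe runs first            */
	struct plan f = plan_defrag(&d);    /* then defrag plans around it  */
	submit(&d, &f);
	return 0;
}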

> > > > Extent splitting in-place is not possible on btrfs, so extent boundary
> > > > changes necessarily involve data copies.  Reference counting is done
> > > > by extent in btrfs, so it is only possible to free complete extents.
> > > 
> > > Great, there is reference counting in btrfs. That helps. Good design.
> > 
> > Well, I say "reference counting" because I'm simplifying for an audience
> > that does not yet all know the low-level details.  The counter, such as
> > it is, gives values "zero" or "more than zero."  You never know exactly
> > how many references there are without doing the work to enumerate them.
> > The "is extent unique" function in btrfs runs the enumeration loop until
> > the second reference is found or the supply of references is exhausted,
> > whichever comes first.  It's a tradeoff to make snapshots fast.
> 
> Well, that's a disappointment.
> 
> > When a reference is created to a new extent, it refers to the entire
> > extent.  References can refer to parts of extents (the reference has an
> > offset and length field), so when an extent is partially overwritten, the
> > extent is not modified.  Only the reference is modified, to make it refer
> > to a subset of the extent (references in other snapshots are not changed,
> > and the extent data itself is immutable).  This makes POSIX fast, but it
> > creates some headaches related to garbage collection, dedupe, defrag, etc.
> 
> Ok, got it. Thanks.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17  5:30                                   ` Zygo Blaxell
@ 2019-09-17 10:07                                     ` General Zed
  2019-09-17 23:40                                       ` Zygo Blaxell
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-17 10:07 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Mon, Sep 16, 2019 at 10:30:39PM -0400, General Zed wrote:
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > and I think that's impossible so I start from designs
>> > that make forward progress with a fixed allocation of resources.
>>
>> Well, that's not useless, but it's kind of meh. Waste of time. Solve the
>> problem like a real man! Shoot with thermonuclear weapons only!
>
> I have thermonuclear weapons:  the metadata trees in my filesystems.  ;)
>
>> > > So I think you should all inform yourself a little better about various
>> > > defrag algorithms and solutions that exist. Apparently, you all lost
>> > > sight of the big picture. You can't see the wood for the trees.
>> >
>> > I can see the woods, but any solution that starts with "enumerate all
>> > the trees" will be met with extreme skepticism, unless it can do that
>> > enumeration incrementally.
>>
>> I think that I'm close to a solution that only needs to scan the free-space
>> tree in its entirety at the start. All other trees can be only partially
>> scanned. I mean, at the start. As the defrag progresses, it will go through
>> all the trees (except in the case of defragging only a part of the
>> partition). If a partition is to be only partially defragged, then the trees
>> do not need to be read in their entirety. Only the free-space tree needs to
>> be read in its entirety at the start (and the virtual-physical address
>> translation trees, which are small, I guess).
>
> I doubt that on a 50TB filesystem you need to read the whole tree...are
> you going to globally optimize 50TB at once?  That will take a while.

I need to read the whole free-space tree to find a few regions with  
most free space. Those will be used as destinations for defragmented  
data.

If a mostly free region of sufficient size (a few GB) can be found  
faster, then there is no need to read the entire free-space tree. But,  
on a disk with less than 15% free space, it would be advisable to read  
the entire free space tree to find the less-crowded regions of the  
filesystem.
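
To illustrate the selection step only (the struct and the sample input
below are made up, not actual btrfs structures; the scan that fills the
array is the hard part):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-block-group summary produced by the free-space tree
 * scan; this is not an actual btrfs structure. */
struct bg_summary {
	uint64_t start;        /* logical address of the block group */
	uint64_t size;         /* block group size in bytes          */
	uint64_t free_bytes;   /* free space recorded in the FST     */
};

static int emptiest_first(const void *a, const void *b)
{
	const struct bg_summary *x = a, *y = b;
	return (x->free_bytes < y->free_bytes) - (x->free_bytes > y->free_bytes);
}

int main(void)
{
	/* Example input; a real scan would fill this from the FST. */
	struct bg_summary bgs[] = {
		{ 0 * 1073741824ULL, 1073741824ULL, 120 << 20 },
		{ 1 * 1073741824ULL, 1073741824ULL, 900 << 20 },
		{ 2 * 1073741824ULL, 1073741824ULL, 350 << 20 },
	};
	size_t n = sizeof(bgs) / sizeof(bgs[0]);

	qsort(bgs, n, sizeof(bgs[0]), emptiest_first);
	/* Take the first few as destinations for defragmented data. */
	for (size_t i = 0; i < n && i < 2; i++)
		printf("destination: start=%llu free=%llu\n",
		       (unsigned long long)bgs[i].start,
		       (unsigned long long)bgs[i].free_bytes);
	return 0;
}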

> Start with a 100GB sliding window, maybe.

There will be something similar to a sliding window (in virtual
address space). The likely size of the window for "typical" desktops
is just around 1 GB, no more. On complicated filesystems, it will be
smaller. Really, you don't need a very big sliding window (little can
be gained by enlarging it); for a 256 GB drive, a small sliding window
is quite fine. The size of the sliding window can be dynamically
adjusted, depending on several factors, but mostly on available RAM
and filesystem complexity (the number of extents and reflinks in the
sliding window).

So, the defrag will be tunable by supplying the amount of RAM to use.
If you supply it with insufficient RAM, it will slow down
considerably. A minimum of 400 MB of RAM is recommended for typical
desktops. But this should be tested on an actual implementation; I'm
just guessing at this point. It could be better or worse.

This sliding window won't be a perfect one (it can have
discontinuities and fragments), and a small amount of data which is
not in the sliding window but lies in logically adjacent areas will
also be scanned.
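
Purely to illustrate the kind of tuning meant here, a toy heuristic
(every name and constant below is a guess picked to land near the 1 GB
figure above, not a measurement):

#include <stdint.h>
#include <stdio.h>

/* Illustration only: derive a sliding-window size from a RAM budget
 * and an estimated metadata density.  All constants are guesses. */
static uint64_t window_size(uint64_t ram_budget,       /* bytes of RAM we may use  */
			    uint64_t extents_per_gb,   /* estimated extent density */
			    uint64_t ram_per_extent)   /* bookkeeping cost, bytes  */
{
	const uint64_t GB = 1024ULL * 1024 * 1024;
	const uint64_t MIN_WIN = 128ULL * 1024 * 1024;  /* complicated fs: < 1 GB */
	const uint64_t MAX_WIN = 16 * GB;               /* little gained beyond   */
	uint64_t cost_per_gb = extents_per_gb * ram_per_extent;
	uint64_t win;

	if (cost_per_gb == 0)
		return GB;                      /* no estimate: default ~1 GB */
	win = (uint64_t)((double)ram_budget / (double)cost_per_gb * (double)GB);
	if (win < MIN_WIN)
		win = MIN_WIN;
	if (win > MAX_WIN)
		win = MAX_WIN;
	return win;
}

int main(void)
{
	/* e.g. 400 MB budget, ~200k extents/GB, ~2000 bytes of bookkeeping each */
	uint64_t w = window_size(400ULL << 20, 200000, 2000);
	printf("window: %llu MB\n", (unsigned long long)(w >> 20));
	return 0;
}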

So, I'm designing a defrag that is fast, can use little RAM, and can
work in low free-space conditions. It can work on huge filesystems,
can take on a good number of pathological cases, and preserves all
file data sharing.

I hope that at least someone will be satisfied.

>> > This is fairly common on btrfs:  the btrfs words don't mean the same as
>> > other words, causing confusion.  How many copies are there in a btrfs
>> > 4-disk raid1 array?
>>
>> 2 copies of everything, except the superblock which has 2-6 copies.
>
> Good, you can enter the clubhouse.  A lot of new btrfs users are surprised
> it's less than 4.
>
>> > > > > This is solved simply by always running defrag before dedupe.
>> > > > Defrag and dedupe in separate passes is nonsense on btrfs.
>> > > Defrag can be run without dedupe.
>> > Yes, but if you're planning to run both on the same filesystem, they
>> > had better be aware of each other.
>>
>> On-demand defrag doesn't need to be aware of on-demand dedupe. Or, only in
>> the sense that dedupe should be shut down while defrag is running.
>>
>> Perhaps you were referring to an on-the-fly dedupe. In that case, yes.
>
> My dedupe runs continuously (well, polling with incremental scan).
> It doesn't shut down.

Ah... so I suggest that the defrag should temporarily shut down  
dedupe, at least in the initial versions of defrag. Once both defrag  
and dedupe are working standalone, the merging effort can begin.

>> > > Now, how to organize dedupe? I didn't think about it yet. I'll leave it
>> > > to you, but it seems to me that defrag should be involved there. And, my
>> > > defrag solution would help there very, very much.
>> >
>> > I can't see defrag in isolation as anything but counterproductive to
>> > dedupe (and vice versa).
>>
>> Share-preserving defrag can't be harmful to dedupe.
>
> Sure it can.  Dedupe needs to split extents by content, and btrfs only
> supports that by copying.  If defrag is making new extents bigger before
> dedupe gets to them, there is more work for dedupe when it needs to make
> extents smaller again.
>
>> I would suggest one of the two following simple solutions:
>>    a) the on-demand defrag should be run BEFORE AND AFTER the on-demand
>> dedupe.
>> or b) the on-demand defrag should be run BEFORE the on-demand dedupe, and
>> on-demand dedupe uses defrag functionality to defrag while dedupe is in
>> progress.
>>
>> So I guess you were thinking about the solution b) all the time when you
>> said that dedupe and defrag need to be related.
>
> Well, both would be running continuously in the same process, so
> they would negotiate with each other as required.  Dedupe runs first
> on new extents to create a plan for increasing extent sharing, then
> defrag creates a plan for sufficient logical/physical contiguity of
> those extents after dedupe has cut them into content-aligned pieces.
> Extents that are entirely duplicate simply disappear and do not form
> part of the defrag workload (at least until it is time to defragment
> free space...).  Both plans are combined and optimized, then the final
> data relocation command sequence is sent to the filesystem.

I think that this kind of close dedupe-defrag integration should  
mostly be left to dedupe developers. First, both defrag and dedupe  
should work perfectly on their own. Then, an interface to defrag  
should be made available to dedupe developers. In particular, I think  
that the batch-update functionality (it takes lots of extents and an  
empty free space region, then writes defragmented extents to the given  
region) is of particular interest to dedupe.

>> > > > Extent splitting in-place is not possible on btrfs, so extent boundary
>> > > > changes necessarily involve data copies.  Reference counting is done
>> > > > by extent in btrfs, so it is only possible to free complete extents.
>> > >
>> > > Great, there is reference counting in btrfs. That helps. Good design.
>> >
>> > Well, I say "reference counting" because I'm simplifying for an audience
>> > that does not yet all know the low-level details.  The counter, such as
>> > it is, gives values "zero" or "more than zero."  You never know exactly
>> > how many references there are without doing the work to enumerate them.
>> > The "is extent unique" function in btrfs runs the enumeration loop until
>> > the second reference is found or the supply of references is exhausted,
>> > whichever comes first.  It's a tradeoff to make snapshots fast.
>>
>> Well, that's a disappointment.
>>
>> > When a reference is created to a new extent, it refers to the entire
>> > extent.  References can refer to parts of extents (the reference has an
>> > offset and length field), so when an extent is partially overwritten, the
>> > extent is not modified.  Only the reference is modified, to make it refer
>> > to a subset of the extent (references in other snapshots are not changed,
>> > and the extent data itself is immutable).  This makes POSIX fast, but it
>> > creates some headaches related to garbage collection, dedupe, defrag, etc.
>>
>> Ok, got it. Thanks.
>>
>>
>>
>>




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17 10:07                                     ` General Zed
@ 2019-09-17 23:40                                       ` Zygo Blaxell
  2019-09-18  4:37                                         ` General Zed
  0 siblings, 1 reply; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-17 23:40 UTC (permalink / raw)
  To: General Zed; +Cc: linux-btrfs

On Tue, Sep 17, 2019 at 06:07:24AM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > I doubt that on a 50TB filesystem you need to read the whole tree...are
> > you going to globally optimize 50TB at once?  That will take a while.
> 
> I need to read the whole free-space tree to find a few regions with most
> free space. Those will be used as destinations for defragmented data.

Hmmm...I have some filesystems with 2-4 million FST entries, but
they can be read in 30 seconds or less, even on the busy machines.

> If a mostly free region of sufficient size (a few GB) can be found faster,
> then there is no need to read the entire free-space tree. But, on a disk
> with less than 15% free space, it would be advisable to read the entire free
> space tree to find the less-crowded regions of the filesystem.

A filesystem with 15% free space still has 6 TB of contiguous space.
Not hard to find some room!  You can just look in the chunk tree; all
the unallocated space there is in multi-GB contiguous chunks.  Every btrfs
is guaranteed to have a chunk tree, but some don't have free space trees
(though the ones that don't should probably be strongly encouraged to
enable that feature).  So you probably don't even need to look for free
space if there is unallocated space.

On the other hand, you do need to measure fragmentation of the existing
free space, in order to identify the highest-priority areas for defrag.
So maybe you read the whole FST anyway, sort, and spit out a short list.
It's not nearly as expensive as I thought.

I did notice one thing while looking at filesystem metadata vs commit
latency the other day:  btrfs's allocation performance seems to be
correlated to the amount of free space _in the block groups_, not _on the
filesystem_.  So after deleting 2% of the files on a 50% full filesystem,
it runs as slowly as a 98% full one.  Then when you add 3% more data to fill
the free space and allocate some new block groups, it goes fast again.
Then you delete things and it gets slow again.  Rinse and repeat.

> > My dedupe runs continuously (well, polling with incremental scan).
> > It doesn't shut down.
> 
> Ah... so I suggest that the defrag should temporarily shut down dedupe, at
> least in the initial versions of defrag. Once both defrag and dedupe are
> working standalone, the merging effort can begin.

Pausing dedupe just postpones the inevitable.  The first thing dedupe
will do when it resumes is a new-data scan that will find all the new
defrag extents, because dedupe has to discover what's in them and update
the now out-of-date physical location data in the hash table.  When defrag
and dedupe are integrated, the hash table gets updated by defrag in
place--same table entries in a different location.
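
For illustration, "updated in place" amounts to something like this
(the entry layout below is invented just to show the idea; a real
table stores more and is indexed by the hash rather than scanned
linearly):

#include <stdint.h>
#include <stdio.h>

/* Invented layout: one entry per hashed block, keyed by content hash. */
struct hash_entry {
	uint8_t  digest[16];   /* content hash of the block              */
	uint64_t bytenr;       /* current physical location of the data  */
};

/* After defrag relocates a block from old_bytenr to new_bytenr, only the
 * location field of the matching entry changes.  No rescan of the data
 * is needed, because the content (and therefore the digest) is the same. */
static void note_relocation(struct hash_entry *table, size_t nentries,
			    uint64_t old_bytenr, uint64_t new_bytenr)
{
	for (size_t i = 0; i < nentries; i++)
		if (table[i].bytenr == old_bytenr)
			table[i].bytenr = new_bytenr;
}

int main(void)
{
	struct hash_entry table[2] = {
		{ { 0xab }, 1048576 },           /* block at its old location */
		{ { 0xcd }, 2097152 },
	};
	note_relocation(table, 2, 1048576, 134217728);  /* defrag moved it */
	printf("entry 0 now points at %llu\n",
	       (unsigned long long)table[0].bytenr);
	return 0;
}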

I have that problem _right now_ when using balance to defragment free
space in block groups.  Dedupe performance drops while the old relocated
data is rescanned--since it's old, we already found all the duplicates,
so the iops of the rescan are just repairing the damage to the hash
table that balance did.

That said...I also have a plan to make dedupe's new-data scans about
two orders of magnitude faster under common conditions.  So maybe in the
future dedupe won't care as much about rereading stuff, as rereading will
add at most 1% redundant read iops.  That still requires running dedupe
first (or in the kernel so it's in the write path), or having some way for
defrag to avoid touching recently added data before dedupe gets to it,
due to the extent-splitting duplicate work problem.

> I think that this kind of close dedupe-defrag integration should mostly be
> left to dedupe developers. 

That's reasonable--dedupe creates a specific form of fragmentation
problems.  Not fixing those is bad for dedupe performance (and performance
in general) so it's a logical extension of the dedupe function to take
care of them as we go.  I was working on it already.

> First, both defrag and dedupe should work
> perfectly on their own. 

You use the word "perfectly" in a strange way...

There are lots of btrfs dedupers that are optimized for different cases:
some are very fast for ad-hoc full-file dedupe, others are traditional
block-oriented filesystem-tree scanners that run on multiple filesystems.
There's an in-kernel one under development that...runs in the kernel
(honestly, running in the kernel is not as much of an advantage as you
might think).  I wrote a deduper that was designed to merely not die when
presented with a large filesystem and a 50%+ dupe hit rate (it ended
up being faster and more efficient than the others purely by accident,
but maybe that says more about the state of the btrfs dedupe art than
about the quality of my implementation).  I wouldn't call any of these
"perfect"--there are always some subset of users for which any of them
are unusable or there is a more suitable tool that performs better for
some special case.

There is similar specialization and variation among defrag algorithms
as well.  At best, any of them is "a good result given some choice
of constraints."

> Then, an interface to defrag should be made
> available to dedupe developers. In particular, I think that the batch-update
> functionality (it takes lots of extents and an empty free space region, then
> writes defragmented extents to the given region) is of particular interest
> to dedupe.

Yeah, I have a wishlist item for a kernel call that takes a list of
(virtual address, length) pairs and produces a single contiguous physical
extent containing the content of those pairs, updating all the reflinks
in the process.  Same for dedupe, but that one replaces all the extents
with reflinks to the first entry in the list instead of making a copy.

I guess the extent-merge call could be augmented with an address hint
for allocation, but experiments so far have indicated that the possible
gains are marginal at best given the current btrfs allocator behavior,
so I haven't bothered pursuing that.
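
For the record, the shape of that wishlist call could be something like
the following (entirely hypothetical; no such ioctl exists in the kernel
today, and every name here is made up):

#include <stdint.h>

/* Entirely hypothetical sketch of the wished-for interface. */
struct extent_merge_range {
	uint64_t logical;   /* "virtual address": file offset or btrfs
	                     * logical address, whichever a real design
	                     * would settle on */
	uint64_t length;    /* bytes to take from that address */
};

struct extent_merge_args {
	uint64_t flags;     /* e.g. copy-into-one-extent vs. dedupe mode:
	                     * reflink everything to the first range */
	uint64_t dest_hint; /* optional allocation hint, 0 = none; so far
	                     * experiments suggest only marginal gains */
	uint64_t nr_ranges;
	struct extent_merge_range ranges[];  /* content to place contiguously,
	                                      * reflinks updated by the kernel */
};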

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-17 23:40                                       ` Zygo Blaxell
@ 2019-09-18  4:37                                         ` General Zed
  2019-09-18 18:00                                           ` Zygo Blaxell
  0 siblings, 1 reply; 111+ messages in thread
From: General Zed @ 2019-09-18  4:37 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Tue, Sep 17, 2019 at 06:07:24AM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > I doubt that on a 50TB filesystem you need to read the whole tree...are
>> > you going to globally optimize 50TB at once?  That will take a while.
>>
>> I need to read the whole free-space tree to find a few regions with most
>> free space. Those will be used as destinations for defragmented data.
>
> Hmmm...I have some filesystems with 2-4 million FST entries, but
> they can be read in 30 seconds or less, even on the busy machines.
>
>> If a mostly free region of sufficient size (a few GB) can be found faster,
>> then there is no need to read the entire free-space tree. But, on a disk
>> with less than 15% free space, it would be advisable to read the entire free
>> space tree to find the less-crowded regions of the filesystem.
>
> A filesystem with 15% free space still has 6 TB of contiguous space.
> Not hard to find some room!  You can just look in the chunk tree; all
> the unallocated space there is in multi-GB contiguous chunks.

If there is a free chunk, defrag can take it. If there isn't, then it can't.

> Every btrfs
> is guaranteed to have a chunk tree, but some don't have free space trees
> (though the ones that don't should probably be strongly encouraged to
> enable that feature).  So you probably don't even need to look for free
> space if there is unallocated space.

Yes, but only in that specific case. The free space tree scan can be  
skipped in that lucky situation.

We are talking about the pre-defrag situation. It has to be assumed
that the free space is badly fragmented.

> On the other hand, you do need to measure fragmentation of the existing
> free space, in order to identify the highest-priority areas for defrag.
> So maybe you read the whole FST anyway, sort, and spit out a short list.
> It's not nearly as expensive as I thought.

The primary purpose of the free-space tree scan, in a low-free-space
situation (<10% free space), is to find an above-average empty area.

> I did notice one thing while looking at filesystem metadata vs commit
> latency the other day:  btrfs's allocation performance seems to be
> correlated to the amount of free space _in the block groups_, not _on the
> filesystem_.  So after deleting 2% of the files on a 50% full filesystem,
> it runs as slowly as a 98% full one.  Then when you add 3% more data to fill
> the free space and allocate some new block groups, it goes fast again.
> Then you delete things and it gets slow again.  Rinse and repeat.

Obviously, more work has to be done to improve that allocator.

>> > My dedupe runs continuously (well, polling with incremental scan).
>> > It doesn't shut down.
>>
>> Ah... so I suggest that the defrag should temporarily shut down dedupe, at
>> least in the initial versions of defrag. Once both defrag and dedupe are
>> working standalone, the merging effort can begin.
>
> Pausing dedupe just postpones the inevitable.  The first thing dedupe
> will do when it resumes is a new-data scan that will find all the new
> defrag extents, because dedupe has to discover what's in them and update
> the now out-of-date physical location data in the hash table.

It just postpones the inevitable, but you missed the point. The point  
of shutting down dedupe is to avoid nasty bugs caused by dedupe-defrag  
interaction.

> When defrag
> and dedupe are integrated, the hash table gets updated by defrag in
> place--same table entries in a different location.
>
> I have that problem _right now_ when using balance to defragment free
> space in block groups.  Dedupe performance drops while the old relocated
> data is rescanned--since it's old, we already found all the duplicates,
> so the iops of the rescan are just repairing the damage to the hash
> table that balance did.
>
> That said...I also have a plan to make dedupe's new-data scans about
> two orders of magnitude faster under common conditions.  So maybe in the
> future dedupe won't care as much about rereading stuff, as rereading will
> add at most 1% redundant read iops.  That still requires running dedupe
> first (or in the kernel so it's in the write path), or having some way for
> defrag to avoid touching recently added data before dedupe gets to it,
> due to the extent-splitting duplicate work problem.

The share-preserving defrag shouldn't interfere with dedupe because
defrag is run on demand, and it should then shut down dedupe until it
has completed. Therefore, the issues it causes for dedupe are only
temporary (and minor, really).

>> I think that this kind of close dedupe-defrag integration should mostly be
>> left to dedupe developers.
>
> That's reasonable--dedupe creates a specific form of fragmentation
> problems.  Not fixing those is bad for dedupe performance (and performance
> in general) so it's a logical extension of the dedupe function to take
> care of them as we go.  I was working on it already.
>
>> First, both defrag and dedupe should work
>> perfectly on their own.
>
> You use the word "perfectly" in a strange way...

What I meant by "perfectly" is that there are no serious bugs or
issues in either of them. They can work sub-optimally, but they must
work, not crash or hang. So, perhaps I should have said "both defrag
and dedupe should work without issues on their own".

Previously, I used the word "perfectly" in a different sense, but I
thought that this time the modified meaning would be understood from
the context.

> There are lots of btrfs dedupers that are optimized for different cases:
> some are very fast for ad-hoc full-file dedupe, others are traditional
> block-oriented filesystem-tree scanners that run on multiple filesystems.
> There's an in-kernel one under development that...runs in the kernel
> (honestly, running in the kernel is not as much of an advantage as you
> might think).  I wrote a deduper that was designed to merely not die when
> presented with a large filesystem and a 50%+ dupe hit rate (it ended
> up being faster and more efficient than the others purely by accident,
> but maybe that says more about the state of the btrfs dedupe art than
> about the quality of my implementation).  I wouldn't call any of these
> "perfect"--there are always some subset of users for which any of them
> are unusable or there is a more suitable tool that performs better for
> some special case.

Oh, I didn't mean "perfect" in the sense of "best possible results".
So, it was just a slight misunderstanding there.
The point is that dedupe-defrag integration should be attempted only  
after it is determined that both defrag and dedupe are working  
*without issues* on their own.

> There is similar specialization and variation among defrag algorithms
> as well.  At best, any of them is "a good result given some choice
> of constraints."

>> Then, an interface to defrag should be made
>> available to dedupe developers. In particular, I think that the batch-update
>> functionality (it takes lots of extents and an empty free space region, then
>> writes defragmented extents to the given region) is of particular interest
>> to dedupe.
>
> Yeah, I have a wishlist item for a kernel call that takes a list of
> (virtual address, length) pairs and produces a single contiguous physical
> extent containing the content of those pairs, updating all the reflinks
> in the process.  Same for dedupe, but that one replaces all the extents
> with reflinks to the first entry in the list instead of making a copy.
>
> I guess the extent-merge call could be augmented with an address hint
> for allocation, but experiments so far have indicated that the possible
> gains are marginal at best given the current btrfs allocator behaviour,
> so I haven't bothered pursuing that.

The "batch-update" from defrag should certainly trumps any  
"extent-merge". The defrag will do it all for you, you just supply the  
defrag with a list of extents that need to be defragmented.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: Feature requests: online backup - defrag - change RAID level
  2019-09-18  4:37                                         ` General Zed
@ 2019-09-18 18:00                                           ` Zygo Blaxell
  0 siblings, 0 replies; 111+ messages in thread
From: Zygo Blaxell @ 2019-09-18 18:00 UTC (permalink / raw)
  To: General Zed; +Cc: linux-btrfs

On Wed, Sep 18, 2019 at 12:37:42AM -0400, General Zed wrote:
> It just postpones the inevitable, but you missed the point. The point of
> shutting down dedupe is to avoid nasty bugs caused by dedupe-defrag
> interaction.

They are the same bugs you'll have to fix anyway.  Dedupe isn't
particularly special or different from the way that other btrfs write
operations work, and nobody wants to be locked out of their filesystem
during a big data relocation operation.  Even balance doesn't prevent
concurrent filesystem modification--it allows concurrent changes, and
restarts processing to take care of them when necessary.

> The share-preserving defrag shouldn't interfere with dedupe because defrag
> is run on-demand, and it should then shut down dedupe until it has
> completed. Therefore, the issues it causes to dedupe are only temporary (and
> minor, really).

On big filesystems fragmentation is a _continuous_ problem that gets
more expensive to fix the longer it is neglected.  So I'd naturally
expect both dedupe and defrag agents to be active at the same time,
or very closely interleaved, with concurrent data updates as well.
VM images and database files are hotspots for all three activities.

I'd expect a sysadmin interface more like balance, where the interface
looks like "run defrag now, relocating at most 20GB of data" and defrag
decides (possibly with hints from the admin) which 20GB of data out of
50TB of filesystem that should be.  Or I'd set the defrag limit to 1GB
per run, and run it as many times as possible in a maintenance window
until either the time runs out or defrag says "no further optimization
is practical or possible, you can stop looping now."

For small systems it doesn't matter--admin says "relocate at most
1TB of data" and new-and-improved defrag says "easy, that's the whole
filesystem."  But small systems are not very interesting.  On small
systems I can just run the current reflink-breaking file-oriented
recursive defrag and dedupe at the same time.  Dedupe can put the sharing
back almost as fast as current reflink-breaking defrag can break it,
the whole thing finishes in minutes, and I don't even bother calculating
how many iops were wasted because they were all free.

> > I guess the extent-merge call could be augmented with an address hint
> > for allocation, but experiments so far have indicated that the possible
> > gains are marginal at best given the current btrfs allocator behaviour,
> > so I haven't bothered pursuing that.
> 
> The "batch-update" from defrag should certainly trumps any "extent-merge".
> The defrag will do it all for you, you just supply the defrag with a list of
> extents that need to be defragmented.

Sure...either way, it's an ioctl that takes a list of extents to be
defragmented as an argument, and produces an output locally optimized
for minimal fragmentation, updating and removing references to the old
locations, etc.  Maybe it's a single list input and single extent output;
maybe it's a list of lists which produces multiple extent outputs; maybe
you can call it multiple times per commit and the kernel batches them up
when it does a flush; maybe it takes allocation hints in the argument
because the user already knows the best free space location; maybe it
finds its own contiguous space; maybe you pass it a file descriptor and
offset to a O_TMPFILE file where you already allocated the destination
area with fallocate; maybe there's two separate ioctls, one sets up the
destination area and the other moves extent data into it.  You can call
that ioctl(s) 'batch-update' or 'extent-merge' or whatever you like.
Figure out the requirements and make a proposal, we can see where the
proposals overlap or decompose them into common components.
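
The O_TMPFILE variant, at least, can already be sketched with existing
interfaces; only the final "move the extent data into it" ioctl is
missing.  A minimal setup-only example (the mount path is a placeholder,
and fallocate here just reserves space, it does not guarantee
contiguity):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Prepare a destination area for a hypothetical batch-update/extent-merge
 * call: an anonymous file on the target btrfs with space preallocated.
 * The ioctl that would relocate extent data into it does not exist yet;
 * this only shows the setup step discussed above. */
int main(void)
{
	const char *mnt = "/mnt/btrfs";            /* example mount point */
	off_t len = 1024 * 1024 * 1024;            /* 1 GiB destination   */

	int fd = open(mnt, O_TMPFILE | O_RDWR, 0600);
	if (fd < 0) {
		perror("open(O_TMPFILE)");
		return 1;
	}
	if (fallocate(fd, 0, 0, len) != 0) {
		perror("fallocate");
		close(fd);
		return 1;
	}
	printf("destination file ready: %lld bytes preallocated\n",
	       (long long)len);
	/* ...pass fd/offset to the future relocation ioctl here... */
	close(fd);
	return 0;
}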

Other questions you've asked indicate you are thinking of doing the bulk
of the defrag work in the kernel.  This is not a good idea.  It's hard
to allocate and effectively use large amounts of memory in the kernel
(for developers and users alike), and the in-kernel btrfs maintenance
tools so far have proven to make desirable administrative functions like
IO bandwidth management difficult to impossible.  Once code is integrated
into the kernel, it has to be kept around more or less forever, even if
it is half-finished, obviously awful, and nobody can use it seriously
without major redesign.  The current defrag ioctl is a prime example
of that, but balance and send are also good examples of things that
would have been better outside the kernel if the right interfaces had
been available at the time they were designed.  (Balance could also be
implemented in userspace with a batch-update data relocation ioctl, and
could easily avoid several problems with the current kernel implementation
in the process.)

It's much easier to do big-memory operations in userspace (not just the
technical operations themselves, but also getting the code accepted
in the kernel).  There are already interfaces for rapid ingestion
of the necessary metadata from userspace, so defrag's needs are
covered there.  What is currently missing is good kernel interfaces
for rapid implementation of the output of data relocation algorithms
(i.e.  ioctls to move extent data precisely as instructed and don't
break references).  Leave the kernel to physically move the data and
update metadata once userspace knows where it should go, but don't make
the kernel try to plan stuff or make decisions on its own--that doesn't
end well.  Minimizing the kernel code footprint will also make it much
easier to avoid crashes and deadlocks (or at least make them harmless
should they occur).  btrfs has had more deadlock bugs than any two
other filesystems on Linux combined, so designs that minimize the risk
of adding new deadlocks will be preferred.

I don't think you can gain much by throwing kernel memory at
predetermining extent moves, given the current on-disk structure of btrfs.
Existing sharing-preserving extent move operations in btrfs are currently
dominated by _read_ IO times--the writes are fast mostly-contiguous
flushes, while the reads are slow random accesses.  The kernel already
caches reads well; there are just a lot of them to do when doing
anything related to extent backrefs.  On small filesystems the
whole metadata fits comfortably in 10% of RAM (below default page cache
eviction thresholds), and on filesystems of that size you can already
do full balances in minutes, there is no problem to be solved there.
You can prefetch metadata into cache RAM by running TREE_SEARCH_V2 on it,
if you think that will help.
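
For completeness, such a prefetch is just an ordinary TREE_SEARCH_V2
walk (typically needs root; the simplified key advancement below is
fine for cache warming, though a strict enumerator would also handle
type and objectid wraps):

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

/* Walk one metadata tree with TREE_SEARCH_V2 so its nodes end up in the
 * page cache.  The key advancement is the simple "offset + 1" variant,
 * which is good enough for a cache-warming pass. */
static int prefetch_tree(int fd, uint64_t tree_id)
{
	const size_t bufsize = 1024 * 1024;
	struct btrfs_ioctl_search_args_v2 *args;
	struct btrfs_ioctl_search_key *key;
	unsigned long total = 0;

	args = calloc(1, sizeof(*args) + bufsize);
	if (!args)
		return -ENOMEM;
	args->buf_size = bufsize;
	key = &args->key;
	key->tree_id = tree_id;
	key->max_objectid = (uint64_t)-1;
	key->max_offset = (uint64_t)-1;
	key->max_transid = (uint64_t)-1;
	key->max_type = 255;

	for (;;) {
		struct btrfs_ioctl_search_header *sh = NULL;
		uint8_t *p = (uint8_t *)args->buf;
		uint32_t i;

		key->nr_items = 65536;
		if (ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args) < 0) {
			free(args);
			return -errno;
		}
		if (key->nr_items == 0)
			break;
		for (i = 0; i < key->nr_items; i++) {
			sh = (struct btrfs_ioctl_search_header *)p;
			p += sizeof(*sh) + sh->len;
			total++;
		}
		/* Continue just past the last item we received. */
		key->min_objectid = sh->objectid;
		key->min_type = sh->type;
		key->min_offset = sh->offset + 1;
		if (key->min_offset == 0)        /* offset wrapped */
			key->min_type++;
	}
	printf("touched %lu items in tree %llu\n",
	       total, (unsigned long long)tree_id);
	free(args);
	return 0;
}

int main(int argc, char **argv)
{
	int fd, ret;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <btrfs mount> <tree id>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	ret = prefetch_tree(fd, strtoull(argv[2], NULL, 0));
	if (ret)
		fprintf(stderr, "prefetch failed: %s\n", strerror(-ret));
	close(fd);
	return ret ? 1 : 0;
}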

Consider the XFS kernel interfaces for defrag.  It may be possible to
implement those on btrfs, then make minor modifications to xfs_fsr so it
can do defrag on btrfs (probably not efficiently--xfs_fsr likely assumes
extent splits are possible without moving data--but it's a good example
of a kernel interface designed for defrag nonetheless).

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Feature requests: online backup - defrag - change RAID level
@ 2019-09-09  3:12 webmaster
  0 siblings, 0 replies; 111+ messages in thread
From: webmaster @ 2019-09-09  3:12 UTC (permalink / raw)
  To: linux-btrfs


Hello everyone!

I have been programming for a long time (over 20 years), and I am  
quite interested in a lot of low-level stuff. But in reality I have  
never done anything related to kernels or filesystems. But I did a lot  
of assembly, C, OS stuff etc...

Looking at your project status page (at  
https://btrfs.wiki.kernel.org/index.php/Status), I must say that your  
priorities don't quite match mine. Of course, the opinions usually  
differ. It is my opinion that there are some quite essential features  
which btrfs is, unfortunately, still missing.

So here is a list of features which I would rate as very important  
(for a modern COW filesystem like btrfs is), so perhaps you can think  
about it at least a little bit.

1) Full online backup (or copy, whatever you want to call it)
btrfs backup <filesystem name> <partition name> [-f]
- backups a btrfs filesystem given by <filesystem name> to a partition  
<partition name> (with all subvolumes).

- To be performed by creating a new btrfs filesystem in the  
destination partition <partition name>, with a new GUID.
- All data from the source filesystem <filesystem name> is than copied  
to the destination partition, similar to how RAID1 works.
- The size of the destination partition must be sufficient to hold the  
used data from the source filesystem, otherwise the operation fails.  
The point is that the destination doesn't have to be as large as  
source, just sufficient to hold the data (of course, many details and  
concerns are skipped in this short proposal)
- When the operation completes, the destination partition contains a  
fully featured, mountable and unmountable btrfs filesystem, which is  
an exact copy of the source filesystem at some point in time, with all  
the snapshots and subvolumes of the source filesystem.
- There are two possible implementations about how this operation is  
to be performed, depending on whether the destination drive is slower  
than source drive(s) or not (like, when the destination is HDD and the  
source is SDD). If the source and the destination are of similar  
speed, than a RAID1-alike algorithm can be used (all writes  
simultaneously go to the source and the destination). This mode can  
also be used if the user/admin is willing to tolerate a performance  
hit for some relatively short period of time.
The second possible implementation is a bit more complex, it can be  
done by creating a temporary snapshot or by buffering all the current  
writes until they can be written to the destination drive, but this  
implementation is of lesser priority (see if you can make the RAID1  
implementation work first).

2) Sensible defrag
The defrag is currently a joke. If you use defrag than you better not  
use subvolumes/snapshots. That's... very… hard to tolerate. Quite a  
necessary feature. I mean, defrag is an operation that should be  
performed in many circumstances, and in many cases it is even  
automatically initiated. But, btrfs defrag is virtually unusable. And,  
it is unusable where it is most needed, as the presence of subvolumes  
will, predictably, increase fragmentation by quite a lot.

How to do it:
- The extents must not be unshared, but just shuffled a bit. Unsharing  
the extents is, in most situations, not tolerable.

- The defrag should work by doing a full defrag of one 'selected
subvolume' (which can be selected by the user, or it can be guessed
because the user probably wants to defrag the currently mounted
subvolume or the default subvolume). The other subvolumes should then
share data (shared extents) with the 'selected subvolume' (as much as
possible).

- If you want it even more feature-full and complicated, then you
could allow the user to specify a list of selected subvolumes, like:
subvol1, subvol2, subvol3… etc., and the defrag algorithm then defrags
subvol1 in full, then subvol2 as much as possible while not changing
subvol1 and at the same time sharing extents with subvol1, then defrags
subvol3 while not changing subvol1 and subvol2… etc.

- I think it would be wrong to use a general deduplication algorithm
for this. Instead, the information about the shared extents should be
analyzed given the starting state of the filesystem, and then the
algorithm should produce an optimal solution based on the currently
shared extents.

Deduplication is a different task.

3) Downgrade to 'single' or 'DUP' (also, general easy way to switch  
between RAID levels)

Currently, as far as I gather, the user has to do a "btrfs balance start
-dconvert=single -mconvert=single", then delete a drive, which is a
rather ridiculous sequence of operations.

Can you do something like "btrfs delete", but such that it also  
simultaneously converts to 'single', or some other chosen RAID level?

## I hope that you will consider my suggestions, and I hope that I'm
helpful (although, I guess, the short time I spent working with btrfs
and writing this mail cannot compare with the amount of work you are
putting into it). Perhaps teams sometimes need a different
perspective, an outsider's perspective, in order to better understand
the situation.

So long!


^ permalink raw reply	[flat|nested] 111+ messages in thread

end of thread, other threads:[~2019-09-18 18:00 UTC | newest]

Thread overview: 111+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-09  2:55 Feature requests: online backup - defrag - change RAID level zedlryqc
2019-09-09  3:51 ` Qu Wenruo
2019-09-09 11:25   ` zedlryqc
2019-09-09 12:18     ` Qu Wenruo
2019-09-09 12:28       ` Qu Wenruo
2019-09-09 17:11         ` webmaster
2019-09-10 17:39           ` Andrei Borzenkov
2019-09-10 22:41             ` webmaster
2019-09-09 15:29       ` Graham Cobb
2019-09-09 17:24         ` Remi Gauvin
2019-09-09 19:26         ` webmaster
2019-09-10 19:22           ` Austin S. Hemmelgarn
2019-09-10 23:32             ` webmaster
2019-09-11 12:02               ` Austin S. Hemmelgarn
2019-09-11 16:26                 ` Zygo Blaxell
2019-09-11 17:20                 ` webmaster
2019-09-11 18:19                   ` Austin S. Hemmelgarn
2019-09-11 20:01                     ` webmaster
2019-09-11 21:42                       ` Zygo Blaxell
2019-09-13  1:33                         ` General Zed
2019-09-11 21:37                     ` webmaster
2019-09-12 11:31                       ` Austin S. Hemmelgarn
2019-09-12 19:18                         ` webmaster
2019-09-12 19:44                           ` Chris Murphy
2019-09-12 21:34                             ` General Zed
2019-09-12 22:28                               ` Chris Murphy
2019-09-12 22:57                                 ` General Zed
2019-09-12 23:54                                   ` Zygo Blaxell
2019-09-13  0:26                                     ` General Zed
2019-09-13  3:12                                       ` Zygo Blaxell
2019-09-13  5:05                                         ` General Zed
2019-09-14  0:56                                           ` Zygo Blaxell
2019-09-14  1:50                                             ` General Zed
2019-09-14  4:42                                               ` Zygo Blaxell
2019-09-14  4:53                                                 ` Zygo Blaxell
2019-09-15 17:54                                                 ` General Zed
2019-09-16 22:51                                                   ` Zygo Blaxell
2019-09-17  1:03                                                     ` General Zed
2019-09-17  1:34                                                       ` General Zed
2019-09-17  1:44                                                       ` Chris Murphy
2019-09-17  4:55                                                         ` Zygo Blaxell
2019-09-17  4:19                                                       ` Zygo Blaxell
2019-09-17  3:10                                                     ` General Zed
2019-09-17  4:05                                                       ` General Zed
2019-09-14  1:56                                             ` General Zed
2019-09-13  5:22                                         ` General Zed
2019-09-13  6:16                                         ` General Zed
2019-09-13  6:58                                         ` General Zed
2019-09-13  9:25                                           ` General Zed
2019-09-13 17:02                                             ` General Zed
2019-09-14  0:59                                             ` Zygo Blaxell
2019-09-14  1:28                                               ` General Zed
2019-09-14  4:28                                                 ` Zygo Blaxell
2019-09-15 18:05                                                   ` General Zed
2019-09-16 23:05                                                     ` Zygo Blaxell
2019-09-13  7:51                                         ` General Zed
2019-09-13 11:04                                     ` Austin S. Hemmelgarn
2019-09-13 20:43                                       ` Zygo Blaxell
2019-09-14  0:20                                         ` General Zed
2019-09-14 18:29                                       ` Chris Murphy
2019-09-14 23:39                                         ` Zygo Blaxell
2019-09-13 11:09                                   ` Austin S. Hemmelgarn
2019-09-13 17:20                                     ` General Zed
2019-09-13 18:20                                       ` General Zed
2019-09-12 19:54                           ` Austin S. Hemmelgarn
2019-09-12 22:21                             ` General Zed
2019-09-13 11:53                               ` Austin S. Hemmelgarn
2019-09-13 16:54                                 ` General Zed
2019-09-13 18:29                                   ` Austin S. Hemmelgarn
2019-09-13 19:40                                     ` General Zed
2019-09-14 15:10                                       ` Jukka Larja
2019-09-12 22:47                             ` General Zed
2019-09-11 21:37                   ` Zygo Blaxell
2019-09-11 23:21                     ` webmaster
2019-09-12  0:10                       ` Remi Gauvin
2019-09-12  3:05                         ` webmaster
2019-09-12  3:30                           ` Remi Gauvin
2019-09-12  3:33                             ` Remi Gauvin
2019-09-12  5:19                       ` Zygo Blaxell
2019-09-12 21:23                         ` General Zed
2019-09-14  4:12                           ` Zygo Blaxell
2019-09-16 11:42                             ` General Zed
2019-09-17  0:49                               ` Zygo Blaxell
2019-09-17  2:30                                 ` General Zed
2019-09-17  5:30                                   ` Zygo Blaxell
2019-09-17 10:07                                     ` General Zed
2019-09-17 23:40                                       ` Zygo Blaxell
2019-09-18  4:37                                         ` General Zed
2019-09-18 18:00                                           ` Zygo Blaxell
2019-09-10 23:58             ` webmaster
2019-09-09 23:24         ` Qu Wenruo
2019-09-09 23:25         ` webmaster
2019-09-09 16:38       ` webmaster
2019-09-09 23:44         ` Qu Wenruo
2019-09-10  0:00           ` Chris Murphy
2019-09-10  0:51             ` Qu Wenruo
2019-09-10  0:06           ` webmaster
2019-09-10  0:48             ` Qu Wenruo
2019-09-10  1:24               ` webmaster
2019-09-10  1:48                 ` Qu Wenruo
2019-09-10  3:32                   ` webmaster
2019-09-10 14:14                     ` Nikolay Borisov
2019-09-10 22:35                       ` webmaster
2019-09-11  6:40                         ` Nikolay Borisov
2019-09-10 22:48                     ` webmaster
2019-09-10 23:14                   ` webmaster
2019-09-11  0:26               ` webmaster
2019-09-11  0:36                 ` webmaster
2019-09-11  1:00                 ` webmaster
2019-09-10 11:12     ` Austin S. Hemmelgarn
2019-09-09  3:12 webmaster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).