Linux-BTRFS Archive on
 help / color / Atom feed
To: "Austin S. Hemmelgarn" <>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Thu, 12 Sep 2019 15:18:41 -0400
Message-ID: <> (raw)
In-Reply-To: <>

Quoting "Austin S. Hemmelgarn" <>:

> On 2019-09-11 17:37, wrote:
>> Quoting "Austin S. Hemmelgarn" <>:
>>> On 2019-09-11 13:20, wrote:
>>>> Quoting "Austin S. Hemmelgarn" <>:
>>>>> On 2019-09-10 19:32, wrote:
>>>>>> Quoting "Austin S. Hemmelgarn" <>:
>>>>> Given this, defrag isn't willfully unsharing anything, it's just  
>>>>> a side-effect of how it works (since it's rewriting the block  
>>>>> layout of the file in-place).
>>>> The current defrag has to unshare because, as you said, because  
>>>> it is unaware of the full reflink structure. If it doesn't know  
>>>> about all reflinks, it has to unshare, there is no way around that.
>>>>> Now factor in that _any_ write will result in unsharing the  
>>>>> region being written to, rounded to the nearest full filesystem  
>>>>> block in both directions (this is mandatory, it's a side effect  
>>>>> of the copy-on-write nature of BTRFS, and is why files that  
>>>>> experience heavy internal rewrites get fragmented very heavily  
>>>>> and very quickly on BTRFS).
>>>> You mean: when defrag performs a write, the new data is unshared  
>>>> because every write is unshared? Really?
>>>> Consider there is an extent E55 shared by two files A and B. The  
>>>> defrag has to move E55 to another location. In order to do that,  
>>>> defrag creates a new extent E70. It makes it belong to file A by  
>>>> changing the reflink of extent E55 in file A to point to E70.
>>>> Now, to retain the original sharing structure, the defrag has to  
>>>> change the reflink of extent E55 in file B to point to E70. You  
>>>> are telling me this is not possible? Bullshit!
>>>> Please explain to me how this 'defrag has to unshare' story of  
>>>> yours isn't an intentional attempt to mislead me.
>>> As mentioned in the previous email, we actually did have a  
>>> (mostly) working reflink-aware defrag a few years back.  It got  
>>> removed because it had serious performance issues.  Note that  
>>> we're not talking a few seconds of extra time to defrag a full  
>>> tree here, we're talking double-digit _minutes_ of extra time to  
>>> defrag a moderate sized (low triple digit GB) subvolume with  
>>> dozens of snapshots, _if you were lucky_ (if you weren't, you  
>>> would be looking at potentially multiple _hours_ of runtime for  
>>> the defrag).  The performance scaled inversely proportionate to  
>>> the number of reflinks involved and the total amount of data in  
>>> the subvolume being defragmented, and was pretty bad even in the  
>>> case of only a couple of snapshots.
>> You cannot ever make the worst program, because an even worse  
>> program can be made by slowing down the original by a factor of 2.
>> So, you had a badly implemented defrag. At least you got some  
>> experience. Let's see what went wrong.
>>> Ultimately, there are a couple of issues at play here:
>>> * Online defrag has to maintain consistency during operation.  The  
>>> current implementation does this by rewriting the regions being  
>>> defragmented (which causes them to become a single new extent  
>>> (most of the time)), which avoids a whole lot of otherwise  
>>> complicated logic required to make sure things happen correctly,  
>>> and also means that only the file being operated on is impacted  
>>> and only the parts being modified need to be protected against  
>>> concurrent writes.  Properly handling reflinks means that _every_  
>>> file that shares some part of an extent with the file being  
>>> operated on needs to have the reflinked regions locked for the  
>>> defrag operation, which has a huge impact on performance. Using  
>>> your example, the update to E55 in both files A and B has to  
>>> happen as part of the same commit, which can contain no other  
>>> writes in that region of the file, otherwise you run the risk of  
>>> losing writes to file B that occur while file A is being  
>>> defragmented.
>> Nah. I think there is a workaround. You can first (atomically)  
>> update A, then whatever, then you can update B later. I know, your  
>> yelling "what if E55 gets updated in B". Doesn't matter. The defrag  
>> continues later by searching for reflink to E55 in B. Then it  
>> checks the data contained in E55. If the data matches the E70, then  
>> it can safely update the reflink in B. Or the defrag can just  
>> verify that neither E55 nor E70 have been written to in the  
>> meantime. That means they still have the same data.

> So, IOW, you don't care if the total space used by the data is  
> instantaneously larger than what you started with?  That seems to be  
> at odds with your previous statements, but OK, if we allow for that  
> then this is indeed a non-issue.

It is normal and common for defrag operation to use some disk space  
while it is running. I estimate that a reasonable limit would be to  
use up to 1% of total partition size. So, if a partition size is 100  
GB, the defrag can use 1 GB. Lets call this "defrag operation space".

The defrag should, when started, verify that there is "sufficient free  
space" on the partition. In the case that there is no sufficient free  
space, the defrag should output the message to the user and abort. The  
size of "sufficient free space" must be larger than the "defrag  
operation space". I would estimate that a good limit would be 2% of  
the partition size. "defrag operation space" is a part of "sufficient  
free space" while defrag operation is in progress.

If, during defrag operation, sufficient free space drops below 2%, the  
defrag should output a message and abort. Another possibility is for  
defrag to pause until the user frees some disk space, but this is not  
common in other defrag implementations AFAIK.

>>> It's not horrible when it's just a small region in two files, but  
>>> it becomes a big issue when dealing with lots of files and/or  
>>> particularly large extents (extents in BTRFS can get into the GB  
>>> range in terms of size when dealing with really big files).
>> You must just split large extents in a smart way. So, in the  
>> beginning, the defrag can split large extents (2GB) into smaller  
>> ones (32MB) to facilitate more responsive and easier defrag.
>> If you have lots of files, update them one-by one. It is possible.  
>> Or you can update in big batches. Whatever is faster.

> Neither will solve this though.  Large numbers of files are an issue  
> because the operation is expensive and has to be done on each file,  
> not because the number of files somehow makes the operation more  
> espensive. It's O(n) relative to files, not higher time complexity.

I would say that updating in big batches helps a lot, to the point  
that it gets almost as fast as defragging any other file system. What  
defrag needs to do is to write a big bunch of defragged file (data)  
extents to the disk, and then update the b-trees. What happens is that  
many of the updates to the b-trees would fall into the same disk  
sector/extent, so instead of many writes there will be just one write.

Here is the general outline for implementation:
     - write a big bunch of defragged file extents to disk
         - a minimal set of updates of the b-trees that cannot be  
delayed is performed (this is nothing or almost nothing in most  
         - put the rest of required updates of b-trees into "pending  
operations buffer"
     - analyze the "pending operations buffer", and find out  
(approximately) the biggest part of it that can be flushed out by  
doing minimal number of disk writes
         - flush out that part of "pending operations buffer"
     - repeat

>> The point is that the defrag can keep a buffer of a "pending  
>> operations". Pending operations are those that should be performed  
>> in order to keep the original sharing structure. If the defrag gets  
>> interrupted, then files in "pending operations" will be unshared.  
>> But this should really be some important and urgent interrupt, as  
>> the "pending operations" buffer needs at most a second or two to  
>> complete its operations.

> Depending on the exact situation, it can take well more than a few  
> seconds to complete stuff. Especially if there are lots of reflinks.

Nope. You are quite wrong there.
In the worst case, the "pending operations buffer" will update (write  
to disk) all the b-trees. So, the upper limit on time to flush the  
"pending operations buffer" equals the time to write the entire b-tree  
structure to the disk (into new extents). I estimate that takes at  
most a few seconds.

>>> * Reflinks can reference partial extents.  This means, ultimately,  
>>> that you may end up having to split extents in odd ways during  
>>> defrag if you want to preserve reflinks, and might have to split  
>>> extents _elsewhere_ that are only tangentially related to the  
>>> region being defragmented. See the example in my previous email  
>>> for a case like this, maintaining the shared regions as being  
>>> shared when you defragment either file to a single extent will  
>>> require splitting extents in the other file (in either case,  
>>> whichever file you don't defragment to a single extent will end up  
>>> having 7 extents if you try to force the one that's been  
>>> defragmented to be the canonical version).  Once you consider that  
>>> a given extent can have multiple ranges reflinked from multiple  
>>> other locations, it gets even more complicated.
>> I think that this problem can be solved, and that it can be solved  
>> perfectly (the result is a perfectly-defragmented file). But, if it  
>> is so hard to do, just skip those problematic extents in initial  
>> version of defrag.
>> Ultimately, in the super-duper defrag, those partially-referenced  
>> extents should be split up by defrag.
>>> * If you choose to just not handle the above point by not letting  
>>> defrag split extents, you put a hard lower limit on the amount of  
>>> fragmentation present in a file if you want to preserve reflinks.   
>>> IOW, you can't defragment files past a certain point.  If we go  
>>> this way, neither of the two files in the example from my previous  
>>> email could be defragmented any further than they already are,  
>>> because doing so would require splitting extents.
>> Oh, you're reading my thoughts. That's good.
>> Initial implementation of defrag might be not-so-perfect. It would  
>> still be better than the current defrag.
>> This is not a one-way street. Handling of partially-used extents  
>> can be improved in later versions.
>>> * Determining all the reflinks to a given region of a given extent  
>>> is not a cheap operation, and the information may immediately be  
>>> stale (because an operation right after you fetch the info might  
>>> change things).  We could work around this by locking the extent  
>>> somehow, but doing so would be expensive because you would have to  
>>> hold the lock for the entire defrag operation.
>> Instead, you have to create a hook in every function that updates  
>> the reflink structure or extents (for exaple, write-to-file  
>> operation). So, when a reflink gets changed, the defrag is  
>> immediately notified about this. That way the defrag can keep its  
>> data about reflinks in-sync with the filesystem.

> This doesn't get around the fact that it's still an expensive  
> operation to enumerate all the reflinks for a given region of a file  
> or extent.

No, you are wrong.

In order to enumerate all the reflinks in a region, the defrag needs  
to have another array, which is also kept in memory and in sync with  
the filesystem. It is the easiest to divide the disk into regions of  
equal size, where each region is a few MB large. Lets call this array  
"regions-to-extents" array. This array doesn't need to be associative,  
it is a plain array.
This in-memory array links regions of disk to extents that are in the  
region. The array in initialized when defrag starts.

This array makes the operation of finding all extents of a region  
extremely fast.

> It also allows a very real possibility of a user functionally  
> delaying the defrag operation indefinitely (by triggering a  
> continuous stream of operations that would cause reflink changes for  
> a file being operated on by defrag) if not implemented very carefully.

Yes, if a user does something like that, the defrag can be paused or  
even aborted. That is normal.

There are many ways around this problem, but it really doesn't matter,  
those are just details. The initial version of defrag can just abort.  
The more mature versions of defrag can have a better handling of this  

  reply index

Thread overview: 111+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-09  2:55 zedlryqc
2019-09-09  3:51 ` Qu Wenruo
2019-09-09 11:25   ` zedlryqc
2019-09-09 12:18     ` Qu Wenruo
2019-09-09 12:28       ` Qu Wenruo
2019-09-09 17:11         ` webmaster
2019-09-10 17:39           ` Andrei Borzenkov
2019-09-10 22:41             ` webmaster
2019-09-09 15:29       ` Graham Cobb
2019-09-09 17:24         ` Remi Gauvin
2019-09-09 19:26         ` webmaster
2019-09-10 19:22           ` Austin S. Hemmelgarn
2019-09-10 23:32             ` webmaster
2019-09-11 12:02               ` Austin S. Hemmelgarn
2019-09-11 16:26                 ` Zygo Blaxell
2019-09-11 17:20                 ` webmaster
2019-09-11 18:19                   ` Austin S. Hemmelgarn
2019-09-11 20:01                     ` webmaster
2019-09-11 21:42                       ` Zygo Blaxell
2019-09-13  1:33                         ` General Zed
2019-09-11 21:37                     ` webmaster
2019-09-12 11:31                       ` Austin S. Hemmelgarn
2019-09-12 19:18                         ` webmaster [this message]
2019-09-12 19:44                           ` Chris Murphy
2019-09-12 21:34                             ` General Zed
2019-09-12 22:28                               ` Chris Murphy
2019-09-12 22:57                                 ` General Zed
2019-09-12 23:54                                   ` Zygo Blaxell
2019-09-13  0:26                                     ` General Zed
2019-09-13  3:12                                       ` Zygo Blaxell
2019-09-13  5:05                                         ` General Zed
2019-09-14  0:56                                           ` Zygo Blaxell
2019-09-14  1:50                                             ` General Zed
2019-09-14  4:42                                               ` Zygo Blaxell
2019-09-14  4:53                                                 ` Zygo Blaxell
2019-09-15 17:54                                                 ` General Zed
2019-09-16 22:51                                                   ` Zygo Blaxell
2019-09-17  1:03                                                     ` General Zed
2019-09-17  1:34                                                       ` General Zed
2019-09-17  1:44                                                       ` Chris Murphy
2019-09-17  4:55                                                         ` Zygo Blaxell
2019-09-17  4:19                                                       ` Zygo Blaxell
2019-09-17  3:10                                                     ` General Zed
2019-09-17  4:05                                                       ` General Zed
2019-09-14  1:56                                             ` General Zed
2019-09-13  5:22                                         ` General Zed
2019-09-13  6:16                                         ` General Zed
2019-09-13  6:58                                         ` General Zed
2019-09-13  9:25                                           ` General Zed
2019-09-13 17:02                                             ` General Zed
2019-09-14  0:59                                             ` Zygo Blaxell
2019-09-14  1:28                                               ` General Zed
2019-09-14  4:28                                                 ` Zygo Blaxell
2019-09-15 18:05                                                   ` General Zed
2019-09-16 23:05                                                     ` Zygo Blaxell
2019-09-13  7:51                                         ` General Zed
2019-09-13 11:04                                     ` Austin S. Hemmelgarn
2019-09-13 20:43                                       ` Zygo Blaxell
2019-09-14  0:20                                         ` General Zed
2019-09-14 18:29                                       ` Chris Murphy
2019-09-14 23:39                                         ` Zygo Blaxell
2019-09-13 11:09                                   ` Austin S. Hemmelgarn
2019-09-13 17:20                                     ` General Zed
2019-09-13 18:20                                       ` General Zed
2019-09-12 19:54                           ` Austin S. Hemmelgarn
2019-09-12 22:21                             ` General Zed
2019-09-13 11:53                               ` Austin S. Hemmelgarn
2019-09-13 16:54                                 ` General Zed
2019-09-13 18:29                                   ` Austin S. Hemmelgarn
2019-09-13 19:40                                     ` General Zed
2019-09-14 15:10                                       ` Jukka Larja
2019-09-12 22:47                             ` General Zed
2019-09-11 21:37                   ` Zygo Blaxell
2019-09-11 23:21                     ` webmaster
2019-09-12  0:10                       ` Remi Gauvin
2019-09-12  3:05                         ` webmaster
2019-09-12  3:30                           ` Remi Gauvin
2019-09-12  3:33                             ` Remi Gauvin
2019-09-12  5:19                       ` Zygo Blaxell
2019-09-12 21:23                         ` General Zed
2019-09-14  4:12                           ` Zygo Blaxell
2019-09-16 11:42                             ` General Zed
2019-09-17  0:49                               ` Zygo Blaxell
2019-09-17  2:30                                 ` General Zed
2019-09-17  5:30                                   ` Zygo Blaxell
2019-09-17 10:07                                     ` General Zed
2019-09-17 23:40                                       ` Zygo Blaxell
2019-09-18  4:37                                         ` General Zed
2019-09-18 18:00                                           ` Zygo Blaxell
2019-09-10 23:58             ` webmaster
2019-09-09 23:24         ` Qu Wenruo
2019-09-09 23:25         ` webmaster
2019-09-09 16:38       ` webmaster
2019-09-09 23:44         ` Qu Wenruo
2019-09-10  0:00           ` Chris Murphy
2019-09-10  0:51             ` Qu Wenruo
2019-09-10  0:06           ` webmaster
2019-09-10  0:48             ` Qu Wenruo
2019-09-10  1:24               ` webmaster
2019-09-10  1:48                 ` Qu Wenruo
2019-09-10  3:32                   ` webmaster
2019-09-10 14:14                     ` Nikolay Borisov
2019-09-10 22:35                       ` webmaster
2019-09-11  6:40                         ` Nikolay Borisov
2019-09-10 22:48                     ` webmaster
2019-09-10 23:14                   ` webmaster
2019-09-11  0:26               ` webmaster
2019-09-11  0:36                 ` webmaster
2019-09-11  1:00                 ` webmaster
2019-09-10 11:12     ` Austin S. Hemmelgarn
2019-09-09  3:12 webmaster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-BTRFS Archive on

Archives are clonable:
	git clone --mirror linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ \
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone