Linux-BTRFS Archive on
 help / color / Atom feed
From: General Zed <>
To: "Austin S. Hemmelgarn" <>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Thu, 12 Sep 2019 18:21:59 -0400
Message-ID: <> (raw)
In-Reply-To: <>

Quoting "Austin S. Hemmelgarn" <>:

> On 2019-09-12 15:18, wrote:
>> Quoting "Austin S. Hemmelgarn" <>:
>>> On 2019-09-11 17:37, wrote:
>>>> Quoting "Austin S. Hemmelgarn" <>:
>>>>> On 2019-09-11 13:20, wrote:
>>>>>> Quoting "Austin S. Hemmelgarn" <>:
>>>>>>> On 2019-09-10 19:32, wrote:
>>>>>>>> Quoting "Austin S. Hemmelgarn" <>:
>>>>>>> Given this, defrag isn't willfully unsharing anything, it's  
>>>>>>> just a side-effect of how it works (since it's rewriting the  
>>>>>>> block layout of the file in-place).
>>>>>> The current defrag has to unshare because, as you said, because  
>>>>>> it is unaware of the full reflink structure. If it doesn't know  
>>>>>> about all reflinks, it has to unshare, there is no way around  
>>>>>> that.
>>>>>>> Now factor in that _any_ write will result in unsharing the  
>>>>>>> region being written to, rounded to the nearest full  
>>>>>>> filesystem block in both directions (this is mandatory, it's a  
>>>>>>> side effect of the copy-on-write nature of BTRFS, and is why  
>>>>>>> files that experience heavy internal rewrites get fragmented  
>>>>>>> very heavily and very quickly on BTRFS).
>>>>>> You mean: when defrag performs a write, the new data is  
>>>>>> unshared because every write is unshared? Really?
>>>>>> Consider there is an extent E55 shared by two files A and B.  
>>>>>> The defrag has to move E55 to another location. In order to do  
>>>>>> that, defrag creates a new extent E70. It makes it belong to  
>>>>>> file A by changing the reflink of extent E55 in file A to point  
>>>>>> to E70.
>>>>>> Now, to retain the original sharing structure, the defrag has  
>>>>>> to change the reflink of extent E55 in file B to point to E70.  
>>>>>> You are telling me this is not possible? Bullshit!
>>>>>> Please explain to me how this 'defrag has to unshare' story of  
>>>>>> yours isn't an intentional attempt to mislead me.
>>>>> As mentioned in the previous email, we actually did have a  
>>>>> (mostly) working reflink-aware defrag a few years back.  It got  
>>>>> removed because it had serious performance issues.  Note that  
>>>>> we're not talking a few seconds of extra time to defrag a full  
>>>>> tree here, we're talking double-digit _minutes_ of extra time to  
>>>>> defrag a moderate sized (low triple digit GB) subvolume with  
>>>>> dozens of snapshots, _if you were lucky_ (if you weren't, you  
>>>>> would be looking at potentially multiple _hours_ of runtime for  
>>>>> the defrag).  The performance scaled inversely proportionate to  
>>>>> the number of reflinks involved and the total amount of data in  
>>>>> the subvolume being defragmented, and was pretty bad even in the  
>>>>> case of only a couple of snapshots.
>>>> You cannot ever make the worst program, because an even worse  
>>>> program can be made by slowing down the original by a factor of 2.
>>>> So, you had a badly implemented defrag. At least you got some  
>>>> experience. Let's see what went wrong.
>>>>> Ultimately, there are a couple of issues at play here:
>>>>> * Online defrag has to maintain consistency during operation.   
>>>>> The current implementation does this by rewriting the regions  
>>>>> being defragmented (which causes them to become a single new  
>>>>> extent (most of the time)), which avoids a whole lot of  
>>>>> otherwise complicated logic required to make sure things happen  
>>>>> correctly, and also means that only the file being operated on  
>>>>> is impacted and only the parts being modified need to be  
>>>>> protected against concurrent writes.  Properly handling reflinks  
>>>>> means that _every_ file that shares some part of an extent with  
>>>>> the file being operated on needs to have the reflinked regions  
>>>>> locked for the defrag operation, which has a huge impact on  
>>>>> performance. Using your example, the update to E55 in both files  
>>>>> A and B has to happen as part of the same commit, which can  
>>>>> contain no other writes in that region of the file, otherwise  
>>>>> you run the risk of losing writes to file B that occur while  
>>>>> file A is being defragmented.
>>>> Nah. I think there is a workaround. You can first (atomically)  
>>>> update A, then whatever, then you can update B later. I know,  
>>>> your yelling "what if E55 gets updated in B". Doesn't matter. The  
>>>> defrag continues later by searching for reflink to E55 in B. Then  
>>>> it checks the data contained in E55. If the data matches the E70,  
>>>> then it can safely update the reflink in B. Or the defrag can  
>>>> just verify that neither E55 nor E70 have been written to in the  
>>>> meantime. That means they still have the same data.
>>> So, IOW, you don't care if the total space used by the data is  
>>> instantaneously larger than what you started with?  That seems to  
>>> be at odds with your previous statements, but OK, if we allow for  
>>> that then this is indeed a non-issue.
>> It is normal and common for defrag operation to use some disk space  
>> while it is running. I estimate that a reasonable limit would be to  
>> use up to 1% of total partition size. So, if a partition size is  
>> 100 GB, the defrag can use 1 GB. Lets call this "defrag operation  
>> space".
>> The defrag should, when started, verify that there is "sufficient  
>> free space" on the partition. In the case that there is no  
>> sufficient free space, the defrag should output the message to the  
>> user and abort. The size of "sufficient free space" must be larger  
>> than the "defrag operation space". I would estimate that a good  
>> limit would be 2% of the partition size. "defrag operation space"  
>> is a part of "sufficient free space" while defrag operation is in  
>> progress.
>> If, during defrag operation, sufficient free space drops below 2%,  
>> the defrag should output a message and abort. Another possibility  
>> is for defrag to pause until the user frees some disk space, but  
>> this is not common in other defrag implementations AFAIK.
>>>>> It's not horrible when it's just a small region in two files,  
>>>>> but it becomes a big issue when dealing with lots of files  
>>>>> and/or particularly large extents (extents in BTRFS can get into  
>>>>> the GB range in terms of size when dealing with really big files).
>>>> You must just split large extents in a smart way. So, in the  
>>>> beginning, the defrag can split large extents (2GB) into smaller  
>>>> ones (32MB) to facilitate more responsive and easier defrag.
>>>> If you have lots of files, update them one-by one. It is  
>>>> possible. Or you can update in big batches. Whatever is faster.
>>> Neither will solve this though.  Large numbers of files are an  
>>> issue because the operation is expensive and has to be done on  
>>> each file, not because the number of files somehow makes the  
>>> operation more espensive. It's O(n) relative to files, not higher  
>>> time complexity.
>> I would say that updating in big batches helps a lot, to the point  
>> that it gets almost as fast as defragging any other file system.  
>> What defrag needs to do is to write a big bunch of defragged file  
>> (data) extents to the disk, and then update the b-trees. What  
>> happens is that many of the updates to the b-trees would fall into  
>> the same disk sector/extent, so instead of many writes there will  
>> be just one write.
>> Here is the general outline for implementation:
>>     - write a big bunch of defragged file extents to disk
>>         - a minimal set of updates of the b-trees that cannot be  
>> delayed is performed (this is nothing or almost nothing in most  
>> circumstances)
>>         - put the rest of required updates of b-trees into "pending  
>> operations buffer"
>>     - analyze the "pending operations buffer", and find out  
>> (approximately) the biggest part of it that can be flushed out by  
>> doing minimal number of disk writes
>>         - flush out that part of "pending operations buffer"
>>     - repeat

> It helps, but you still can't get around having to recompute the new  
> tree state, and that is going to take time proportionate to the  
> number of nodes that need to change, which in turn is proportionate  
> to the number of files.

Yes, but that is just a computation. The defrag performance mostly  
depends on minimizing disk I/O operations, not on computations.

In the past many good and fast defrag computation algorithms have been  
produced, and I don't see any reason why this project wouldn't be also  
able to create such a good algorithm.

>>>> The point is that the defrag can keep a buffer of a "pending  
>>>> operations". Pending operations are those that should be  
>>>> performed in order to keep the original sharing structure. If the  
>>>> defrag gets interrupted, then files in "pending operations" will  
>>>> be unshared. But this should really be some important and urgent  
>>>> interrupt, as the "pending operations" buffer needs at most a  
>>>> second or two to complete its operations.
>>> Depending on the exact situation, it can take well more than a few  
>>> seconds to complete stuff. Especially if there are lots of reflinks.
>> Nope. You are quite wrong there.
>> In the worst case, the "pending operations buffer" will update  
>> (write to disk) all the b-trees. So, the upper limit on time to  
>> flush the "pending operations buffer" equals the time to write the  
>> entire b-tree structure to the disk (into new extents). I estimate  
>> that takes at most a few seconds.

> So what you're talking about is journaling the computed state of  
> defrag operations.  That shouldn't be too bad (as long as it's done  
> in memory instead of on-disk) if you batch the computations  
> properly.  I thought you meant having a buffer of what operations to  
> do, and then computing them on-the-fly (which would have significant  
> overhead)

Looks close to what I was thinking. Soon we might be able to  
communicate. I'm not sure what you mean by "journaling the computed  
state of defrag operations". Maybe it doesn't matter.

What happens is that file (extent) data is first written to disk  
(defragmented), but b-tree is not immediately updated. It doesn't have  
to be. Even if there is a power loss, nothing happens.

So, the changes that should be done to the b-trees are put into  
pending-operations-buffer. When a lot of file (extent) data is written  
to disk, such that defrag-operation-space (1 GB) is close to being  
exhausted, the pending-operations-buffer is examined in order to  
attempt to free as much of defrag-operation-space as possible. The  
simplest algorithm is to flush the entire pending-operations-buffer at  
once. This reduces the number of writes that update the b-trees  
because many changes to the b-trees fall into the same or neighbouring  
disk sectors.

>>>>> * Reflinks can reference partial extents.  This means,  
>>>>> ultimately, that you may end up having to split extents in odd  
>>>>> ways during defrag if you want to preserve reflinks, and might  
>>>>> have to split extents _elsewhere_ that are only tangentially  
>>>>> related to the region being defragmented. See the example in my  
>>>>> previous email for a case like this, maintaining the shared  
>>>>> regions as being shared when you defragment either file to a  
>>>>> single extent will require splitting extents in the other file  
>>>>> (in either case, whichever file you don't defragment to a single  
>>>>> extent will end up having 7 extents if you try to force the one  
>>>>> that's been defragmented to be the canonical version).  Once you  
>>>>> consider that a given extent can have multiple ranges reflinked  
>>>>> from multiple other locations, it gets even more complicated.
>>>> I think that this problem can be solved, and that it can be  
>>>> solved perfectly (the result is a perfectly-defragmented file).  
>>>> But, if it is so hard to do, just skip those problematic extents  
>>>> in initial version of defrag.
>>>> Ultimately, in the super-duper defrag, those partially-referenced  
>>>> extents should be split up by defrag.
>>>>> * If you choose to just not handle the above point by not  
>>>>> letting defrag split extents, you put a hard lower limit on the  
>>>>> amount of fragmentation present in a file if you want to  
>>>>> preserve reflinks.  IOW, you can't defragment files past a  
>>>>> certain point.  If we go this way, neither of the two files in  
>>>>> the example from my previous email could be defragmented any  
>>>>> further than they already are, because doing so would require  
>>>>> splitting extents.
>>>> Oh, you're reading my thoughts. That's good.
>>>> Initial implementation of defrag might be not-so-perfect. It  
>>>> would still be better than the current defrag.
>>>> This is not a one-way street. Handling of partially-used extents  
>>>> can be improved in later versions.
>>>>> * Determining all the reflinks to a given region of a given  
>>>>> extent is not a cheap operation, and the information may  
>>>>> immediately be stale (because an operation right after you fetch  
>>>>> the info might change things).  We could work around this by  
>>>>> locking the extent somehow, but doing so would be expensive  
>>>>> because you would have to hold the lock for the entire defrag  
>>>>> operation.
>>>> Instead, you have to create a hook in every function that updates  
>>>> the reflink structure or extents (for exaple, write-to-file  
>>>> operation). So, when a reflink gets changed, the defrag is  
>>>> immediately notified about this. That way the defrag can keep its  
>>>> data about reflinks in-sync with the filesystem.
>>> This doesn't get around the fact that it's still an expensive  
>>> operation to enumerate all the reflinks for a given region of a  
>>> file or extent.
>> No, you are wrong.
>> In order to enumerate all the reflinks in a region, the defrag  
>> needs to have another array, which is also kept in memory and in  
>> sync with the filesystem. It is the easiest to divide the disk into  
>> regions of equal size, where each region is a few MB large. Lets  
>> call this array "regions-to-extents" array. This array doesn't need  
>> to be associative, it is a plain array.
>> This in-memory array links regions of disk to extents that are in  
>> the region. The array in initialized when defrag starts.
>> This array makes the operation of finding all extents of a region  
>> extremely fast.
> That has two issues:
> * That's going to be a _lot_ of memory.  You still need to be able  
> to defragment big (dozens plus TB) arrays without needing multiple  
> GB of RAM just for the defrag operation, otherwise it's not  
> realistically useful (remember, it was big arrays that had issues  
> with the old reflink-aware defrag too).

Ok, but let's get some calculations there. If regions are 4 MB in  
size, the region-extents array for an 8 TB partition would have 2  
million entries. If entries average 64 bytes, that would be:

  - a total of 128 MB memory for an 8 TB partition.

Of course, I'm guessing a lot of numbers there, but it should be doable.

> * You still have to populate the array in the first place.  A sane  
> implementation wouldn't be keeping it in memory even when defrag is  
> not running (no way is anybody going to tolerate even dozens of MB  
> of memory overhead for this), so you're not going to get around the  
> need to enumerate all the reflinks for a file at least once (during  
> startup, or when starting to process that file), so you're just  
> moving the overhead around instead of eliminating it.

Yes, when the defrag starts, the entire b-tree structure is examined  
in order for region-extents array and extents-backref associative  
array to be populated.

Of course, those two arrays exist only during defrag operation. When  
defrag completes, those arrays are deallocated.

>>> It also allows a very real possibility of a user functionally  
>>> delaying the defrag operation indefinitely (by triggering a  
>>> continuous stream of operations that would cause reflink changes  
>>> for a file being operated on by defrag) if not implemented very  
>>> carefully.
>> Yes, if a user does something like that, the defrag can be paused  
>> or even aborted. That is normal.
> Not really.  Most defrag implementations either avoid files that  
> could reasonably be written to, or freeze writes to the file they're  
> operating on, or in some other way just sidestep the issue without  
> delaying the defragmentation process.
>> There are many ways around this problem, but it really doesn't  
>> matter, those are just details. The initial version of defrag can  
>> just abort. The more mature versions of defrag can have a better  
>> handling of this problem.

> Details like this are the deciding factor for whether something is  
> sanely usable in certain use cases, as you have yourself found out  
> (for a lot of users, the fact that defrag can unshare extents is  
> 'just a detail' that's not worth worrying about).

I wouldn't agree there.

Not every issue is equal. Some issues are more important, some are  
trivial, some are tolerable etc...

The defrag is usually allowed to abort. It can easily be restarted  
later. Workaround: You can make a defrag-supervisor program, which  
starts a defrag, and if defrag aborts then it is restarted after some  
(configurable) amount of time.

On the other hand, unsharing is not easy to get undone.

So, those issues are not equals.

  reply index

Thread overview: 111+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-09  2:55 zedlryqc
2019-09-09  3:51 ` Qu Wenruo
2019-09-09 11:25   ` zedlryqc
2019-09-09 12:18     ` Qu Wenruo
2019-09-09 12:28       ` Qu Wenruo
2019-09-09 17:11         ` webmaster
2019-09-10 17:39           ` Andrei Borzenkov
2019-09-10 22:41             ` webmaster
2019-09-09 15:29       ` Graham Cobb
2019-09-09 17:24         ` Remi Gauvin
2019-09-09 19:26         ` webmaster
2019-09-10 19:22           ` Austin S. Hemmelgarn
2019-09-10 23:32             ` webmaster
2019-09-11 12:02               ` Austin S. Hemmelgarn
2019-09-11 16:26                 ` Zygo Blaxell
2019-09-11 17:20                 ` webmaster
2019-09-11 18:19                   ` Austin S. Hemmelgarn
2019-09-11 20:01                     ` webmaster
2019-09-11 21:42                       ` Zygo Blaxell
2019-09-13  1:33                         ` General Zed
2019-09-11 21:37                     ` webmaster
2019-09-12 11:31                       ` Austin S. Hemmelgarn
2019-09-12 19:18                         ` webmaster
2019-09-12 19:44                           ` Chris Murphy
2019-09-12 21:34                             ` General Zed
2019-09-12 22:28                               ` Chris Murphy
2019-09-12 22:57                                 ` General Zed
2019-09-12 23:54                                   ` Zygo Blaxell
2019-09-13  0:26                                     ` General Zed
2019-09-13  3:12                                       ` Zygo Blaxell
2019-09-13  5:05                                         ` General Zed
2019-09-14  0:56                                           ` Zygo Blaxell
2019-09-14  1:50                                             ` General Zed
2019-09-14  4:42                                               ` Zygo Blaxell
2019-09-14  4:53                                                 ` Zygo Blaxell
2019-09-15 17:54                                                 ` General Zed
2019-09-16 22:51                                                   ` Zygo Blaxell
2019-09-17  1:03                                                     ` General Zed
2019-09-17  1:34                                                       ` General Zed
2019-09-17  1:44                                                       ` Chris Murphy
2019-09-17  4:55                                                         ` Zygo Blaxell
2019-09-17  4:19                                                       ` Zygo Blaxell
2019-09-17  3:10                                                     ` General Zed
2019-09-17  4:05                                                       ` General Zed
2019-09-14  1:56                                             ` General Zed
2019-09-13  5:22                                         ` General Zed
2019-09-13  6:16                                         ` General Zed
2019-09-13  6:58                                         ` General Zed
2019-09-13  9:25                                           ` General Zed
2019-09-13 17:02                                             ` General Zed
2019-09-14  0:59                                             ` Zygo Blaxell
2019-09-14  1:28                                               ` General Zed
2019-09-14  4:28                                                 ` Zygo Blaxell
2019-09-15 18:05                                                   ` General Zed
2019-09-16 23:05                                                     ` Zygo Blaxell
2019-09-13  7:51                                         ` General Zed
2019-09-13 11:04                                     ` Austin S. Hemmelgarn
2019-09-13 20:43                                       ` Zygo Blaxell
2019-09-14  0:20                                         ` General Zed
2019-09-14 18:29                                       ` Chris Murphy
2019-09-14 23:39                                         ` Zygo Blaxell
2019-09-13 11:09                                   ` Austin S. Hemmelgarn
2019-09-13 17:20                                     ` General Zed
2019-09-13 18:20                                       ` General Zed
2019-09-12 19:54                           ` Austin S. Hemmelgarn
2019-09-12 22:21                             ` General Zed [this message]
2019-09-13 11:53                               ` Austin S. Hemmelgarn
2019-09-13 16:54                                 ` General Zed
2019-09-13 18:29                                   ` Austin S. Hemmelgarn
2019-09-13 19:40                                     ` General Zed
2019-09-14 15:10                                       ` Jukka Larja
2019-09-12 22:47                             ` General Zed
2019-09-11 21:37                   ` Zygo Blaxell
2019-09-11 23:21                     ` webmaster
2019-09-12  0:10                       ` Remi Gauvin
2019-09-12  3:05                         ` webmaster
2019-09-12  3:30                           ` Remi Gauvin
2019-09-12  3:33                             ` Remi Gauvin
2019-09-12  5:19                       ` Zygo Blaxell
2019-09-12 21:23                         ` General Zed
2019-09-14  4:12                           ` Zygo Blaxell
2019-09-16 11:42                             ` General Zed
2019-09-17  0:49                               ` Zygo Blaxell
2019-09-17  2:30                                 ` General Zed
2019-09-17  5:30                                   ` Zygo Blaxell
2019-09-17 10:07                                     ` General Zed
2019-09-17 23:40                                       ` Zygo Blaxell
2019-09-18  4:37                                         ` General Zed
2019-09-18 18:00                                           ` Zygo Blaxell
2019-09-10 23:58             ` webmaster
2019-09-09 23:24         ` Qu Wenruo
2019-09-09 23:25         ` webmaster
2019-09-09 16:38       ` webmaster
2019-09-09 23:44         ` Qu Wenruo
2019-09-10  0:00           ` Chris Murphy
2019-09-10  0:51             ` Qu Wenruo
2019-09-10  0:06           ` webmaster
2019-09-10  0:48             ` Qu Wenruo
2019-09-10  1:24               ` webmaster
2019-09-10  1:48                 ` Qu Wenruo
2019-09-10  3:32                   ` webmaster
2019-09-10 14:14                     ` Nikolay Borisov
2019-09-10 22:35                       ` webmaster
2019-09-11  6:40                         ` Nikolay Borisov
2019-09-10 22:48                     ` webmaster
2019-09-10 23:14                   ` webmaster
2019-09-11  0:26               ` webmaster
2019-09-11  0:36                 ` webmaster
2019-09-11  1:00                 ` webmaster
2019-09-10 11:12     ` Austin S. Hemmelgarn
2019-09-09  3:12 webmaster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-BTRFS Archive on

Archives are clonable:
	git clone --mirror linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ \
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone