From: General Zed <general-zed@zedlx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Chris Murphy <lists@colorremedies.com>,
"Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Fri, 13 Sep 2019 13:02:36 -0400
Message-ID: <20190913130236.Horde.J6Skdjml2LO57Kn1UxWdtaA@server53.web-hosting.com>
In-Reply-To: <20190913052520.Horde.TXpSDI4drVhkIzGxF7ZVMA8@server53.web-hosting.com>
Quoting General Zed <general-zed@zedlx.com>:
> Quoting General Zed <general-zed@zedlx.com>:
>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>>> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>>>
>>>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>>
>>>>> On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>>>>>>
>>>>>> At worst, it just has to completely write out "all metadata", all
>>>>>> the way up to the super. It needs to be done just once, because
>>>>>> what's the point of writing it 10 times over? Then, the super is
>>>>>> updated as the final commit.
>>>>>
>>>>> This is kind of a silly discussion. The biggest extent possible on
>>>>> btrfs is 128MB, and the incremental gains of forcing 128MB extents to
>>>>> be consecutive are negligible. If you're defragging a 10GB file, you're
>>>>> just going to end up doing 80 separate defrag operations.
>>>>
>>>> Ok, then the max extent is 128 MB, that's fine. Someone here previously
>>>> said that it is 2 GB, so he misinformed me (in order to further his
>>>> false argument).
>>>
>>> If the 128MB limit is removed, you then hit the block group size limit,
>>> which is some number of GB from 1 to 10 depending on number of disks
>>> available and raid profile selection (the striping raid profiles cap
>>> block group sizes at 10 disks, and single/raid1 profiles always use 1GB
>>> block groups regardless of disk count). So 2GB is _also_ a valid extent
>>> size limit, just not the first limit that is relevant for defrag.
>>>
>>> A lot of people get confused by 'filefrag -v' output, which coalesces
>>> physically adjacent but distinct extents. So if you use that tool,
>>> it can _seem_ like there is a 2.5GB extent in a file, but it is really
>>> 20 distinct 128MB extents that start and end at adjacent addresses.
>>> You can see the true structure in 'btrfs ins dump-tree' output.
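
(A toy illustration of that coalescing, with a made-up extent list rather
than real btrfs data:)

# Merge physically adjacent (start, length) extents the way a coalescing
# view of extents does, versus the raw on-disk list of distinct extents.
def coalesce(extents):
    merged = []
    for start, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_len = merged[-1]
            merged[-1] = (prev_start, prev_len + length)   # adjacent: merge
        else:
            merged.append((start, length))
    return merged

MB = 1024 * 1024
# 20 distinct 128 MB extents laid out back to back on disk:
raw = [(i * 128 * MB, 128 * MB) for i in range(20)]
print(len(raw), "raw extents ->", len(coalesce(raw)), "coalesced run")
print(coalesce(raw)[0][1] // MB, "MB total")

which is exactly the "20 distinct 128MB extents reported as one 2.5GB
extent" situation described above.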
>>>
>>> That also brings up another reason why 10GB defrags are absurd on btrfs:
>>> extent addresses are virtual. There's no guarantee that a pair of extents
>>> that meet at a block group boundary are physically adjacent, and after
>>> operations like RAID array reorganization or free space defragmentation,
>>> they are typically quite far apart physically.
>>>
>>>> I never said that I would force extents larger than 128 MB.
>>>>
>>>> If you are defragging a 10 GB file, you'll likely have to do it in 10
>>>> steps, because the defrag is usually allowed to use only a limited amount
>>>> of disk space while in operation. That has nothing to do with the extent
>>>> size.
>>>
>>> Defrag is literally manipulating the extent size. Fragments and extents
>>> are the same thing in btrfs.
>>>
>>> Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
>>> commit metadata updates after each step, so more than 128MB of temporary
>>> space may be used (especially if your disks are fast and empty,
>>> and you start just after the end of the previous commit interval).
>>> There are some opportunities to coalesce metadata updates, occupying up
>>> to an (arbitrary) limit of 512MB of RAM (or when memory pressure forces
>>> a flush, whichever comes first), but exploiting those opportunities
>>> requires more space for uncommitted data.
>>>
>>> If the filesystem starts to get low on space during a defrag, it can
>>> inject commits to force metadata updates to happen more often, which
>>> reduces the amount of temporary space needed (we can't delete the original
>>> fragmented extents until their replacement extent is committed); however,
>>> if the filesystem is so low on space that you're worried about running
>>> out during a defrag, then you probably don't have big enough contiguous
>>> free areas to relocate data into anyway, i.e. the defrag is just going to
>>> push data from one fragmented location to a different fragmented location,
>>> or bail out with "sorry, can't defrag that."
>>
>> Nope.
>>
>> Each defrag "cycle" consists of two parts:
>> 1) move-out part
>> 2) move-in part
>>
>> The move-out part selects one contiguous area of the disk. Almost
>> any area will do, but some smart choices are better. It then moves
>> all data out of that contiguous area into whatever holes are left
>> empty elsewhere on the disk. The biggest problem is actually
>> updating the metadata, since the updates are not localized.
>> Anyway, this part can even be skipped.
>>
>> The move-in part now populates the completely free contiguous area
>> with defragmented data.
>>
>> In the case that the move-out part needs to be skipped because the
>> defrag estimates that the update to metadata would be too big (as
>> in the pathological case of a disk with 156 GB of metadata), it can
>> successfully defrag by performing only the move-in part. In that
>> case, the move-in area is not free of data, and the "defragmented"
>> data won't be fully defragmented. Also, there should be at least
>> 20% free disk space in this case in order to avoid the defrag
>> turning pathological.
>>
>> But these are all pathological cases. They should be considered in
>> some other discussion.
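
To pin down the shape of that two-phase cycle, here is a toy sketch in
Python. The "disk" is just a list and every name is made up for
illustration; nothing below is real btrfs code, it only shows a move-out
followed by a move-in.

# Toy model of one defrag cycle: the "disk" is a list of block owners
# (None = free block, otherwise a file id).
def move_out(disk, lo, hi):
    # Phase 1: empty the contiguous area [lo, hi) by scattering its
    # blocks into whatever free holes exist elsewhere on the disk.
    holes = [i for i in range(len(disk)) if disk[i] is None and not lo <= i < hi]
    for i in range(lo, hi):
        if disk[i] is not None:
            disk[holes.pop()] = disk[i]    # any hole outside the area will do
            disk[i] = None

def move_in(disk, lo, hi, files):
    # Phase 2: rewrite whole files contiguously into the area.
    # Assumes the area was emptied by move_out; the "dirty area" variant
    # would additionally have to skip still-occupied blocks.
    cur = lo
    for fid in files:
        blocks = [i for i in range(len(disk)) if disk[i] == fid and not lo <= i < cur]
        if cur + len(blocks) > hi:
            break                          # area exhausted; end this cycle
        for i in blocks:
            disk[i] = None                 # release the old, fragmented copy
        for _ in blocks:
            disk[cur] = fid                # lay the file down contiguously
            cur += 1

# Files 1 and 2 interleaved with free space (a fragmented toy disk):
disk = [1, 2, None, 1, None, 2, 1, None, 2, None, None, None]
move_out(disk, 0, 6)
move_in(disk, 0, 6, files=[1, 2])
print(disk)   # -> [1, 1, 1, 2, 2, 2, None, None, None, None, None, None]

The block shuffling itself is the easy part; the real cost is the
metadata update that each relocation implies, which is what the
batching discussed below is about.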
>
> I know how to do this pathological case. Figured it out!
>
> Yeah, always ask General Zed, he knows the best!!!
>
> The move-in phase is not a problem, because this phase generally
> affects only a small number of files.
>
> So, let's consider the move-out phase. The main concern here is that
> the move-out area may contain so many different files and fragments
> that the move-out forces a practically undoable metadata update.
>
> So, the way to do it is to select files for move-out one by one (or
> at a finer granularity, by fragments of files), while keeping track
> of the size of the necessary metadata update. When the metadata
> update exceeds a certain amount (let's say 128 MB, an amount that
> can easily fit into RAM), the move-out is performed with only the
> currently selected files (file fragments). (The move-out often
> doesn't affect a whole file, since only a part of each file lies
> within the move-out area.)
>
> Now the defrag has to decide whether to continue with another round
> of the move-out to get a cleaner move-in area (by repeating the same
> procedure above), or to proceed with a move-in into a partially
> dirty area. I can't tell you which is better right now, as this can
> be determined only by experiment.
>
> Lastly, the move-in phase is performed (it can be done whether the
> move-in area is dirty or completely clean). Again, the same trick
> can be used: files are selected one by one until the calculated
> metadata update exceeds 128 MB. However, it is more likely that the
> move-in area will be exhausted before this happens.
>
> This algorithm will work even if you have only 3% free disk space left.
>
> This algorithm will also work if you have metadata of huge size, but
> in that case it is better to have much more free disk space (20%) to
> avoid significantly slowing down the defrag operation.
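
For concreteness, a minimal sketch of that batching idea: keep selecting
candidates until the estimated metadata update reaches the budget, then
perform the move and commit, and start a new batch. The helper names and
the costs in the example are made up; real code would have to compute the
actual b-tree changes each relocation implies.

MB = 1024 * 1024
METADATA_BUDGET = 128 * MB   # pending metadata that comfortably fits in RAM

def run_in_batches(candidates, estimate_metadata_update, perform_and_commit):
    batch, pending = [], 0
    for c in candidates:
        cost = estimate_metadata_update(c)   # metadata bytes this move would dirty
        if batch and pending + cost > METADATA_BUDGET:
            perform_and_commit(batch)        # flush what we have so far
            batch, pending = [], 0
        batch.append(c)
        pending += cost
    if batch:
        perform_and_commit(batch)            # final partial batch

# Example with made-up costs: 2000 fragments, each dirtying ~100 KB of metadata.
frags = range(2000)
run_in_batches(frags,
               estimate_metadata_update=lambda f: 100 * 1024,
               perform_and_commit=lambda b: print(f"moving {len(b)} fragments"))

The same loop serves both phases: feed it move-out candidates to clear
the area, then move-in candidates to fill it.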
I have just thought up an even better algorithm than this, one which
gets to the fully defragged state faster, in a smaller number of disk
writes. But I won't write it down unless someone says: "Thanks for your
effort so far, General Zed. Can you please tell us about your great new
defrag algorithm for low free-space conditions?"