Linux-BTRFS Archive on lore.kernel.org
From: General Zed <general-zed@zedlx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Chris Murphy <lists@colorremedies.com>,
	"Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Fri, 13 Sep 2019 21:28:49 -0400
Message-ID: <20190913212849.Horde.PHJTyaXyvRA0Reaq2YtVdvS@server53.web-hosting.com>
In-Reply-To: <20190914005931.GI22121@hungrycats.org>


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Fri, Sep 13, 2019 at 05:25:20AM -0400, General Zed wrote:
>>
>> Quoting General Zed <general-zed@zedlx.com>:
>>
>> > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> >
>> > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > > >
>> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > > >
>> > > > > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
>> > > > > >
>> > > > > > At worst, it just has to completely write-out "all metadata",
>> > > > > > all the way up to the super. It needs to be done just once,
>> > > > > > because what's the point of writing it 10 times over? Then,
>> > > > > > the super is updated as the final commit.
>> > > > >
>> > > > > This is kind of a silly discussion.  The biggest extent possible on
>> > > > > btrfs is 128MB, and the incremental gains of forcing 128MB extents
>> > > > > to be consecutive are negligible.  If you're defragging a 10GB file,
>> > > > > you're just going to end up doing 80 separate defrag operations.
>> > > >
>> > > > Ok, then the max extent is 128 MB, that's fine. Someone here
>> > > > previously said that it is 2 GB, so he has disinformed me (in order
>> > > > to further his false argument).
>> > >
>> > > If the 128MB limit is removed, you then hit the block group size limit,
>> > > which is some number of GB from 1 to 10 depending on number of disks
>> > > available and raid profile selection (the striping raid profiles cap
>> > > block group sizes at 10 disks, and single/raid1 profiles always use 1GB
>> > > block groups regardless of disk count).  So 2GB is _also_ a valid extent
>> > > size limit, just not the first limit that is relevant for defrag.
>> > >
>> > > A lot of people get confused by 'filefrag -v' output, which coalesces
>> > > physically adjacent but distinct extents.  So if you use that tool,
>> > > it can _seem_ like there is a 2.5GB extent in a file, but it is really
>> > > 20 distinct 128MB extents that start and end at adjacent addresses.
>> > > You can see the true structure in 'btrfs ins dump-tree' output.
>> > >
>> > > That also brings up another reason why 10GB defrags are absurd on btrfs:
>> > > extent addresses are virtual.  There's no guarantee that a pair of extents
>> > > that meet at a block group boundary are physically adjacent, and after
>> > > operations like RAID array reorganization or free space defragmentation,
>> > > they are typically quite far apart physically.
>> > >
>> > > > I never said that I would force extents larger than 128 MB.
>> > > >
>> > > > If you are defragging a 10 GB file, you'll likely have to do it
>> > > > in 10 steps,
>> > > > because the defrag is usually allowed to only use a limited amount of
>> > > > disk space while in operation. That has nothing to do with the extent size.
>> > >
>> > > Defrag is literally manipulating the extent size.  Fragments and extents
>> > > are the same thing in btrfs.
>> > >
>> > > Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
>> > > commit metadata updates after each step, so more than 128MB of temporary
>> > > space may be used (especially if your disks are fast and empty,
>> > > and you start just after the end of the previous commit interval).
>> > > There are some opportunities to coalesce metadata updates, occupying up
>> > > to an (arbitrary) limit of 512MB of RAM (or when memory pressure forces
>> > > a flush, whichever comes first), but exploiting those opportunities
>> > > requires more space for uncommitted data.
>> > >
>> > > If the filesystem starts to get low on space during a defrag, it can
>> > > inject commits to force metadata updates to happen more often, which
>> > > reduces the amount of temporary space needed (we can't delete the original
>> > > fragmented extents until their replacement extent is committed); however,
>> > > if the filesystem is so low on space that you're worried about running
>> > > out during a defrag, then you probably don't have big enough contiguous
>> > > free areas to relocate data into anyway, i.e. the defrag is just going to
>> > > push data from one fragmented location to a different fragmented location,
>> > > or bail out with "sorry, can't defrag that."
>> >
>> > Nope.
>> >
>> > Each defrag "cycle" consists of two parts:
>> >      1) move-out part
>> >      2) move-in part
>> >
>> > The move-out part selects one contiguous area of the disk. Almost any
>> > area will do, but some smart choices are better. It then moves out all
>> > data from that contiguous area into whatever holes are left free
>> > on the disk. The biggest problem is actually updating the metadata,
>> > since the updates are not localized.
>> > Anyway, this part can even be skipped.
>> >
>> > The move-in part now populates the completely free contiguous area with
>> > defragmented data.
>> >
>> > In the case that the move-out part needs to be skipped because the
>> > defrag estimates that the update to metadata will be too big (like in
>> > the pathological case of a disk with 156 GB of metadata), it can
>> > successfully defrag by performing only the move-in part. In that case,
>> > the move-in area is not free of data and "defragmented" data won't be
>> > fully defragmented. Also, there should be at least 20% free disk space
>> > in this case in order to avoid defrag turning pathological.
>> >
>> > But these are all pathological cases. They should be considered in
>> > some other discussion.
>>
>> I know how to handle this pathological case. Figured it out!
>>
>> Yeah, always ask General Zed, he knows the best!!!
>>
>> The move-in phase is not a problem, because this phase generally affects a
>> small number of files.
>>
>> So, let's consider the move-out phase. The main concern here is that the
>> move-out area may contain so many different files and fragments that the
>> move-out forces a practically undoable metadata update.
>>
>> So, the way to do it is to select files for move-out, one by one (or even
>> more granular, by fragments of files), while keeping track of the size of
>> the necessary metadata update. When the metadata update exceeds a certain
>> amount (let's say 128 MB, an amount that can easily fit into RAM), the
>> move-out is performed with only currently selected files (file fragments).
>> (The move-out often doesn't affect a whole file since only a part of each
>> file lies within the move-out area).
>
> This move-out phase sounds like a reinvention of btrfs balance.  Balance
> already does something similar, and python-btrfs gives you a script to
> target block groups with high free space fragmentation for balancing.
> It moves extents (and their references) away from their block group.
> You get GB-sized (or multi-GB-sized) contiguous free space areas into
> which you can then allocate big extents.

Perhaps btrfs balance needs to perform something similar, but I can  
assure you that a balance cannot replace the defrag.

The point and the purpose of "move out" is to create a clean  
contiguous free space area, so that defragmented files can be written  
into it.
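
To make this concrete, here is a minimal sketch of the budgeted move-out
selection (plain illustrative Python, not btrfs code: the Extent tuple and
its metadata_cost field are stand-ins I invented, and "relocating" a batch
is only recorded here, not performed):

# Sketch: plan the move-out of one contiguous area in batches, committing
# whenever the estimated metadata update would exceed a 128 MB budget.
from collections import namedtuple

Extent = namedtuple("Extent", "start length metadata_cost")  # cost = est. tree-update bytes

METADATA_BUDGET = 128 * 1024 * 1024  # commit once pending updates reach ~128 MB

def plan_move_out(extents_in_area):
    batches, batch, pending = [], [], 0
    for ext in extents_in_area:
        if batch and pending + ext.metadata_cost > METADATA_BUDGET:
            batches.append(batch)      # real code: relocate the batch, then commit metadata
            batch, pending = [], 0
        batch.append(ext)
        pending += ext.metadata_cost
    if batch:
        batches.append(batch)
    return batches                     # every batch fits the metadata budget

# Example: three extents whose metadata costs force two batches.
demo = [Extent(0, 4096, 100 << 20), Extent(4096, 4096, 30 << 20), Extent(8192, 4096, 10 << 20)]
print([len(b) for b in plan_move_out(demo)])   # -> [1, 2]

The same loop works at file-fragment granularity; only the per-item estimate
of the metadata delta changes.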

>> Now the defrag has to decide: whether to continue with another round of the
>> move-out to get a cleaner move-in area (by repeating the same procedure
>> above), or to continue with a move-in into a partially dirty area. I
>> can't tell you what's better right now, as this can be determined only by
>> experiments.
>>
>> Lastly, the move-in phase is performed (can be done whether the move-in area
>> is dirty or completely clean). Again, the same trick can be used: files can
>> be selected one by one until the calculated metadata update exceeds 128 MB.
>> However, it is more likely that the size of the move-in area will be exhausted
>> before this happens.
>>
>> This algorithm will work even if you have only 3% free disk space left.
>
> I was thinking more like "you have less than 1GB free on a 1TB filesystem
> and you want to defrag 128MB things", i.e. <0.1% free space.  If you don't
> have all the metadata block group free space you need allocated already
> by that point, you can run out of metadata space and the filesystem goes
> read-only.  Happens quite often to people.  They don't like it very much.

The defrag should abort whenever it detects such adverse conditions as  
0.1% free disk space. In fact, it should probably abort as soon as it  
detects less than 3% free disk space. This is normal and expected. If  
the user has a partition with less than 3% free disk space, he/she  
should not defrag it until he/she frees some space, perhaps by  
deleting unnecessary data or by moving out some data to other  
partitions.

This is not autodefrag. The defrag operation is an on-demand  
operation. It has certain requirements in order to function.
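
As a rough illustration of that precondition check (plain Python again,
nothing btrfs-specific: os.statvfs is used only to read the free-space
ratio, and the 3% and 20% thresholds are the figures argued above):

import os

MIN_FREE = 0.03      # abort below 3% free space
COMFORT_FREE = 0.20  # below 20% free, expect the defrag to slow down

def check_defrag_preconditions(mountpoint):
    st = os.statvfs(mountpoint)
    free = st.f_bavail / st.f_blocks   # rough fraction of blocks still free
    if free < MIN_FREE:
        raise RuntimeError(f"refusing to defrag {mountpoint}: only {free:.1%} free")
    if free < COMFORT_FREE:
        print(f"warning: {free:.1%} free on {mountpoint}, defrag may be slow")

# check_defrag_preconditions("/mnt/volume")   # hypothetical mount point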

>> This algorithm will also work if you have metadata of huge size, but in that
>> case it is better to have much more free disk space (20%) to avoid
>> significantly slowing down the defrag operation.
>>
>>

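Finally, to spell out the move-in selection quoted above, here is the matching
sketch (same caveats: illustrative Python with an invented File tuple and cost
field; real code would write the defragmented extents into the target area and
commit the corresponding metadata once per batch):

from collections import namedtuple

File = namedtuple("File", "name size metadata_cost")

METADATA_BUDGET = 128 * 1024 * 1024   # same budget as in the move-out phase

def plan_move_in(area_size, candidates):
    # Pick files for one move-in pass until either the area or the
    # metadata budget runs out; the area is usually exhausted first.
    chosen, used, pending = [], 0, 0
    for f in candidates:                       # e.g. most-fragmented files first
        if used + f.size > area_size:
            break
        if pending + f.metadata_cost > METADATA_BUDGET:
            break                              # real code: commit, then start another pass
        chosen.append(f)
        used += f.size
        pending += f.metadata_cost
    return chosen

demo = [File("a", 600 << 20, 8 << 20), File("b", 500 << 20, 6 << 20)]
print([f.name for f in plan_move_in(1 << 30, demo)])   # 1 GiB area -> ['a']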


