Linux-BTRFS Archive on lore.kernel.org
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: webmaster@zedlx.com
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Tue, 10 Sep 2019 09:48:47 +0800
Message-ID: <3978da3b-bb62-4995-bc46-785446d59265@gmx.com> (raw)
In-Reply-To: <20190909212434.Horde.S2TAotDdK47dqQU5ejS2402@server53.web-hosting.com>




On 2019/9/10 9:24 AM, webmaster@zedlx.com wrote:
> 
> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
> 
>>>> Btrfs defrag works by creating new extents containing the old data.
>>>>
>>>> So if btrfs decides to defrag, no old extents will be used.
>>>> It will all be new extents.
>>>>
>>>> That's why your proposal is freaking strange here.
>>>
>>> Ok, but: can the NEW extents still be shared?
>>
>> They can only be shared via reflink, not automatically. So if btrfs
>> decides to defrag, the new extents will not be shared at all.
>>
>>> If you had an extent E88
>>> shared by 4 files in different subvolumes, can it be copied to another
>>> place and still be shared by the original 4 files?
>>
>> Not for current btrfs.
>>
>>> I guess that the
>>> answer is YES. And, that's the only requirement for a good defrag
>>> algorithm that doesn't shrink free space.
>>
>> We may go that direction.
>>
>> The biggest burden here is, btrfs needs to do expensive full-backref
>> walk to determine how many files are referring to this extent.
>> And then change them all to refer to the new extent.
> 
> YES! That! Exactly THAT. That is what needs to be done.
> 
> I mean, you just create a (perhaps associative) array which links an
> extent (the array index contains the extent ID) to all the files that
> reference that extent.
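
Stripped to its essentials, the quoted idea is a reverse map built in one
forward pass. A minimal sketch (illustrative Python, not btrfs code; the
file and extent names are made up):

```python
from collections import defaultdict

# Naive sketch of the quoted proposal: one forward scan of per-file
# metadata builds an extent -> referencing-files map.
def build_extent_map(files):
    """files: mapping of filename -> list of extent IDs it uses."""
    extent_refs = defaultdict(set)
    for name, extents in files.items():
        for ext in extents:
            extent_refs[ext].add(name)
    return extent_refs

refs = build_extent_map({
    "a.txt": ["E88", "E90"],
    "b.txt": ["E88"],          # E88 is shared by two files
})
```

The catch is that this assumes every reference is recorded explicitly per
file, which is exactly the assumption that btrfs's hidden backrefs violate.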

You've fallen exactly into the pitfall of btrfs backref walking.

For btrfs, a backref walk is definitely not easy work.
btrfs uses hidden backrefs, which means that in most cases an extent
shared by 1000 snapshots can have just one ref recorded in the extent
tree (which stores the backrefs), for the initial subvolume.

For btrfs, you need to walk up the tree to find out how it's shared.

It has to be done that way; that's why we call it a backref *walk*.

E.g.:
          A (subvol 257)     B (Subvol 258, snapshot of 257)
          |    \        /    |
          |        X         |
          |    /        \    |
          C                  D
         / \                / \
        E   F              G   H

In the extent tree, E is only referred to by subvol 257,
while C has two referencers, 257 and 258.

So in reality, you need to:
1) Do a tree search from subvol 257.
   You get a path: E -> C -> A.
2) Check each node to see if it's shared.
   E is only referred to by C, with no extra referencer.
   C is referred to by two tree blocks, A and B.
   A is referred to by subvol 257.
   B is referred to by subvol 258.
   So E is shared by 257 and 258.

Now you see how things go mad: for every extent you must walk this way
to determine its real owners, not to mention that a tree can have up to
8 levels, and tree blocks at levels 0~7 can all be shared.

If it's shared by 1000 subvolumes, I hope you have a good day then.
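
The two steps above can be sketched as an upward walk over the example
diagram. This is a hedged illustration in Python, not btrfs code; the node
names follow the diagram, and `referrers`/`owners` are made-up names:

```python
# Each tree block maps to the blocks (or subvolume roots) that refer
# to it. Finding an extent's real owners means walking *up* until
# only roots remain.
referrers = {
    "E": ["C"], "F": ["C"], "G": ["D"], "H": ["D"],
    "C": ["A", "B"],            # C is shared by both trees
    "D": ["A", "B"],
    "A": ["subvol-257"],        # tree roots
    "B": ["subvol-258"],
}

def owners(node):
    """Collect every subvolume root reachable by walking upward."""
    roots = set()
    for parent in referrers.get(node, []):
        if parent.startswith("subvol-"):
            roots.add(parent)
        else:
            roots |= owners(parent)     # recurse one level up
    return roots
```

Walking up from E visits C, then both A and B, discovering subvols 257 and
258. With 1000 snapshots, every shared interior block fans out to 1000
parents, which is where the cost explodes.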

> 
> To initialize it, you do one single walk through the entire b-tree.
> 
> Then the data you require can be retrieved in an instant.

"In an instant"? Think again after reading the backref walk explanation above.

> 
>> It's feasible if the extent is not shared by many.
>> E.g. the extent is only shared by ~10 or ~50 subvolumes/files.
>>
>> But what will happen if it's shared by 1000 subvolumes? That would be a
>> performance burden.
>> And trust me, we have already experienced such a disaster in qgroups;
>> that's why we want to avoid such cases.
> 
> Um, I don't quite get where this 'performance burden' is coming from.

That's why I say you need to understand the technical details of btrfs.

> If you mean that moving a single extent requires rewriting a lot of
> b-trees, then perhaps it could be solved by moving extents in bigger
> batches. So, for example, you move (create new) extents, but you do
> that for 100 megabytes of extents at a time, and then you update the
> b-trees. That way, there would be far fewer b-tree writes to disk.
> 
> Also, if the defrag detects 1000 subvolumes, it can warn the user.
> 
> By the way, isn't the current recommendation to stay below 100
> subvolumes? So if defrag can handle 100 subvolumes, that is great. The
> defrag doesn't need to handle 1000. If there are 1000 subvolumes, then
> the user should delete most of them before doing a defrag.
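
The batching idea in the quoted paragraph could be sketched like this
(illustrative Python only; `Defragmenter`, `flush`, and the 100 MiB
threshold are hypothetical names, not a btrfs interface):

```python
# Queue extent relocations and rewrite the metadata once per batch
# instead of once per extent, trading memory for fewer tree writes.
BATCH_BYTES = 100 * 1024 * 1024  # flush after ~100 MiB of moved data

class Defragmenter:
    def __init__(self):
        self.pending = []        # (extent_id, new_location) tuples
        self.pending_bytes = 0
        self.tree_commits = 0    # count of metadata write-outs

    def move_extent(self, extent_id, size, new_location):
        self.pending.append((extent_id, new_location))
        self.pending_bytes += size
        if self.pending_bytes >= BATCH_BYTES:
            self.flush()

    def flush(self):
        """One pass updates every reference touched by the batch."""
        if not self.pending:
            return
        self.tree_commits += 1
        self.pending.clear()
        self.pending_bytes = 0
```

Moving 150 one-MiB extents this way costs two metadata write-outs instead
of 150, which is the whole point of the proposal.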
> 
>> Another problem is: what if some of the subvolumes are read-only? Should
>> we touch them or not? (I guess not.)
> 
> I guess YES. Unless the user overrides it with some switch.
> 
>> Then the defrag will not be complete. Badly fragmented extents will
>> still be in RO subvols.
> 
> Let the user choose!
> 
>> So the devil is still in the details, again and again.
> 
> Ok, let's flesh out some details.
> 
>>> I can't understand a defrag that substantially decreases free space. I
>>> mean, each such defrag is a lottery, because you might end up with a
>>> practically unusable file system if the partition fills up.
>>>
>>> CURRENT DEFRAG IS A LOTTERY!
>>>
>>> How bad is that?
>>>
>>
>> Now you see why btrfs defrag has problem.
>>
>> On one hand, guys like you don't want to unshare extents. I understand,
>> and it makes sense to some extent. It also used to be the default behavior.
>>
>> On the other hand, btrfs has to CoW extents to do defrag, and we have
>> extreme cases where we want to defrag shared extents even if it's going
>> to decrease free space.
>>
>> And I have to admit, my memory made the discussion a little off-topic,
>> as I recall that some older kernels don't touch shared extents at
>> all.
>>
>> So here's what we could do (from easy to hard):
>> - Introduce an interface to allow defrag not to touch shared extents.
>>   It shouldn't be that difficult compared to the other work we are
>>   going to do.
>>   At least users would have a choice.
> 
> That defrag wouldn't accomplish much. You can call it defrag, but it is
> more like nothing happens.

If one subvolume is not shared by snapshots or reflinks at all, I'd say
that's exactly what the user wants.

> 
>> - Introduce different levels of defrag
>>   Allow btrfs to do some calculation and apply a space usage policy to
>>   determine whether it's a good idea to defrag some shared extents.
>>   E.g. in my extreme case, unsharing the extent would make it possible
>>   to defrag the other subvolume and free a huge amount of space.
>>   A compromise: let the user choose whether to sacrifice some space.
> 
> Meh. You can always defrag one chosen subvolume perfectly, without
> unsharing any file extents.

If the subvolume is shared by another snapshot, you always face the
decision of whether to unshare.
It's unavoidable.

The only question is whether unsharing is worth it.

> So, since it can be done perfectly without unsharing, why unshare at all?

No, you can't.

Go check my initial "red-herring" case.

> 
>> - Ultimate super-duper cross-subvolume defrag
>>   Defrag could also automatically update all the referencers.
>>   That's why we call it ultimate super-duper, but as I already mentioned,
>>   it's a big performance problem, and if RO subvolumes are involved, it
>>   gets super tricky.
> 
> Yes, that is what's needed. I don't really see where the big problem is.
> I mean, it is just a defrag, like any other. Nothing special.
> The usual defrag algorithm is somewhat complicated, but I don't see why
> this one is much worse.
> 
> OK, if RO subvolumes are tricky, then exclude them for the time being.
> So later, after many years, maybe someone will add the code for this
> tricky RO case.

