Linux-BTRFS Archive on lore.kernel.org
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: webmaster@zedlx.com
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Tue, 10 Sep 2019 08:48:40 +0800
Message-ID: <3666d54b-76f7-9eee-4fb6-36c1dcc37fe9@gmx.com> (raw)
In-Reply-To: <20190909200638.Horde.GlzWP3_SqKPkxpAfp05Rsz7@server53.web-hosting.com>


On 2019/9/10 8:06 AM, webmaster@zedlx.com wrote:
> 
> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
> 
>> On 2019/9/10 12:38 AM, webmaster@zedlx.com wrote:
>>>
>>> Quoting Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>
>>>>>>> 2) Sensible defrag
>>>>>>> The defrag is currently a joke.
>>>>>
>>>>> Maybe there are such cases, but I would say that a vast majority of
>>>>> users (99,99%) in a vast majority of cases (99,99%) don't want the
>>>>> defrag operation to reduce free disk space.
>>>>>
>>>>>> What's wrong with current file based defrag?
>>>>>> If you want to defrag a subvolume, just iterate through all files.
>>>>>
>>>>> I repeat: The defrag should not decrease free space. That's the
>>>>> 'normal'
>>>>> expectation.
>>>>
>>>> Since you're talking about btrfs, it's going to do CoW for metadata no
>>>> matter what; as long as you're going to change anything, btrfs will
>>>> cause extra space usage.
>>>> (The final result may not use extra disk space, as the freed space is
>>>> as large as the newly allocated space, but to maintain CoW the newly
>>>> allocated space can't overlap with the old data.)
>>>
>>> It is OK for defrag to temporarily decrease free space while defrag
>>> operation is in progress. That's normal.
>>>
>>>> Furthermore, talking about snapshots with space wasted by extent
>>>> booking, it's definitely possible that a user wants to break the
>>>> shared extents:
>>>>
>>>> Subvol 257, inode 257 has the following file extents:
>>>> (257 EXTENT_DATA 0)
>>>> disk bytenr X len 16M
>>>> offset 0 num_bytes 4K  << Only 4K is referenced in the whole 16M extent.
>>>>
>>>> Subvol 258, inode 257 has the following file extents:
>>>> (257 EXTENT_DATA 0)
>>>> disk bytenr X len 16M
>>>> offset 0 num_bytes 4K  << Shared with that one in subv 257
>>>> (257 EXTENT_DATA 4K)
>>>> disk bytenr Y len 16M
>>>> offset 0 num_bytes 4K  << Similar case, only 4K of 16M is used.
>>>>
>>>> In that case, the user definitely wants to defrag the file in subvol
>>>> 258: if the extent at bytenr Y can be freed, we free up 16M and only
>>>> allocate a new 8K extent for subvol 258, ino 257.
>>>> (And they will also want to defrag the extent in subvol 257, ino 257.)
>>>
>>> You are confusing the actual defrag with a separate concern, let's call
>>> it 'reserved space optimization'. It is about partially used extents.
>>> The actual name 'reserved space optimization' doesn't matter, I just
>>> made it up.
>>
>> Then when it's not snapshotted, it's plain defrag.
>>
>> How do things go from "reserved space optimization" to "plain defrag"
>> just because of snapshots?
> 
> I'm not sure that I'm still following you here.
> 
> I'm just saying that when you have some unused space within an extent
> and you want the defrag to free it up, that is OK, but such thing is not
> the main focus of the defrag operation. So you are giving me some edge
> case here which is half-relevant and it can be easily solved. The extent
> just needs to be split up into pieces, it's nothing special.
> 
>>> 'reserved space optimization' is usually performed as a part of the
>>> defrag operation, but it doesn't have to be, as the actual defrag is
>>> something separate.
>>>
>>> Yes, 'reserved space optimization' can break up extents.
>>>
>>> 'reserved space optimization' can either decrease or increase the free
>>> space. If the algorithm determines that more space should be reserved,
>>> then free space will decrease. If the algorithm determines that less
>>> space should be reserved, then free space will increase.
>>>
>>> The 'reserved space optimization' can be accomplished such that the free
>>> space does not decrease, if such behavior is needed.
>>>
>>> Also, the defrag operation can join some extents. In my original
>>> example,
>>> the extents e33 and e34 can be fused into one.
>>
>> Btrfs defrag works by creating new extents containing the old data.
>>
>> So if btrfs decides to defrag, no old extents will be used.
>> It will all be new extents.
>>
>> That's why your proposal is freaking strange here.
> 
> Ok, but: can the NEW extents still be shared?

New extents can only be shared through reflink.
That doesn't happen automatically, so if btrfs decides to defrag, the new
extents will not be shared at all.
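
To illustrate why that costs space, here is a toy Python model (not btrfs
code; the `Extent` class and the `defrag_file` helper are made up for this
sketch). Defragmenting one file copies its data into a fresh extent that
only that file references, while the other files keep the old extent alive.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    """Toy stand-in for a data extent: a size plus the set of files using it."""
    size: int
    refs: set = field(default_factory=set)


def used_bytes(extents):
    """Space is held by every extent that still has at least one referencer."""
    return sum(e.size for e in extents if e.refs)


def defrag_file(extents, old, name):
    """Model CoW defrag of one file: copy its data into a new, unshared extent."""
    new = Extent(size=old.size, refs={name})
    old.refs.discard(name)
    extents.append(new)
    return new


# One 16 MiB extent shared by four files, e.g. through snapshots or reflink.
MiB = 1024 * 1024
shared = Extent(size=16 * MiB, refs={"a", "b", "c", "d"})
extents = [shared]
before = used_bytes(extents)

defrag_file(extents, shared, "a")
after = used_bytes(extents)

print(before // MiB, after // MiB)  # 16 32
```

In this model the old extent is only released once every referencer has
been moved away from it.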

> If you had an extent E88
> shared by 4 files in different subvolumes, can it be copied to another
> place and still be shared by the original 4 files?

Not with current btrfs.

> I guess that the
> answer is YES. And, that's the only requirement for a good defrag
> algorithm that doesn't shrink free space.

We may go in that direction.

The biggest burden here is that btrfs needs to do an expensive
full-backref walk to determine how many files are referring to this
extent, and then change them all to refer to the new extent.

It's feasible if the extent is not shared by many,
e.g. if the extent is only shared by ~10 or ~50 subvolumes/files.

But what will happen if it's shared by 1000 subvolumes? That would be a
performance burden.
And trust me, we have already experienced such a disaster in qgroup,
that's why we want to avoid such cases.

Another problem is, what if some of the subvolumes are read-only? Should
we touch them or not? (I guess not.)
Then the defrag will not be complete: badly fragmented extents will
still be in RO subvols.

So the devil is still in the details, again and again.
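
A rough sketch of what the "move the extent and keep it shared" approach
would involve, again as a toy Python model rather than btrfs code (`Ref`,
`relocate_shared_extent` and the `read_only` flag are hypothetical names).
Every referencer found by the backref walk has to be rewritten, so the work
grows with the number of referencers, and any read-only subvolume that is
skipped keeps the old extent pinned.

```python
from dataclasses import dataclass


@dataclass
class Ref:
    """One file extent item pointing at a data extent (toy model)."""
    subvol: int
    inode: int
    extent: str
    read_only: bool = False


def relocate_shared_extent(refs, old, new):
    """Re-point every writable referencer of `old` to `new`.

    Returns (updated, skipped). The loop stands in for the backref walk:
    its cost is linear in the number of referencers.
    """
    updated = skipped = 0
    for ref in refs:
        if ref.extent != old:
            continue
        if ref.read_only:
            skipped += 1  # RO subvolume: left untouched
            continue
        ref.extent = new
        updated += 1
    return updated, skipped


# 1000 snapshots share extent "X"; every tenth one is read-only.
refs = [Ref(subvol=256 + i, inode=257, extent="X", read_only=(i % 10 == 0))
        for i in range(1000)]

updated, skipped = relocate_shared_extent(refs, "X", "Y")
old_still_referenced = any(r.extent == "X" for r in refs)
print(updated, skipped, old_still_referenced)  # 900 100 True
```

With the read-only referencers left in place, both the old and the new
extent stay allocated, so the defrag ends up costing space after all.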

> 
> Perhaps the metadata extents need to be unshared. That is OK. But I
> guess that after a typical defrag, the sharing ratio in metadata wouldn't
> change much.

Metadata (tree blocks) in btrfs always gets unshared as soon as you
modify the tree.
But indeed, the ratio isn't that high.

> 
>>>> That's why knowledge in btrfs tech details can make a difference.
>>>> Sometimes you may find some ideas are brilliant and why btrfs is not
>>>> implementing it, but if you understand btrfs to some extent, you will
>>>> know the answer by yourself.
>>>
>>> Yes, it is true, but what you are posting so far are all 'red
>>> herring'-type arguments. It's just some irrelevant concerns, and you
>>> just got me explaining things like I would to a little baby. I don't
>>> know whether I stumbled on some rookie member of btrfs project, or you
>>> are just lazy and you don't want to think or consider my proposals.
>>
>> Go check my name in git log.
> 
> I didn't check yet. Ok, let's just try to communicate here, I'm dead
> serious.
> 
> I can't understand a defrag that substantially decreases free space. I
> mean, each such defrag is a lottery, because you might end up with
> practically unusable file system if the partition fills up.
> 
> CURRENT DEFRAG IS A LOTTERY!
> 
> How bad is that?
>

Now you see why btrfs defrag has a problem.

On one hand, guys like you don't want to unshare extents. I understand,
and it makes sense to some extent. It also used to be the default behavior.

On the other hand, btrfs has to CoW extents to do defrag, and we have
extreme cases where we want to defrag shared extents even if it's going
to decrease free space.

And I have to admit, my memory made the discussion a little off-topic,
as I still remember that some older kernels don't touch shared extents
at all.

So here is what we could do (from easy to hard):
- Introduce an interface to allow defrag not to touch shared extents
  It shouldn't be that difficult compared to the other work we are going
  to do.
  At least users would have a choice.

- Introduce different levels for defrag
  Allow btrfs to do some calculation and apply a space usage policy to
  determine if it's a good idea to defrag some shared extents.
  E.g. in my extreme case, unsharing the extent would make it possible to
  defrag the other subvolume and free a huge amount of space.
  As a compromise, let users choose if they want to sacrifice some space.

- Ultimate super-duper cross-subvolume defrag
  Defrag could also automatically change all the referencers.
  That's why we call it ultimate super-duper, but as I already mentioned,
  it's a big performance problem, and if RO subvolumes are involved, it'll
  get super tricky.
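
The three options could be sketched roughly like this. It is a toy Python
model, not an existing btrfs interface: the `DefragMode` names, the
`should_defrag` helper and its parameters are all made up for illustration.

```python
from enum import Enum


class DefragMode(Enum):
    SKIP_SHARED = 1      # never touch shared extents
    SPACE_POLICY = 2     # unshare only when the space trade-off is acceptable
    CROSS_SUBVOLUME = 3  # rewrite all referencers (the expensive option)


def should_defrag(mode, shared, new_bytes, freed_bytes, max_extra_bytes=0):
    """Decide whether to defrag one extent.

    new_bytes:   space the rewritten data would allocate
    freed_bytes: space released if an old extent loses its last reference
    """
    if not shared:
        return True
    if mode is DefragMode.SKIP_SHARED:
        return False
    if mode is DefragMode.SPACE_POLICY:
        return new_bytes - freed_bytes <= max_extra_bytes
    return True  # CROSS_SUBVOLUME: always, at the cost of a backref walk


KiB, MiB = 1024, 1024 * 1024

# Extreme case quoted above: rewriting 8 KiB lets a 16 MiB extent be freed.
extreme = should_defrag(DefragMode.SPACE_POLICY, True, 8 * KiB, 16 * MiB)
# Ordinary case: unsharing 16 MiB frees nothing, so it is refused by default.
ordinary = should_defrag(DefragMode.SPACE_POLICY, True, 16 * MiB, 0)
# The simplest mode refuses any shared extent, whatever the trade-off.
skipped = should_defrag(DefragMode.SKIP_SHARED, True, 8 * KiB, 16 * MiB)

print(extreme, ordinary, skipped)  # True False False
```

The `max_extra_bytes` knob stands for the "let users choose" part: raising
it lets defrag sacrifice that much free space per extent.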



