From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Tomasz Pala <gotar@polanet.pl>, linux-btrfs@vger.kernel.org
Subject: Re: exclusive subvolume space missing
Date: Mon, 11 Dec 2017 08:24:16 +0800
Message-ID: <599e2f5d-8e78-9b59-879c-6ba375510508@gmx.com>
In-Reply-To: <f9d281bb-e8a9-77c3-ab29-6fda9e5228ab@gmx.com>


On 2017-12-11 07:44, Qu Wenruo wrote:
> 
> 
> On 2017-12-10 19:27, Tomasz Pala wrote:
>> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
>>
>>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>>
>>> IIRC, no.
>>
>> I have found a directory - the pam_abl databases - which occupies 10 MB
>> (yes, TEN megabytes) and which released ...8.7 GB (almost NINE gigabytes)
>> after defrag. After defragging, the files were not snapshotted again,
>> yet I lost 3.6 GB again overnight, so this is fully reproducible.
>> There are 7 files, one of which accounts for 99% of the space (10 MB).
>> None of them has nocow set, so they're riding all-btrfs.
>>
>> I could debug something before I clean this up - is there anything you
>> want me to check, or anything about the files you want to know?
> 
> The fiemap result, along with the btrfs dump-tree -t2 result.
> 
> Neither output contains anything related to file names/dir names, only
> some "meaningless" bytenrs, so it should be completely OK to share them.
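>
> For example, a minimal sketch of how to gather both (the file path and
> the device are placeholders for your actual setup):
>
>   # extent map of one of the pam_abl files, via the FIEMAP ioctl
>   filefrag -v /path/to/pam_abl/db-file
>   # dump the extent tree (tree 2); it contains bytenrs only, no names
>   btrfs inspect-internal dump-tree -t 2 /dev/sdX > extent-tree.txt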
> 
>>
>> The fragmentation impact is HUGE here; a 1000x ratio is almost a DoS
>> condition, which a malicious user could trigger within a few hours
>> or faster
> 
> You won't want to hear this:
> the biggest ratio in theory is 128M / 4K = 32768.
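>
> That is, the largest data extent btrfs creates is 128MiB, and a single
> 4KiB overwrite can keep the whole old extent pinned:
>
>   128MiB / 4KiB = (128 * 1024 KiB) / 4 KiB = 32768
>
> so one page of live data can, in the worst case, hold 32768 times its
> own size on disk.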
> 
>> - I've lost 3.6 GB during the night with a reasonably small
>> amount of writes; I guess it might be possible to trash an entire
>> filesystem within 10 minutes if doing this on purpose.
> 
> That's a little complex.
> To get into such a situation, snapshots must be used, and one must know
> which file extents are shared and how they're shared.
> 
> But yes, it's possible.
> 
> On the other hand, XFS, which also supports reflink, handles this
> quite well, so I'm wondering if it's possible for btrfs to follow its
> behavior.
> 
>>
>>>> 3. I guess there aren't, so how could I accomplish my goal, i.e.
>>>>    reclaiming space that was lost due to fragmentation, without breaking
>>>>    snapshotted CoW where it would be not only pointless, but actually harmful?
>>>
>>> What about using old kernel, like v4.13?
>>
>> Unfortunately (I guess you had 3.13 in mind), I need the newer ones and
>> will be pushing towards 4.14.
> 
> No, I really mean v4.13.

My fault, it is v3.13.

What a stupid error...

> 
> From btrfs(5):
> ---
>        Warning
>        Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2,
>        as well as with Linux stable kernel versions ≥ 3.10.31,
>        ≥ 3.12.12 or ≥ 3.13.4, will break up the ref-links of CoW data
>        (for example files copied with cp --reflink, snapshots or
>        de-duplicated data). This may cause considerable increase of
>        space usage depending on the broken up ref-links.
> ---
> 
>>
>>>> 4. How can I prevent this from happening again? All the files that are
>>>>    written to constantly (a stats collector here, PostgreSQL databases
>>>>    and logs on other machines) are marked with nocow (+C); maybe some
>>>>    new attribute to mark a file as autodefrag? +t?
>>>
>>> Unfortunately, nocow only works if there is no other subvolume/inode
>>> referring to it.
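>>>
>>> (For reference, a sketch of setting it - the path is only an example;
>>> +C must be applied while a file is still empty, e.g. by setting it on
>>> the directory so that new files inherit it:
>>>
>>>   chattr +C /var/lib/pam_abl
>>>   lsattr -d /var/lib/pam_abl   # shows 'C' when the flag is set
>>> )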
>>
>> This shouldn't be the case for me anymore after defrag (== breaking links).
>> I guess there's no easy way to check the refcounts of the blocks?
> 
> No easy way, unfortunately.
> It's either time consuming (the method qgroup uses) or complex (manually
> searching the trees and doing the backref walk by yourself).
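>
> If you can afford the time consuming route, a minimal sketch using
> qgroups (the mount point is a placeholder):
>
>   btrfs quota enable /mnt
>   btrfs quota rescan -w /mnt   # the rescan is the expensive part
>   btrfs qgroup show /mnt       # the 'excl' column is exclusive bytes
>
> That gives per-subvolume exclusive numbers, not per-block refcounts,
> but it's the closest ready-made tool.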
> 
>>
>>> But in my understanding, btrfs is not suitable for such a conflicting
>>> situation, where you want snapshots of frequent partial updates.
>>>
>>> IIRC, btrfs is better for use cases where either updates are less
>>> frequent, or an update replaces the whole file, not just part of it.
>>>
>>> So btrfs is good for a root filesystem like /etc and /usr (and /bin
>>> and /lib, which point to /usr/bin and /usr/lib), but not for /var or /run.
>>
>> That is consistent with my conclusions after 2 years on btrfs;
>> however, I didn't expect a single file to eat 1000 times more space
>> than it should...
>>
>>
>> I wonder how many other filesystems were trashed like this - I'm short
>> ~10 GB on another system, and many other users might be affected by this
>> (telling the Internet stories about btrfs running out of space).
> 
> Firstly, no other filesystem supports snapshots.
> So it's pretty hard to get a baseline.
> 
> But as I mentioned, XFS supports reflink, which means file extents can
> be shared between several inodes.
> 
> From what I heard from the XFS guys, they free any unused space of a
> file extent, so it should handle this quite well.
> 
> But that's quite hard to achieve in btrfs; it needs years of
> development at least.
> 
>>
>> It is not a problem that I need to defrag a file; the problem is that
>> I don't know:
>> 1. whether I need to defrag,
>> 2. *what* I should defrag,
>> nor do I have a tool that would defrag smartly - only the exclusive
>> data or, in general, only the blocks worth defragging, i.e. where the
>> space released from extents is greater than the space lost to
>> inter-snapshot duplication.
>>
>> I can't just defrag the entire filesystem, since that breaks links
>> with snapshots. This change was a real deal-breaker here...
> 
> IIRC it would be better to add an option to make defrag snapshot-aware
> (don't break snapshot sharing, but only defrag exclusive data).
> 
> Thanks,
> Qu
> 
>>
>> Any way to feed the deduplication code with snapshots maybe? The
>> directories and files are in the same layout, so this could be
>> fast-tracked for checking and deduplication.
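>>
>> For instance, could the out-of-band dedupe (the extent-same ioctl) be
>> pointed at two snapshots, something like (paths are placeholders):
>>
>>   duperemove -dr /snapshots/snap-a /snapshots/snap-b
>>
>> or is hashing everything up front already too slow for that?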
>>
> 



Thread overview: 32+ messages
2017-12-01 16:15 exclusive subvolume space missing Tomasz Pala
2017-12-01 21:27 ` Duncan
2017-12-01 21:36 ` Hugo Mills
2017-12-02  0:53   ` Tomasz Pala
2017-12-02  1:05     ` Qu Wenruo
2017-12-02  1:43       ` Tomasz Pala
2017-12-02  2:17         ` Qu Wenruo
2017-12-02  2:56     ` Duncan
2017-12-02 16:28     ` Tomasz Pala
2017-12-02 17:18       ` Tomasz Pala
2017-12-03  1:45         ` Duncan
2017-12-03 10:47           ` Adam Borowski
2017-12-04  5:11             ` Chris Murphy
2017-12-10 10:49           ` Tomasz Pala
2017-12-04  4:58     ` Chris Murphy
2017-12-02  0:27 ` Qu Wenruo
2017-12-02  1:23   ` Tomasz Pala
2017-12-02  1:47     ` Qu Wenruo
2017-12-02  2:21       ` Tomasz Pala
2017-12-02  2:35         ` Qu Wenruo
2017-12-02  9:33           ` Tomasz Pala
2017-12-04  0:34             ` Qu Wenruo
2017-12-10 11:27               ` Tomasz Pala
2017-12-10 15:49                 ` Tomasz Pala
2017-12-10 23:44                 ` Qu Wenruo
2017-12-11  0:24                   ` Qu Wenruo [this message]
2017-12-11 11:40                   ` Tomasz Pala
2017-12-12  0:50                     ` Qu Wenruo
2017-12-15  8:22                       ` Tomasz Pala
2017-12-16  3:21                         ` Duncan
2017-12-05 18:47   ` How exclusive in parent qgroup is computed? (was: Re: exclusive subvolume space missing) Andrei Borzenkov
2017-12-05 23:57     ` How exclusive in parent qgroup is computed? Qu Wenruo
