On 2017-12-11 19:40, Tomasz Pala wrote:
> On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:
>
>>> I could debug something before I'll clean this up, is there anything you
>>> want me to check/know about the files?
>>
>> fiemap result along with btrfs dump-tree -t2 result.
>
> fiemap attached, but dump-tree requires unmounted fs, doesn't it?

It doesn't. You can dump your tree with the fs mounted, although it may
affect the accuracy.

The good news is, in your case, it doesn't really need the extent tree,
as there is no shared extent here.

>>> - I've lost 3.6 GB during the night with a reasonably small
>>> amount of writes, I guess it might be possible to trash the entire
>>> filesystem within 10 minutes if doing this on purpose.
>>
>> That's a little complex.
>> To get into such a situation, snapshots must be used and one must know
>> which file extent is shared and how it's shared.
>
> A hostile user might assume that any of his own files old enough were
> being snapshotted. Unless snapshots are not used at all...
>
> The 'obvious' solution would be for quotas to limit the data size including
> extents lost due to fragmentation, but this is not the real solution as
> users don't care about fragmentation. So we're back to square one.
>
>> But as I mentioned, XFS supports reflink, which means file extents can be
>> shared between several inodes.
>>
>> From the message I got from the XFS guys, they free any unused space of a
>> file extent, so it should handle this quite well.
>
> Forgive my ignorance, as I'm not familiar with the details, but isn't the
> problem 'solvable' by reusing space freed from the same extent for any
> single (i.e. the same) inode?

Not that easy. The extent tree design makes it a little tricky to do that.
So btrfs uses the current extent booking, the laziest way to delete extents.

> This would certainly increase fragmentation of a file, but reduce extent
> usage significantly.
>
> Still, I don't comprehend the cause of my situation. If - after doing a
> defrag (after snapshotting whatever was already trashed) - btrfs decides
> to allocate new extents for the file, why doesn't it use them efficiently
> as long as I'm not doing snapshots anymore?

Even without snapshots, things can easily go crazy.

This writes a 128M file (the max btrfs file extent size) and flushes it
to disk:

# xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file

Then overwrite the 1M~128M range:

# xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file

Now guess the real disk usage: it's 127M + 128M = 255M.

The point here is that as long as there is any reference to a file
extent, the whole extent won't be freed, even if only 1M of a 128M
extent is still referenced.

Defrag will basically read out the whole 128M file and rewrite it,
which is essentially the same as:

# dd if=/mnt/btrfs/file of=/mnt/btrfs/file2
# rm /mnt/btrfs/file

In this case a new 128M file extent is created, while the old 128M and
127M extents lose all their references, so they can be freed. As a
result, the defrag frees 127M.

> I'm attaching the second fiemap, the same file from the last snapshot
> taken. According to this one-liner:
>
> for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done
>
> the current file doesn't share any physical locations with the old one.
> But it still grows, so what does this situation have to do with
> snapshots anyway?

In your fiemap, all your file extents are exclusive, so this is not
really related to snapshots.

But the file is very fragmented. Most of the extents are 4K sized, with
several 8K ones, and the final extent is 220K sized.

Are you pre-allocating the file before writing, using a tool like dd?
If so, just as I explained above, that will at least *DOUBLE* the
on-disk space usage and cause tons of fragments.

It's recommended to use fallocate to preallocate the file instead of
something like dd. (A preallocated range acts much like nocow, although
only for the first write.)

And if possible, use nocow for this file, as sketched below.
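For example, something like this (just a sketch; "file2" is an example
name, and note that chattr +C only takes effect on new or empty files,
so set it before any data is written or preallocated):

# touch /mnt/btrfs/file2
# chattr +C /mnt/btrfs/file2
# fallocate -l 128M /mnt/btrfs/file2

Keep in mind that nocow also disables checksumming and compression for
that file, and a snapshot will still force one CoW on the first write
to each range after the snapshot is taken.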
> Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
> occupied per extent. How is that possible?

Appending small writes with frequent fsync, or small random DIO.

Avoid such patterns, or at least use nocow. Also avoid using dd to
preallocate the file. Another solution is autodefrag, but I doubt its
effect.

Thanks,
Qu