On May 27, 2020, at 4:56 AM, Reindl Harald wrote:
> On 27.05.20 at 12:32, Lukas Czerner wrote:
>> On Wed, May 27, 2020 at 12:11:52PM +0200, Reindl Harald wrote:
>>>
>>> On 27.05.20 at 11:57, Lukas Czerner wrote:
>>>> On Wed, May 27, 2020 at 11:32:02AM +0200, Reindl Harald wrote:
>>>>> would that also fix the issue that *way too much* is trimmed all the
>>>>> time, no matter if it's a thin provisioned vmware disk or a physical
>>>>> RAID10 with SSD
>>>>
>>>> no, the mechanism remains the same, but the proposal is to make it
>>>> persistent across re-mounts.
>>>>
>>>>>
>>>>> there is no way 315 MB of deletes happened within 2 hours or so on a
>>>>> system with just 485M used
>>>>
>>>> The reason is that we're working on block group granularity. So if you
>>>> have an almost free block group and you free some blocks from it, the
>>>> flag gets cleared and the next time you run fstrim it'll trim all the
>>>> free space in the group. Then again, if you free some blocks from the
>>>> group, the flag gets cleared again ...
>>>>
>>>> But I don't think this is a problem at all. Certainly not worth tracking
>>>> free/trimmed extents to solve it.
>>>
>>> it is a problem
>>>
>>> on a daily "fstrim -av" you trim gigabytes of already trimmed blocks,
>>> which for example on a vmware thin provisioned vdisk makes it down to
>>> CBT (changed-block-tracking)
>>>
>>> so instead of being completely ignored thanks to CBT, that untouched
>>> space is considered changed and verified in the follow-up backup run,
>>> which takes magnitudes longer than needed
>>
>> Looks like you identified the problem then ;)
>
> well, in a perfect world.....
>
>> But seriously, trim/discard was always considered advisory and the
>> storage is completely free to do whatever it wants to do with the
>> information. It might even be the case that the discard requests are
>> ignored and we might not even need an optimization like this. But
>> regardless, it does take time to go through the block groups and as a
>> result this optimization is useful in the fs itself.
>
> luckily, at least fstrim is non-blocking in a vmware environment; on my
> physical box it takes ages
>
> this machine *does nothing* but wait to be cloned; 235 MB of supposedly
> deleted data within 50 minutes is absurd on a completely idle guest
>
> so even though I am all in for optimizations, that's way over the top
>
> [root@master:~]$ fstrim -av
> /boot: 0 B (0 bytes) trimmed on /dev/sda1
> /: 235.8 MiB (247201792 bytes) trimmed on /dev/sdb1
>
> [root@master:~]$ df
> Filesystem     Type  Size  Used Avail Use% Mounted on
> /dev/sdb1      ext4  5.8G  502M  5.3G   9% /
> /dev/sda1      ext4  485M   39M  443M   9% /boot

I don't think that this patch will necessarily fix the problem you are
seeing, in the sense that WAS_TRIMMED was previously stored in the group
descriptor in memory, so repeated fstrim runs _shouldn't_ result in the
group being trimmed unless it had some blocks freed.  If your image has
even one block freed in any group, then fstrim will result in all of the
free blocks in that group being trimmed again.

That said, I think a follow-on optimization would be to limit *clearing*
the WAS_TRIMMED flag until at least some minimum number of blocks have
been freed (e.g. at least 1024 blocks freed, or the group is "empty", or
similar).  That would avoid excess TRIM calls down to the storage when
only a few blocks were freed, which would be unlikely to actually result
in any erase blocks being freed.  This size could be dynamic based on the
minimum trim size passed by fstrim (e.g. min(1024 * minblocks, 16MB)).
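As a rough user-space model of that clearing threshold, and not actual
ext4 code (all of the names below are invented for illustration), the
idea could look something like this:

	/*
	 * Toy model: only clear a group's "was trimmed" state once enough
	 * blocks have been freed since the last trim to plausibly free an
	 * erase block.  Threshold is min(1024 * minblocks, 16MB of blocks).
	 */
	#include <stdbool.h>
	#include <stdio.h>

	#define BLOCK_SIZE		4096u
	#define CLEAR_BLOCKS_CAP	(16u * 1024 * 1024 / BLOCK_SIZE)

	struct group_state {
		bool		was_trimmed;		/* trimmed, nothing freed since */
		unsigned int	freed_since_trim;	/* blocks freed since last trim */
	};

	static unsigned int clear_threshold(unsigned int fstrim_minblocks)
	{
		unsigned int thresh = 1024u * fstrim_minblocks;

		return thresh < CLEAR_BLOCKS_CAP ? thresh : CLEAR_BLOCKS_CAP;
	}

	/* Called when "count" blocks are freed in the group. */
	static void group_blocks_freed(struct group_state *grp, unsigned int count,
				       unsigned int fstrim_minblocks)
	{
		grp->freed_since_trim += count;
		if (grp->freed_since_trim >= clear_threshold(fstrim_minblocks))
			grp->was_trimmed = false;	/* eligible for trim again */
	}

	/* Modeled fstrim pass over one group; returns true if it was trimmed. */
	static bool group_maybe_trim(struct group_state *grp)
	{
		if (grp->was_trimmed)
			return false;			/* skip, nothing new freed */
		grp->was_trimmed = true;
		grp->freed_since_trim = 0;
		return true;
	}

	int main(void)
	{
		struct group_state grp = { .was_trimmed = true };
		unsigned int minblocks = 1;		/* what fstrim -m would pass */

		group_blocks_freed(&grp, 10, minblocks);	/* small free */
		printf("after 10 freed blocks: trim? %d\n", group_maybe_trim(&grp));
		group_blocks_freed(&grp, 2000, minblocks);	/* crosses threshold */
		printf("after 2000 more:       trim? %d\n", group_maybe_trim(&grp));
		return 0;
	}

With minblocks = 1 the threshold works out to 1024 blocks, so the first
small free leaves the flag set and only the larger free makes the group
eligible for trimming again.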
That would also fit in nicely with changing "-o discard" over to using
per-block-group tracking of the trim state, using the same mechanism as
fstrim, rather than the per-extent tracking used today, which (IMHO)
causes too many small trims to the storage to be useful.  Not only do
many small trim requests cause overhead during IO, they can also miss
actual trims because the individual extents are smaller than the discard
size of the storage, and the per-extent code doesn't combine the trim
range with adjacent free blocks.

Doing the block-group-based trim in the background, when a group has had
a bunch of blocks freed and the filesystem is idle (maybe a percpu
counter that is incremented whenever user read/write requests are done,
which can be checked for idleness when there are groups to be trimmed),
doesn't have to be perfect or totally crash proof, since there will
always be another chance to trim the group the next time some blocks are
freed, or if a manual "fstrim" is called on the group with a smaller
minlen.

Cheers, Andreas
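Purely as an illustration of the idle check described above (invented
names, not ext4 code), a minimal user-space model might look like:

	/*
	 * Toy model: a counter bumped on user I/O, and a periodic worker
	 * that only trims dirty groups when the counter has not moved since
	 * the previous pass, i.e. the filesystem looks idle.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_GROUPS 4

	static atomic_ulong user_io_events;		/* bumped on user read/write */
	static bool group_needs_trim[NR_GROUPS];	/* groups with freed blocks */

	/* Would be called from the user read/write path. */
	static void note_user_io(void)
	{
		atomic_fetch_add(&user_io_events, 1);
	}

	/* Periodic worker: trim pending groups only if no user I/O happened
	 * since the previous invocation. */
	static void background_trim_tick(void)
	{
		static unsigned long last_seen;
		unsigned long now = atomic_load(&user_io_events);

		if (now != last_seen) {			/* not idle, try later */
			last_seen = now;
			return;
		}

		for (int i = 0; i < NR_GROUPS; i++) {
			if (!group_needs_trim[i])
				continue;
			printf("trimming group %d\n", i);	/* stand-in for real trim */
			group_needs_trim[i] = false;
		}
	}

	int main(void)
	{
		group_needs_trim[1] = true;
		group_needs_trim[3] = true;

		note_user_io();			/* simulate foreground activity */
		background_trim_tick();		/* skipped: counter moved */
		background_trim_tick();		/* idle now: groups 1 and 3 trimmed */
		return 0;
	}

The point is only that the worker skips a pass whenever the counter has
moved since its last run, so trims get deferred until things look quiet;
a real implementation would presumably use a percpu counter and a
workqueue rather than this toy loop, and it can afford to lose state
across a crash for exactly the reason given above.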