On Mon, Aug 30, 2021 at 08:04:57AM +1000, Dave Chinner wrote:
> On Sat, Aug 28, 2021 at 01:58:24PM +1000, Chris Dunlop wrote:
>> On Sat, Aug 28, 2021 at 10:21:37AM +1000, Chris Dunlop wrote:
>>> On Sat, Aug 28, 2021 at 08:03:43AM +1000, Dave Chinner wrote:
>>>> commit fd43cf600cf61c66ae0a1021aca2f636115c7fcb
>>>> Author: Brian Foster
>>>> Date:   Wed Apr 28 15:06:05 2021 -0700
>>>>
>>>>     xfs: set aside allocation btree blocks from block reservation
>>>
>>> Oh wow. Yes, sounds like a candidate. Is there some easy(-ish?) way of
>>> seeing if this fs is likely to be suffering from this particular issue,
>>> or is it a matter of installing an appropriate kernel and seeing if the
>>> problem goes away?
>>
>> Is this sufficient to tell us that this filesystem probably isn't
>> suffering from that issue?
>
> IIRC, it's the per-ag histograms that are more important here
> because we are running out of space in an AG because of
> overcommitting the per-ag space. If there is an AG that is much more
> fragmented than others, then it will be consuming much more in the
> way of freespace btree blocks than others...

Per-ag histograms attached. (A rough sketch of how the per-AG numbers
can be pulled out, along with the AGF counters, is below.)

Do the blocks used by the allocation btrees show up in the AG
histograms? E.g. with an AG like this:

AG 18
   from      to extents   blocks    pct
      1       1    1961     1961   0.01
      2       3   17129    42602   0.11
      4       7   33374   183312   0.48
      8      15   68076   783020   2.06
     16      31  146868  3469398   9.14
     32      63  248690 10614558  27.96
     64     127   32088  2798748   7.37
    128     255    8654  1492521   3.93
    256     511    4227  1431586   3.77
    512    1023    2531  1824377   4.81
   1024    2047    2125  3076304   8.10
   2048    4095    1615  4691302  12.36
   4096    8191    1070  6062351  15.97
   8192   16383     139  1454627   3.83
  16384   32767       2    41359   0.11
total free extents 568549
total free blocks 37968026
average free extent size 66.7806

...it looks like it's significantly fragmented, but, if the allocation
btrees aren't part of this, it seems there's still sufficient free
space that it shouldn't be getting to ENOSPC?

> FWIW, if you are using reflink heavily and you have rmap enabled (as
> you have), there's every chance that an AG has completely run out of
> space and so new rmap records for shared extents can't be allocated
> - that can give you spurious ENOSPC errors before the filesystem is
> 100% full, too.

This doesn't seem to be the case for this fs as we seem to have "free"
space in all the AGs, IFF the allocation btrees aren't included in the
per-AG reported free space.

> i.e. every shared extent in the filesystem has a rmap record
> pointing back to each owner of the shared extent. That means for an
> extent shared 1000 times, there are 1000 rmap records for that
> shared extent. If you share it again, a new rmap record needs to be
> inserted into the rmapbt, and if the AG is completely out of space
> this can fail w/ ENOSPC. Hence you can get ENOSPC errors attempting
> to share or unshare extents because there isn't space in the AG for
> the tracking metadata for the new extent record....

FYI, in this particular fs the reflinks have low owner counts: I think
most of the extents are single owner, and the vast majority (and
perhaps all) of the multi-owner extents have only 2 owners. I don't
think there would be any with more than, say, 3 owners.

Out of interest: if a multi-reflinked extent is reduced down to one
owner, is that extent then removed from the refcount btree?
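In case it helps anyone reproduce the per-AG numbers, something along
the lines of the rough sketch below should work. It's only a sketch: it
assumes xfs_db's 'freesp -a <agno>' and 'agf <agno>' / 'print' commands,
AGCOUNT is a placeholder that needs to come from xfs_info, and I'm
assuming the AGF freeblks/btreeblks counters are the relevant ones to
compare against the histograms.

  #!/bin/bash
  # Rough sketch: per-AG freespace histograms plus the AGF counters.
  # DEV is the device from the earlier output; AGCOUNT is a placeholder,
  # use the agcount reported by xfs_info.
  DEV=/dev/mapper/vg00-chroot
  AGCOUNT=32
  for ag in $(seq 0 $((AGCOUNT - 1))); do
      echo "AG $ag"
      # free space histogram for just this AG
      sudo xfs_db -r -c "freesp -s -a $ag" "$DEV"
      # AGF counters: freeblks is the free space the histogram is built
      # from; btreeblks should (I believe) show the blocks currently
      # held by the AGF btrees
      sudo xfs_db -r -c "agf $ag" -c "print freeblks longest btreeblks" "$DEV"
  done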
>> $ sudo xfs_db -r -c 'freesp -s' /dev/mapper/vg00-chroot
>>    from      to extents     blocks    pct
>>       1       1   74943      74943   0.00
>>       2       3   71266     179032   0.01
>>       4       7  155670     855072   0.04
>>       8      15  304838    3512336   0.17
>>      16      31  613606   14459417   0.72
>>      32      63 1043230   47413004   2.35
>>      64     127 1130921  106646418   5.29
>>     128     255 1043683  188291054   9.34
>>     256     511  576818  200011819   9.93
>>     512    1023  328790  230908212  11.46
>>    1024    2047  194784  276975084  13.75
>>    2048    4095  119242  341977975  16.97
>>    4096    8191   72903  406955899  20.20
>>    8192   16383    5991   67763286   3.36
>>   16384   32767    1431   31354803   1.56
>>   32768   65535     310   14366959   0.71
>>   65536  131071     122   10838153   0.54
>>  131072  262143      87   15901152   0.79
>>  262144  524287      44   17822179   0.88
>>  524288 1048575      16   12482310   0.62
>> 1048576 2097151      14   20897049   1.04
>> 4194304 8388607       1    5213142   0.26
>> total free extents 5738710
>> total free blocks 2014899298
>> average free extent size 351.107
>
> So 5.7M freespace records. Assuming perfect packing that's roughly
> 500 records to a btree block, so at least 10,000 freespace btree
> blocks in the filesystem. But we really need to see the per-ag
> histograms to be able to make any meaningful analysis of the free
> space layout in the filesystem....

See attached for per-ag histograms.

> Context is very important when trying to determine if free space
> fragmentation is an issue or not. Most of the time, it isn't an
> issue at all but people have generally been trained to think "all
> fragmentation is bad" rather than "only worry about fragmentation if
> there is a problem that is directly related to physical allocation
> patterns"...

In this case it's a typical backup application: it uploads regular
incremental files and those are later merged into a full backup file,
either by extending, overwriting or reflinking, depending on whether
the app decides to use reflinks or not. The uploads are sequential and
mostly large-ish writes (132K+), then the merge is small-to-medium-size
random-ish writes or reflinks (4K-???). So the smaller writes/reflinks
are going to create a significant amount of fragmentation. The
incremental files are removed entirely at some later time (no discard
involved).

I guess if it's determined this pattern is critically suboptimal and
causing this errant ENOSPC issue, and the changes in 5.13 don't help,
there's nothing to stop me from occasionally doing a full (non-reflink)
copy of the large full backup files into another file to get them
nicely sequential (rough sketch in the PS below). I'd lose any reflinks
along the way of course, but they don't last a long time anyway (days
to a few weeks) depending on how long the smaller incremental files
are kept.

Cheers,

Chris
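PS: for what it's worth, the "full (non-reflink) copy" idea above would
look something like the sketch below. This is illustrative only: the
file name is made up, cp's --reflink=never forces a plain data copy
rather than a clone, and --sparse=auto keeps any holes sparse.

  # Illustrative only: rewrite a (made-up) full backup file via a plain
  # data copy so its extents get re-allocated afresh, then swap the new
  # copy into place. Any existing reflink sharing is lost.
  cp --reflink=never --sparse=auto full-backup.img full-backup.img.new \
    && mv full-backup.img.new full-backup.img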