linux-xfs.vger.kernel.org archive mirror
* XFS fallocate implementation incorrectly reports ENOSPC
@ 2021-08-26  2:06 Chris Dunlop
  2021-08-26 15:05 ` Eric Sandeen
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2021-08-26  2:06 UTC (permalink / raw)
  To: linux-xfs

Hi,

As reported by Charles Hathaway here (with no resolution):

XFS fallocate implementation incorrectly reports ENOSPC
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

Given this sequence:

fallocate -l 1GB image.img
mkfs.xfs -f image.img
mkdir mnt
mount -o loop ./image.img mnt
fallocate -o 0 -l 700mb mnt/image.img
fallocate -o 0 -l 700mb mnt/image.img

Why does the second fallocate fail with ENOSPC, and is that considered an 
XFS bug?

Ext4 is happy to do the second fallocate without error.

Tested on linux-5.10.60

Background: I'm chasing a mysterious ENOSPC error on an XFS filesystem 
with way more space than the app should be asking for. There are no quotas 
on the fs. Unfortunately it's a third party app and I can't tell what 
sequence is producing the error, but this fallocate issue is a 
possibility.

Cheers,

Chris


* Re: XFS fallocate implementation incorrectly reports ENOSPC
  2021-08-26  2:06 XFS fallocate implementation incorrectly reports ENOSPC Chris Dunlop
@ 2021-08-26 15:05 ` Eric Sandeen
  2021-08-26 20:56   ` Chris Dunlop
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2021-08-26 15:05 UTC (permalink / raw)
  To: Chris Dunlop, linux-xfs



On 8/25/21 9:06 PM, Chris Dunlop wrote:
> Hi,
> 
> As reported by Charles Hathaway here (with no resolution):
> 
> XFS fallocate implementation incorrectly reports ENOSPC
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323
> 
> Given this sequence:
> 
> fallocate -l 1GB image.img
> mkfs.xfs -f image.img
> mkdir mnt
> mount -o loop ./image.img mnt
> fallocate -o 0 -l 700mb mnt/image.img
> fallocate -o 0 -l 700mb mnt/image.img
> 
> Why does the second fallocate fail with ENOSPC, and is that considered an XFS bug?

Interesting.  Off the top of my head, I assume that xfs is not looking at
current file space usage when deciding how much is needed to satisfy the
fallocate request.  While filesystems can return ENOSPC at any time for
any reason, this does seem a bit suboptimal.
  
> Ext4 is happy to do the second fallocate without error.
> 
> Tested on linux-5.10.60
> 
> Background: I'm chasing a mysterious ENOSPC error on an XFS filesystem with way more space than the app should be asking for. There are no quotas on the fs. Unfortunately it's a third party app and I can't tell what sequence is producing the error, but this fallocate issue is a possibility.

Presumably you've tried stracing it and looking for ENOSPC returns from
syscalls?

-Eric


* Re: XFS fallocate implementation incorrectly reports ENOSPC
  2021-08-26 15:05 ` Eric Sandeen
@ 2021-08-26 20:56   ` Chris Dunlop
  2021-08-27  2:55     ` Chris Dunlop
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2021-08-26 20:56 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs

On Thu, Aug 26, 2021 at 10:05:00AM -0500, Eric Sandeen wrote:
> On 8/25/21 9:06 PM, Chris Dunlop wrote:
>>
>> fallocate -l 1GB image.img
>> mkfs.xfs -f image.img
>> mkdir mnt
>> mount -o loop ./image.img mnt
>> fallocate -o 0 -l 700mb mnt/image.img
>> fallocate -o 0 -l 700mb mnt/image.img
>>
>> Why does the second fallocate fail with ENOSPC, and is that considered an XFS bug?
>
> Interesting.  Off the top of my head, I assume that xfs is not looking at
> current file space usage when deciding how much is needed to satisfy the
> fallocate request.  While filesystems can return ENOSPC at any time for
> any reason, this does seem a bit suboptimal.

Yes, I would have thought the second fallocate should be a noop.

>> Background: I'm chasing a mysterious ENOSPC error on an XFS filesystem 
>> with way more space than the app should be asking for. There are no 
>> quotas on the fs. Unfortunately it's a third party app and I can't tell 
>> what sequence is producing the error, but this fallocate issue is a 
>> possibility.
>
> Presumably you've tried stracing it and looking for ENOSPC returns from
> syscalls?

That would be an obvious approach. Unfortunately it's not that easy. The 
problem is associated with one specific client which is out of my control 
so I can't experiment in a controlled environment. The app runs for 
several hours in multiple phases, each with multiple threads, and the 
problem typically occurs in the early hours of the morning after several 
hours of running, so attaching to the correct instance is fraught, and the 
strace output will be voluminous.

Cheers,

Chris


* Re: XFS fallocate implementation incorrectly reports ENOSPC
  2021-08-26 20:56   ` Chris Dunlop
@ 2021-08-27  2:55     ` Chris Dunlop
  2021-08-27  5:49       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2021-08-27  2:55 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs

On Fri, Aug 27, 2021 at 06:56:35AM +1000, Chris Dunlop wrote:
> On Thu, Aug 26, 2021 at 10:05:00AM -0500, Eric Sandeen wrote:
>> On 8/25/21 9:06 PM, Chris Dunlop wrote:
>>>
>>> fallocate -l 1GB image.img
>>> mkfs.xfs -f image.img
>>> mkdir mnt
>>> mount -o loop ./image.img mnt
>>> fallocate -o 0 -l 700mb mnt/image.img
>>> fallocate -o 0 -l 700mb mnt/image.img
>>>
>>> Why does the second fallocate fail with ENOSPC, and is that considered an XFS bug?
>>
>> Interesting.  Off the top of my head, I assume that xfs is not looking at
>> current file space usage when deciding how much is needed to satisfy the
>> fallocate request.  While filesystems can return ENOSPC at any time for
>> any reason, this does seem a bit suboptimal.
>
> Yes, I would have thought the second fallocate should be a noop.

On further reflection, "filesystems can return ENOSPC at any time" is 
certainly something apps need to be prepared for (and in this case, it's 
doing the right thing, by logging the error and aborting), but it's not 
really a "not a bug" excuse for the filesystem in all circumstances (or 
this one?), is it? E.g. a write(fd, buf, 1) returning ENOSPC on a fresh
filesystem would be considered a bug, no?

...or maybe your "suboptimal" was entirely tongue in cheek?

>>> Background: I'm chasing a mysterious ENOSPC error on an XFS 
>>> filesystem with way more space than the app should be asking for. 
>>> There are no quotas on the fs. Unfortunately it's a third party 
>>> app and I can't tell what sequence is producing the error, but 
>>> this fallocate issue is a possibility.
>>
>> Presumably you've tried stracing it and looking for ENOSPC returns from
>> syscalls?
>
> That would be an obvious approach. Unfortunately it's not that easy. 
> The problem is associated with one specific client which is out of my 
> control so I can't experiment in a controlled environment. The app 
> runs for several hours in multiple phases, each with multiple threads, 
> and the problem typically occurs in the early hours of the morning 
> after several hours of running, so attaching to the correct instance 
> is fraught, and the strace output will be voluminous.

I decided to stop being lazy and look into taking the strace option 
further. I can script looking for the right process as it starts up, and 
with judicious use of "-Z" for failed calls only, and filtering out 
commonly failing syscalls (futex, stat etc.), the output volume is reduced 
to just about nothing. This could be the solution - but it'll probably 
take a week or so for it to fail again and see if I can catch what's going 
on.

Thanks for the inspiration / kick in the pants to get this going.

Strace has grown more options since the last time I looked at the man 
page: "-Z" is fantastic!
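
For anyone wanting to try the same approach, here is a minimal sketch of
the idea. It assumes strace 5.2 or later (where -Z, also spelled
--failed-only, is available) and the log layout strace writes with -f and
-o (pid first, then the call). The helper name enospc_calls and the log
path are made up for illustration.

```shell
# Sketch only: capture failed syscalls from a running process, then pick
# out the ones that returned ENOSPC.
#
# Capture step (not run here; needs a real PID and usually root):
#   strace -f -Z -e 'trace=!futex,stat' -o /tmp/app.strace -p "$PID"
#     -f  follow threads and child processes
#     -Z  log only syscalls that failed
#     -e  'trace=!...' skips syscalls that fail routinely

# enospc_calls is a hypothetical helper name. It reads a strace log on
# stdin (lines of the form "<pid> <syscall>(<args>) = -1 ENOSPC ...") and
# prints "<pid> <syscall>" for each ENOSPC failure.
enospc_calls() {
    awk '/= -1 ENOSPC/ {
        name = $2
        sub(/\(.*/, "", name)   # drop everything from the opening paren
        print $1, name
    }'
}

# Usage: enospc_calls < /tmp/app.strace
```

Filtering after the fact keeps the strace command line simple, and the
full log of failed calls stays available if ENOSPC turns out not to be
the whole story.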

Cheers,

Chris


* Re: XFS fallocate implementation incorrectly reports ENOSPC
  2021-08-27  2:55     ` Chris Dunlop
@ 2021-08-27  5:49       ` Dave Chinner
  2021-08-27  6:53         ` Chris Dunlop
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2021-08-27  5:49 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Eric Sandeen, linux-xfs

On Fri, Aug 27, 2021 at 12:55:39PM +1000, Chris Dunlop wrote:
> On Fri, Aug 27, 2021 at 06:56:35AM +1000, Chris Dunlop wrote:
> > On Thu, Aug 26, 2021 at 10:05:00AM -0500, Eric Sandeen wrote:
> > > On 8/25/21 9:06 PM, Chris Dunlop wrote:
> > > > 
> > > > fallocate -l 1GB image.img
> > > > mkfs.xfs -f image.img
> > > > mkdir mnt
> > > > mount -o loop ./image.img mnt
> > > > fallocate -o 0 -l 700mb mnt/image.img
> > > > fallocate -o 0 -l 700mb mnt/image.img
> > > > 
> > > > Why does the second fallocate fail with ENOSPC, and is that considered an XFS bug?
> > > 
> > > Interesting.  Off the top of my head, I assume that xfs is not looking at
> > > current file space usage when deciding how much is needed to satisfy the
> > > fallocate request.  While filesystems can return ENOSPC at any time for
> > > any reason, this does seem a bit suboptimal.
> > 
> > Yes, I would have thought the second fallocate should be a noop.
> 
> On further reflection, "filesystems can return ENOSPC at any time" is
> certainly something apps need to be prepared for (and in this case, it's
> doing the right thing, by logging the error and aborting), but it's not
> really a "not a bug" excuse for the filesystem in all circumstances (or this
> one?), is it? E.g. a write(fd, buf, 1) returning ENOSPC on a fresh
> filesystem would be considered a bug, no?

Sure, but the fallocate case here is different. You're asking to
preallocate up to 700MB of space on a filesystem that only has 300MB
of space free. Up front, without knowing anything about the layout
of the file we might need to allocate 700MB of space into, there's a
very good chance that we'll get ENOSPC partially through the
operation.

The real problem with preallocation failing part way through due to
overcommit of space is that we can't go back and undo the
allocation(s) made by fallocate because when we get ENOSPC we have
lost all the state of the previous allocations made. If fallocate is
filling holes between unwritten extents already in the file, then we
have no way of knowing where the holes we filled were and hence
cannot reliably free the space we've allocated before ENOSPC was
hit.

Hence if we allow the fallocate to go ahead and preallocate space
until we hit ENOSPC, we still end up returning to userspace with
ENOSPC, but we've also consumed all the remaining space in the
filesystem.

So there's a very good argument for simply rejecting any attempt to
preallocate space that has the possibility of over-committing space
and hence hitting ENOSPC part way through. Given that we spend a lot
of effort in XFS to avoid over-committing resources so that ENOSPC
is reliable and not prone to deadlocks, the choice to make fallocate
avoid a potential over-commit is at least internally consistent with
the XFS ENOSPC architecture.

IOWs, either behaviour could be considered a "bug" because it is
sub-optimal behaviour, but at some point you've got to choose what
is the least worst behaviour and run with it.
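
To make the trade-off concrete, here is a toy model of the policy
described above. It is not the XFS code, and the function name and its
two arguments are made up for illustration. It only shows the shape of
the decision: the request length is compared against free space up
front, without looking at what the file already has allocated.

```shell
# Toy model, not kernel code: reject a preallocation request up front if
# its full length could not be satisfied from current free space.
# prealloc_precheck is a hypothetical name.
#   $1 = bytes requested, $2 = bytes currently free
prealloc_precheck() {
    if [ "$1" -gt "$2" ]; then
        echo ENOSPC     # might over-commit, so refuse before allocating
    else
        echo OK
    fi
}

# The second fallocate in the reproducer: 700 MB asked for, roughly
# 300 MB left, even though the file already covers the whole range.
prealloc_precheck 700000000 300000000   # prints ENOSPC
```

Under this model the first fallocate on the fresh 1GB filesystem passes
and the identical second one fails, which matches what was reported.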

> ...or maybe your "suboptimal" was entirely tongue in cheek?
> 
> > > > Background: I'm chasing a mysterious ENOSPC error on an XFS
> > > > filesystem with way more space than the app should be asking
> > > > for. There are no quotas on the fs. Unfortunately it's a third
> > > > party app and I can't tell what sequence is producing the error,
> > > > but this fallocate issue is a possibility.

More likely speculative preallocation is causing this than
fallocate. However, we've had a background worker that cleans up
speculative prealloc before reporting ENOSPC for a while now - what
kernel version are you seeing this on?

Also, it might not even be data allocation that is the issue - if
the filesystem is full and free space is fragmented, you could be
getting ENOSPC because inodes cannot be allocated. In which case,
the output of xfs_info would be useful so we can see if sparse inode
clusters are enabled or not....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS fallocate implementation incorrectly reports ENOSPC
  2021-08-27  5:49       ` Dave Chinner
@ 2021-08-27  6:53         ` Chris Dunlop
  2021-08-27 22:03           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2021-08-27  6:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs

G'day Dave,

On Fri, Aug 27, 2021 at 03:49:56PM +1000, Dave Chinner wrote:
> On Fri, Aug 27, 2021 at 12:55:39PM +1000, Chris Dunlop wrote:
>> On Fri, Aug 27, 2021 at 06:56:35AM +1000, Chris Dunlop wrote:
>>> On Thu, Aug 26, 2021 at 10:05:00AM -0500, Eric Sandeen wrote:
>>>> On 8/25/21 9:06 PM, Chris Dunlop wrote:
>>>>>
>>>>> fallocate -l 1GB image.img
>>>>> mkfs.xfs -f image.img
>>>>> mkdir mnt
>>>>> mount -o loop ./image.img mnt
>>>>> fallocate -o 0 -l 700mb mnt/image.img
>>>>> fallocate -o 0 -l 700mb mnt/image.img
>>>>>
>>>>> Why does the second fallocate fail with ENOSPC, and is that considered an XFS bug?
>>>>
>>>> Interesting.  Off the top of my head, I assume that xfs is not looking at
>>>> current file space usage when deciding how much is needed to satisfy the
>>>> fallocate request.  While filesystems can return ENOSPC at any time for
>>>> any reason, this does seem a bit suboptimal.
>>>
>>> Yes, I would have thought the second fallocate should be a noop.
>>
>> On further reflection, "filesystems can return ENOSPC at any time" is
>> certainly something apps need to be prepared for (and in this case, it's
>> doing the right thing, by logging the error and aborting), but it's not
>> really a "not a bug" excuse for the filesystem in all circumstances (or this
>> one?), is it? E.g. a write(fd, buf, 1) returning ENOSPC on a fresh
>> filesystem would be considered a bug, no?
>
> Sure, but the fallocate case here is different. You're asking to
> preallocate up to 700MB of space on a filesystem that only has 300MB
> of space free. Up front, without knowing anything about the layout
> of the file we might need to allocate 700MB of space into, there's a
> very good chance that we'll get ENOSPC partially through the
> operation.

But I'm not asking for more space - the space is already there:

$ filefrag -v mnt/image.img 
Filesystem type is: ef53
File size of mnt/image.img is 700000000 (170899 blocks of 4096 bytes)
  ext:     logical_offset:        physical_offset: length:   expected: flags:
    0:        0..   30719:      34816..     65535:  30720:             unwritten
    1:    30720..   59391:      69632..     98303:  28672:      65536: unwritten
    2:    59392..  122879:     100352..    163839:  63488:      98304: unwritten
    3:   122880..  170898:     165888..    213906:  48019:     163840: last,unwritten,eof
mnt/image.img: 4 extents found

I.e. the fallocate /could/ potentially look at the existing file and 
say "nothing for me to do here".
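
The "no holes" claim can be checked mechanically from the extent list.
The following is only a sketch of that check in userspace, not a
suggestion of how the kernel would do it; the helper name
fully_allocated is made up, and the block ranges are copied from the
logical_offset column of the filefrag output above.

```shell
# fully_allocated is a hypothetical helper. stdin: one "first last" pair
# of logical block numbers per extent, in file order. $1: number of
# blocks the range should cover.
fully_allocated() {
    awk -v total="$1" '
        BEGIN { expect = 0 }
        $1 != expect { holes++ }     # not contiguous with previous extent
        { expect = $2 + 1 }
        END {
            if (holes == 0 && expect == total) print "fully allocated"
            else print "not fully allocated"
        }'
}

# Logical ranges from the filefrag output (file is 170899 blocks long).
fully_allocated 170899 <<'EOF'
0 30719
30720 59391
59392 122879
122880 170898
EOF
# prints: fully allocated
```

With four extents this is instant; the cost concern only shows up when
the list being walked is very long.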

Of course, that should be pretty easy and quick in this case - but for 
a file with hundreds of thousands of extents and potential holes in 
the midst it would be somewhat less quick and easy. So that's probably 
a good reason for it to fail. Sigh. On the other hand that might be a 
case of "play stupid games, win stupid prizes". On the gripping hand I 
can imagine the emails to the mailing list from people like me asking 
why their "simple" fallocate is taking 20 minutes...

>>>>> Background: I'm chasing a mysterious ENOSPC error on an XFS
>>>>> filesystem with way more space than the app should be asking
>>>>> for. There are no quotas on the fs. Unfortunately it's a third
>>>>> party app and I can't tell what sequence is producing the error,
>>>>> but this fallocate issue is a possibility.
>
> More likely speculative preallocation is causing this than
> fallocate. However, we've had a background worker that cleans up
> speculative prealloc before reporting ENOSPC for a while now - what
> kernel version are you seeing this on?

5.10.60. How long is "a while now"? I vaguely recall something about 
that going through.

> Also, it might not even be data allocation that is the issue - if
> the filesystem is full and free space is fragmented, you could be
> getting ENOSPC because inodes cannot be allocated. In which case,
> the output of xfs_info would be useful so we can see if sparse inode
> clusters are enabled or not....

$ xfs_info /chroot
meta-data=/dev/mapper/vg00-chroot isize=512    agcount=32, agsize=244184192 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=1
          =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=7813893120, imaxpct=5
          =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

It's currently fuller than I like:

$ df /chroot
Filesystem                1K-blocks        Used  Available Use% Mounted on
/dev/mapper/vg00-chroot 31253485568 24541378460 6712107108  79% /chroot

...so that's 6.3T free, but this problem was happening at 71% used (8.5T 
free). The /maximum/ the app could conceivably be asking for is around 
1.1T (to entirely duplicate an existing file), but it really shouldn't 
be doing anywhere near that: I can see it doing write-in-place on the 
existing file and should be asking for modest amounts of extension 
(then again, userland developers, so who knows, right? ;-}).
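
As a sanity check on the figures, the df "1K-blocks" numbers above can
be converted to TiB. The helper name kib_to_tib is made up for
illustration; the two inputs are the Available and 1K-blocks columns
from the df output.

```shell
# kib_to_tib is a hypothetical helper: convert a df "1K-blocks" figure
# to TiB, printed with two decimals.
kib_to_tib() {
    awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / (1024 * 1024 * 1024) }'
}

kib_to_tib 6712107108     # Available column, prints 6.25
kib_to_tib 31253485568    # 1K-blocks column, prints 29.11
```

The 6.3T quoted above is presumably the same 6.25 TiB after df -h style
rounding.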

Oh, another reference: this is extensive reflinking happening on this 
filesystem. I don't know if that's a factor. You may remember my 
previous email relating to that:

Extreme fragmentation ho!
https://www.spinics.net/lists/linux-xfs/msg47707.html

I'm excited by my new stracing script prompted by Eric - at least that 
should tell us what precisely is failing. Shame I'm going to have to 
wait a while for it to trigger.


Cheers,

Chris


* Re: XFS fallocate implementation incorrectly reports ENOSPC
  2021-08-27  6:53         ` Chris Dunlop
@ 2021-08-27 22:03           ` Dave Chinner
  2021-08-28  0:21             ` Mysterious ENOSPC [was: XFS fallocate implementation incorrectly reports ENOSPC] Chris Dunlop
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2021-08-27 22:03 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Eric Sandeen, linux-xfs

On Fri, Aug 27, 2021 at 04:53:47PM +1000, Chris Dunlop wrote:
> G'day Dave,
> 
> On Fri, Aug 27, 2021 at 03:49:56PM +1000, Dave Chinner wrote:
> > On Fri, Aug 27, 2021 at 12:55:39PM +1000, Chris Dunlop wrote:
> > > On Fri, Aug 27, 2021 at 06:56:35AM +1000, Chris Dunlop wrote:
> > > > On Thu, Aug 26, 2021 at 10:05:00AM -0500, Eric Sandeen wrote:
> > > > > On 8/25/21 9:06 PM, Chris Dunlop wrote:
> > > > > > 
> > > > > > fallocate -l 1GB image.img
> > > > > > mkfs.xfs -f image.img
> > > > > > mkdir mnt
> > > > > > mount -o loop ./image.img mnt
> > > > > > fallocate -o 0 -l 700mb mnt/image.img
> > > > > > fallocate -o 0 -l 700mb mnt/image.img
> > > > > > 
> > > > > > Why does the second fallocate fail with ENOSPC, and is that considered an XFS bug?
> > > > > 
> > > > > Interesting.  Off the top of my head, I assume that xfs is not looking at
> > > > > current file space usage when deciding how much is needed to satisfy the
> > > > > fallocate request.  While filesystems can return ENOSPC at any time for
> > > > > any reason, this does seem a bit suboptimal.
> > > > 
> > > > Yes, I would have thought the second fallocate should be a noop.
> > > 
> > > On further reflection, "filesystems can return ENOSPC at any time" is
> > > certainly something apps need to be prepared for (and in this case, it's
> > > doing the right thing, by logging the error and aborting), but it's not
> > > really a "not a bug" excuse for the filesystem in all circumstances (or this
> > > one?), is it? E.g. a write(fd, buf, 1) returning ENOSPC on a fresh
> > > filesystem would be considered a bug, no?
> > 
> > Sure, but the fallocate case here is different. You're asking to
> > preallocate up to 700MB of space on a filesystem that only has 300MB
> > of space free. Up front, without knowing anything about the layout
> > of the file we might need to allocate 700MB of space into, there's a
> > very good chance that we'll get ENOSPC partially through the
> > operation.
> 
> But I'm not asking for more space - the space is already there:

"Up front, without knowing anything about the layout of the file..."

[....]

> Sigh. On the other hand that might be a case of "play stupid
> games, win stupid prizes". On the gripping hand I can imagine the emails to
> the mailing list from people like me asking why their "simple" fallocate is
> taking 20 minutes...

Yup, we have to choose between behaviours people will complain about.
We chose the behaviour that doesn't happen except on really small
filesystems because, in practice, we almost never see production
workloads asking to fallocate() more than half the entire filesystem
capacity at a time.....

> > > > > > Background: I'm chasing a mysterious ENOSPC error on an XFS
> > > > > > filesystem with way more space than the app should be asking
> > > > > > for. There are no quotas on the fs. Unfortunately it's a third
> > > > > > party app and I can't tell what sequence is producing the error,
> > > > > > but this fallocate issue is a possibility.
> > 
> > More likely speculative preallocation is causing this than
> > fallocate. However, we've had a background worker that cleans up
> > speculative prealloc before reporting ENOSPC for a while now - what
> > kernel version are you seeing this on?
> 
> 5.10.60. How long is "a while now"? I vaguely recall something about that
> going through.

Longer than that.

> > Also, it might not even be data allocation that is the issue - if
> > the filesystem is full and free space is fragmented, you could be
> > getting ENOSPC because inodes cannot be allocated. In which case,
> > the output of xfs_info would be useful so we can see if sparse inode
> > clusters are enabled or not....
> 
> $ xfs_info /chroot
> meta-data=/dev/mapper/vg00-chroot isize=512    agcount=32, agsize=244184192 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=0 inobtcount=0
> data     =                       bsize=4096   blocks=7813893120, imaxpct=5
>          =                       sunit=128    swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> It's currently fuller than I like:
> 
> $ df /chroot
> Filesystem                1K-blocks        Used  Available Use% Mounted on
> /dev/mapper/vg00-chroot 31253485568 24541378460 6712107108  79% /chroot
> 
> ...so that's 6.3T free, but this problem was happening at 71% used (8.5T free).
> The /maximum/ the app could conceivably be asking for is around 1.1T (to
> entirely duplicate an existing file), but it really shouldn't be doing
> anywhere near that: I can see it doing write-in-place on the existing file
> and should be asking for modest amounts of extension (then again, userland
> developers, so who knows, right? ;-}).
> 
> Oh, another reference: this is extensive reflinking happening on this
> filesystem. I don't know if that's a factor. You may remember my previous
> email relating to that:
> 
> Extreme fragmentation ho!
> https://www.spinics.net/lists/linux-xfs/msg47707.html

Ah. Details that are likely extremely important. The workload,
layout problems and ephemeral ENOSPC symptoms match the description
of the problem that was fixed by the series of commits that went
into 5.13 that ended in this one:

commit fd43cf600cf61c66ae0a1021aca2f636115c7fcb
Author: Brian Foster <bfoster@redhat.com>
Date:   Wed Apr 28 15:06:05 2021 -0700

    xfs: set aside allocation btree blocks from block reservation
    
    The blocks used for allocation btrees (bnobt and countbt) are
    technically considered free space. This is because as free space is
    used, allocbt blocks are removed and naturally become available for
    traditional allocation. However, this means that a significant
    portion of free space may consist of in-use btree blocks if free
    space is severely fragmented.
    
    On large filesystems with large perag reservations, this can lead to
    a rare but nasty condition where a significant amount of physical
    free space is available, but the majority of actual usable blocks
    consist of in-use allocbt blocks. We have a record of a (~12TB, 32
    AG) filesystem with multiple AGs in a state with ~2.5GB or so free
    blocks tracked across ~300 total allocbt blocks, but effectively at
    100% full because the free space is entirely consumed by
    refcountbt perag reservation.
    
    Such a large perag reservation is by design on large filesystems.
    The problem is that because the free space is so fragmented, this AG
    contributes the 300 or so allocbt blocks to the global counters as
    free space. If this pattern repeats across enough AGs, the
    filesystem lands in a state where global block reservation can
    outrun physical block availability. For example, a streaming
    buffered write on the affected filesystem continues to allow delayed
    allocation beyond the point where writeback starts to fail due to
    physical block allocation failures. The expected behavior is for the
    delalloc block reservation to fail gracefully with -ENOSPC before
    physical block allocation failure is a possibility.
    
    To address this problem, set aside in-use allocbt blocks at
    reservation time and thus ensure they cannot be reserved until truly
    available for physical allocation. This allows alloc btree metadata
    to continue to reside in free space, but dynamically adjusts
    reservation availability based on internal state. Note that the
    logic requires that the allocbt counter is fully populated at
    reservation time before it is fully effective. We currently rely on
    the mount time AGF scan in the perag reservation initialization code
    for this dependency on filesystems where it's most important (i.e.
    with active perag reservations).
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
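
The accounting change the commit message describes can be pictured with
a toy model. This is not the kernel code; the function name and the
numbers are made up for illustration. The point is only that blocks
currently holding allocation btree records are subtracted before a
reservation is granted, even though the global counter still reports
them as free.

```shell
# Toy model of the set-aside, not the kernel code. All names and
# numbers here are hypothetical.
#   $1 = blocks the global free counter reports
#   $2 = blocks currently in use by the allocation btrees (bnobt/countbt)
reservable_blocks() {
    avail=$(( $1 - $2 ))
    if [ "$avail" -lt 0 ]; then avail=0; fi
    echo "$avail"
}

reservable_blocks 1000 300    # prints 700
```

Without the subtraction the model would hand out all 1000 blocks, which
is the "reservation can outrun physical availability" case in the
commit message.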

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Mysterious ENOSPC [was: XFS fallocate implementation incorrectly reports ENOSPC]
  2021-08-27 22:03           ` Dave Chinner
@ 2021-08-28  0:21             ` Chris Dunlop
  2021-08-28  3:58               ` Chris Dunlop
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2021-08-28  0:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs

On Sat, Aug 28, 2021 at 08:03:43AM +1000, Dave Chinner wrote:
> On Fri, Aug 27, 2021 at 04:53:47PM +1000, Chris Dunlop wrote:
>> On 8/25/21 9:06 PM, Chris Dunlop wrote:
>>> Background: I'm chasing a mysterious ENOSPC error on an XFS
>>> filesystem with way more space than the app should be asking
>>> for. There are no quotas on the fs. Unfortunately it's a third
>>> party app and I can't tell what sequence is producing the error,
>>> but this fallocate issue is a possibility.
>>
>> Oh, another reference: this is extensive reflinking happening on this
>> filesystem.
>
> Ah. Details that are likely extremely important. The workload,
> layout problems and ephemeral ENOSPC symptoms match the description
> of the problem that was fixed by the series of commits that went
> into 5.13 that ended in this one:
>
> commit fd43cf600cf61c66ae0a1021aca2f636115c7fcb
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Wed Apr 28 15:06:05 2021 -0700
>
>    xfs: set aside allocation btree blocks from block reservation

Oh wow. Yes, sounds like a candidate. Is there some easy(-ish?) way of 
seeing if this fs is likely to be suffering from this particular issue or 
is it a matter of installing an appropriate kernel and seeing if the 
problem goes away?

The job getting this ENOSPC error is one of 45 similar jobs, and it's the 
only one getting the error. There doesn't seem to be anything special 
about this job: its main file where the writes are going is the 9th 
largest (up to 1.8T), and it has a lot of extents (842G split into 750M 
extents) but not as many as some others (e.g. 809G split into 1G extents).  
That said, the app works in mysterious ways so this particular job may be 
a special snowflake in some unobvious manner.

Cheers,

Chris


* Re: Mysterious ENOSPC [was: XFS fallocate implementation incorrectly reports ENOSPC]
  2021-08-28  0:21             ` Mysterious ENOSPC [was: XFS fallocate implementation incorrectly reports ENOSPC] Chris Dunlop
@ 2021-08-28  3:58               ` Chris Dunlop
  2021-08-29 22:04                 ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2021-08-28  3:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs

On Sat, Aug 28, 2021 at 10:21:37AM +1000, Chris Dunlop wrote:
> On Sat, Aug 28, 2021 at 08:03:43AM +1000, Dave Chinner wrote:
>> commit fd43cf600cf61c66ae0a1021aca2f636115c7fcb
>> Author: Brian Foster <bfoster@redhat.com>
>> Date:   Wed Apr 28 15:06:05 2021 -0700
>>
>>   xfs: set aside allocation btree blocks from block reservation
>
> Oh wow. Yes, sounds like a candidate. Is there some easy(-ish?) way of 
> seeing if this fs is likely to be suffering from this particular issue 
> or is it a matter of installing an appropriate kernel and seeing if 
> the problem goes away?

Is this sufficient to tell us that this filesystem probably isn't suffering 
from that issue?

$ sudo xfs_db -r -c 'freesp -s' /dev/mapper/vg00-chroot
    from      to extents  blocks    pct
       1       1   74943   74943   0.00
       2       3   71266  179032   0.01
       4       7  155670  855072   0.04
       8      15  304838 3512336   0.17
      16      31  613606 14459417   0.72
      32      63 1043230 47413004   2.35
      64     127 1130921 106646418   5.29
     128     255 1043683 188291054   9.34
     256     511  576818 200011819   9.93
     512    1023  328790 230908212  11.46
    1024    2047  194784 276975084  13.75
    2048    4095  119242 341977975  16.97
    4096    8191   72903 406955899  20.20 
    8192   16383    5991 67763286   3.36
   16384   32767    1431 31354803   1.56
   32768   65535     310 14366959   0.71 
   65536  131071     122 10838153   0.54 
  131072  262143      87 15901152   0.79
  262144  524287      44 17822179   0.88
  524288 1048575      16 12482310   0.62
1048576 2097151      14 20897049   1.04
4194304 8388607       1 5213142   0.26
total free extents 5738710
total free blocks 2014899298
average free extent size 351.107
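
The "average free extent size" line is simply total free blocks divided
by total free extents, and it can be recomputed from the histogram rows.
A minimal sketch, assuming the five-column layout shown above; the
helper name avg_free_extent is made up for illustration.

```shell
# avg_free_extent is a hypothetical helper. stdin: output of
# "xfs_db -r -c 'freesp -s' <device>". It sums the extents and blocks
# columns of the histogram rows and prints blocks per extent.
avg_free_extent() {
    awk 'NF == 5 && $1 ~ /^[0-9]+$/ { extents += $3; blocks += $4 }
         END { if (extents > 0) printf "%.3f\n", blocks / extents }'
}

# Usage (needs read access to the device):
#   sudo xfs_db -r -c 'freesp -s' /dev/mapper/vg00-chroot | avg_free_extent

# Cross-check using the two totals reported above:
awk 'BEGIN { printf "%.3f\n", 2014899298 / 5738710 }'   # prints 351.107
```

A single average hides the shape of the distribution, so the histogram
itself is still the more useful thing to look at.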

Or from:

How to tell how fragmented the free space is on an XFS filesystem?
https://www.suse.com/support/kb/doc/?id=000018219

Based on xfs_info "agcount=32":

$ {
   for AGNO in {0..31}; do
     sudo /usr/sbin/xfs_db -r -c "freesp -s -a $AGNO" /dev/mapper/vg00-chroot > /tmp/ag${AGNO}.txt
   done
   grep -h '^average free extent size' /tmp/ag*.txt | sort -k5n | head -n5
   echo --
   grep -h '^average free extent size' /tmp/ag*.txt | sort -k5n | tail -n5
}
average free extent size 66.7806
average free extent size 79.201
average free extent size 80.221
average free extent size 87.595
average free extent size 103.079
--
average free extent size 898.962
average free extent size 906.709
average free extent size 1001.18
average free extent size 1849.23
average free extent size 2782.75

Even those ags with the lowest average free extent size are higher than what 
the web page suggests is "an AG in fairly good shape".
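
As a rough sketch building on the loop above: the saved per-AG files also
contain the "total free extents" and "total free blocks" lines, so the extent
count per AG can be listed alongside the average. This assumes the
/tmp/ag*.txt files written by the loop; the function name summarise_freesp is
made up for illustration.

```shell
# Print "<extents> <blocks> <file>" for each saved `freesp -s` output,
# sorted so the AGs with the most free extents come first.
summarise_freesp() {
    for f in "$@"; do
        awk -v name="$f" '
            /^total free extents/ { extents = $4 }
            /^total free blocks/  { blocks  = $4 }
            END { print extents, blocks, name }
        ' "$f"
    done | sort -k1,1nr
}

# Usage, assuming the per-AG files from the loop above exist:
#   summarise_freesp /tmp/ag*.txt | head -n5
```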

Cheers,

Chris


* Re: Mysterious ENOSPC [was: XFS fallocate implementation incorrectly reports ENOSPC]
  2021-08-28  3:58               ` Chris Dunlop
@ 2021-08-29 22:04                 ` Dave Chinner
  2021-08-30  4:21                   ` Darrick J. Wong
  2021-08-30  7:37                   ` Mysterious ENOSPC Chris Dunlop
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2021-08-29 22:04 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Eric Sandeen, linux-xfs

On Sat, Aug 28, 2021 at 01:58:24PM +1000, Chris Dunlop wrote:
> On Sat, Aug 28, 2021 at 10:21:37AM +1000, Chris Dunlop wrote:
> > On Sat, Aug 28, 2021 at 08:03:43AM +1000, Dave Chinner wrote:
> > > commit fd43cf600cf61c66ae0a1021aca2f636115c7fcb
> > > Author: Brian Foster <bfoster@redhat.com>
> > > Date:   Wed Apr 28 15:06:05 2021 -0700
> > > 
> > >   xfs: set aside allocation btree blocks from block reservation
> > 
> > Oh wow. Yes, sounds like a candidate. Is there some easy(-ish?) way of
> > seeing if this fs is likely to be suffering from this particular issue
> > or is it a matter of installing an appropriate kernel and seeing if the
> > problem goes away?
> 
> Is this sufficient to tell us that this filesystem probably isn't suffering
> from that issue?

IIRC, it's the per-ag histograms that are more important here
because we are running out of space in an AG because of
overcommitting the per-ag space. If there is an AG that is much more
fragmented than others, then it will be consuming much more in the way
of freespace btree blocks than others...

FWIW, if you are using reflink heavily and you have rmap enabled (as
you have), there's every chance that an AG has completely run out of
space and so new rmap records for shared extents can't be allocated
- that can give you spurious ENOSPC errors before the filesystem is
100% full, too.

i.e. every shared extent in the filesystem has a rmap record
pointing back to each owner of the shared extent. That means for an
extent shared 1000 times, there are 1000 rmap records for that
shared extent. If you share it again, a new rmap record needs to be
inserted into the rmapbt, and if the AG is completely out of space
this can fail w/ ENOSPC. Hence you can get ENOSPC errors attempting
to share or unshare extents because there isn't space in the AG for
the tracking metadata for the new extent record....

> $ sudo xfs_db -r -c 'freesp -s' /dev/mapper/vg00-chroot
>    from      to extents  blocks    pct
>       1       1   74943   74943   0.00
>       2       3   71266  179032   0.01
>       4       7  155670  855072   0.04
>       8      15  304838 3512336   0.17
>      16      31  613606 14459417   0.72
>      32      63 1043230 47413004   2.35
>      64     127 1130921 106646418   5.29
>     128     255 1043683 188291054   9.34
>     256     511  576818 200011819   9.93
>     512    1023  328790 230908212  11.46
>    1024    2047  194784 276975084  13.75
>    2048    4095  119242 341977975  16.97
>    4096    8191   72903 406955899  20.20
>    8192   16383    5991 67763286   3.36
>   16384   32767    1431 31354803   1.56
>   32768   65535     310 14366959   0.71
>   65536  131071     122 10838153   0.54
>  131072  262143      87 15901152   0.79
>  262144  524287      44 17822179   0.88
>  524288 1048575      16 12482310   0.62
> 1048576 2097151      14 20897049   1.04
> 4194304 8388607       1 5213142   0.26
> total free extents 5738710
> total free blocks 2014899298
> average free extent size 351.107

So 5.7M freespace records. Assume perfect packing and that's roughly
500 records to a btree block, so at least 10,000 freespace btree
blocks in the filesystem. But we really need to see the per-ag
histograms to be able to make any meaningful analysis of the free
space layout in the filesystem....
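
The arithmetic behind that estimate can be sketched as below. It counts only
leaf blocks for a single btree, and the 500 records per block is the round
figure from the estimate rather than a measured value; the function name
leaf_blocks is made up for illustration.

```shell
# Lower bound on leaf blocks for one free space btree:
# ceil(records / records_per_block).
leaf_blocks() {
    # $1 = free extent records, $2 = assumed records per btree block
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 5738710 is "total free extents" from the freesp output quoted above.
leaf_blocks 5738710 500    # prints 11478
```

XFS keeps two free space btrees per AG (one indexed by block number, one by
extent size), so the filesystem-wide total would be roughly double that
figure, plus interior nodes.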

> Or from:
> 
> How to tell how fragmented the free space is on an XFS filesystem?
> https://www.suse.com/support/kb/doc/?id=000018219
> 
> Based on xfs_info "agcount=32":
> 
> $ {
>   for AGNO in {0..31}; do
>     sudo /usr/sbin/xfs_db -r -c "freesp -s -a $AGNO" /dev/mapper/vg00-chroot > /tmp/ag${AGNO}.txt
>   done
>   grep -h '^average free extent size' /tmp/ag*.txt | sort -k5n | head -n5
>   echo --
>   grep -h '^average free extent size' /tmp/ag*.txt | sort -k5n | tail -n5
> }
> average free extent size 66.7806

Average size by itself isn't actually useful for analysis. The
histogram is what gives us all the necessary information. e.g. this
could be a thousand single block extents and one 65000 block extent
or it could be a million 64k extents. The former is pretty good, the
latter is awful (indicates likely worst case 64kB extent
fragmentation behaviour), because ....

> Even those ags with the lowest average free extent size are higher than what
> the web page suggests is "an AG in fairly good shape".

... the kb article completely glosses over the fact that we really
have to consider the histogram those averages are derived from
before making a judgement on the state of the AG. It equates
"average extent size" with "fragmented AG", when in reality there's
a whole lot more to consider such as number of free extents, the
size of the AG, the amount of free space being indexed, the nature
of the workload and the allocations it requires, etc.
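
To illustrate why the average alone says little, here is a small sketch with
two made-up layouts along the lines of the example above. The second layout
reads "64k extents" as extents of 64 blocks each, which is an assumption made
only for the arithmetic; avg_extent is a hypothetical helper name.

```shell
# Average free extent size = total free blocks / total free extents.
avg_extent() {
    # $1 = total free blocks, $2 = total free extents
    awk -v b="$1" -v e="$2" 'BEGIN { printf "%.2f\n", b / e }'
}

# Layout A: 1000 single-block extents plus one 65000-block extent.
avg_extent $(( 1000 * 1 + 65000 )) $(( 1000 + 1 ))    # prints 65.93

# Layout B: one million extents of 64 blocks each.
avg_extent $(( 1000000 * 64 )) 1000000                # prints 64.00
```

Both averages land close to the 66.7806 reported for the most fragmented AG,
yet the two layouts behave very differently for allocation, which is why the
histogram is needed.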

e.g. I'd consider the "AG greatly fragmented" case given in that KB
article to be perfectly fine if the workload is random 4KB writes
and hole punching to manage space in sparse files (perhaps, say,
lots of raw VM image files and guests have -o discard enabled). In
those cases, there's a huge number of viable allocation candidates
in the free space that can be found quickly and efficiently as
there's no possibility of large contiguous extents being formed for
user data because the IO patterns are small random writes into
sparse files...

Context is very important when trying to determine if free space
fragmentation is an issue or not. Most of the time, it isn't an
issue at all but people have generally been trained to think "all
fragmentation is bad" rather than "only worry about fragmentation if
there is a problem that is directly related to physical allocation
patterns"...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Mysterious ENOSPC [was: XFS fallocate implementation incorrectly reports ENOSPC]
  2021-08-29 22:04                 ` Dave Chinner
@ 2021-08-30  4:21                   ` Darrick J. Wong
  2021-08-30  7:40                     ` Chris Dunlop
  2021-08-30  7:37                   ` Mysterious ENOSPC Chris Dunlop
  1 sibling, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2021-08-30  4:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chris Dunlop, Eric Sandeen, linux-xfs

On Mon, Aug 30, 2021 at 08:04:57AM +1000, Dave Chinner wrote:
> On Sat, Aug 28, 2021 at 01:58:24PM +1000, Chris Dunlop wrote:
> > On Sat, Aug 28, 2021 at 10:21:37AM +1000, Chris Dunlop wrote:
> > > On Sat, Aug 28, 2021 at 08:03:43AM +1000, Dave Chinner wrote:
> > > > commit fd43cf600cf61c66ae0a1021aca2f636115c7fcb
> > > > Author: Brian Foster <bfoster@redhat.com>
> > > > Date:   Wed Apr 28 15:06:05 2021 -0700
> > > > 
> > > >   xfs: set aside allocation btree blocks from block reservation
> > > 
> > > > Oh wow. Yes, sounds like a candidate. Is there some easy(-ish?) way of
> > > seeing if this fs is likely to be suffering from this particular issue
> > > or is it a matter of installing an appropriate kernel and seeing if the
> > > problem goes away?
> > 
> > Is this sufficient to tell us that this filesystem probably isn't suffering
> > from that issue?

Since you've formatted with rmapbt enabled, you probably have a new
enough xfsprogs that you can /also/ use this on a live fs:

$ xfs_spaceman -c 'freesp -g'  /
        AG    extents     blocks
         0       2225    1426437
         1       2201    1716114
         2       2635    1196409
         3       2307    1567751

And if you really want the per-AG histogram...

$ xfs_spaceman -c 'freesp -s -a 2'  /
   from      to extents  blocks    pct
      1       1     262     262   0.02
      2       3     240     551   0.05
      4       7     306    1740   0.15
      8      15     370    4194   0.35
     16      31     563   13286   1.11
     32      63     362   16926   1.41
     64     127     271   22729   1.90
    128     255     112   20234   1.69
    256     511      82   30446   2.54
    512    1023      36   26021   2.17
   1024    2047      20   29074   2.43
   2048    4095       5   13499   1.13
   4096    8191       2    9550   0.80
   8192   16383       1   14484   1.21
  16384   32767       2   50101   4.19
  65536  131071       1   68649   5.74
 524288 1048575       1  874663  73.11
total free extents 2636
total free blocks 1196409
average free extent size 453.873

--D

> IIRC, it's the per-ag histograms that are more important here
> because we are running out of space in an AG because of
> overcommitting the per-ag space. If there is an AG that is much more
> fragmented than others, then it will be consuming much more in the way
> of freespace btree blocks than others...
> 
> FWIW, if you are using reflink heavily and you have rmap enabled (as
> you have), there's every chance that an AG has completely run out of
> space and so new rmap records for shared extents can't be allocated
> - that can give you spurious ENOSPC errors before the filesystem is
> 100% full, too.
> 
> i.e. every shared extent in the filesystem has a rmap record
> pointing back to each owner of the shared extent. That means for an
> extent shared 1000 times, there are 1000 rmap records for that
> shared extent. If you share it again, a new rmap record needs to be
> inserted into the rmapbt, and if the AG is completely out of space
> this can fail w/ ENOSPC. Hence you can get ENOSPC errors attempting
> to share or unshare extents because there isn't space in the AG for
> the tracking metadata for the new extent record....
> 
> > $ sudo xfs_db -r -c 'freesp -s' /dev/mapper/vg00-chroot
> >    from      to extents  blocks    pct
> >       1       1   74943   74943   0.00
> >       2       3   71266  179032   0.01
> >       4       7  155670  855072   0.04
> >       8      15  304838 3512336   0.17
> >      16      31  613606 14459417   0.72
> >      32      63 1043230 47413004   2.35
> >      64     127 1130921 106646418   5.29
> >     128     255 1043683 188291054   9.34
> >     256     511  576818 200011819   9.93
> >     512    1023  328790 230908212  11.46
> >    1024    2047  194784 276975084  13.75
> >    2048    4095  119242 341977975  16.97
> >    4096    8191   72903 406955899  20.20
> >    8192   16383    5991 67763286   3.36
> >   16384   32767    1431 31354803   1.56
> >   32768   65535     310 14366959   0.71
> >   65536  131071     122 10838153   0.54
> >  131072  262143      87 15901152   0.79
> >  262144  524287      44 17822179   0.88
> >  524288 1048575      16 12482310   0.62
> > 1048576 2097151      14 20897049   1.04
> > 4194304 8388607       1 5213142   0.26
> > total free extents 5738710
> > total free blocks 2014899298
> > average free extent size 351.107
> 
> So 5.7M freespace records. Assume perfect packing and that's roughly
> 500 records to a btree block, so at least 10,000 freespace btree
> blocks in the filesystem. But we really need to see the per-ag
> histograms to be able to make any meaningful analysis of the free
> space layout in the filesystem....
> 
> > Or from:
> > 
> > How to tell how fragmented the free space is on an XFS filesystem?
> > https://www.suse.com/support/kb/doc/?id=000018219
> > 
> > Based on xfs_info "agcount=32":
> > 
> > $ {
> >   for AGNO in {0..31}; do
> >     sudo /usr/sbin/xfs_db -r -c "freesp -s -a $AGNO" /dev/mapper/vg00-chroot > /tmp/ag${AGNO}.txt
> >   done
> >   grep -h '^average free extent size' /tmp/ag*.txt | sort -k5n | head -n5
> >   echo --
> >   grep -h '^average free extent size' /tmp/ag*.txt | sort -k5n | tail -n5
> > }
> > average free extent size 66.7806
> 
> Average size by itself isn't actually useful for analysis. The
> histogram is what gives us all the necessary information. e.g. this
> could be a thousand single block extents and one 65000 block extent
> or it could be a million 64k extents. The former is pretty good, the
> latter is awful (indicates likely worst case 64kB extent
> fragmentation behaviour), because ....
> 
> > Even those ags with the lowest average free extent size are higher than what
> > the web page suggests is "an AG in fairly good shape".
> 
> ... the kb article completely glosses over the fact that we really
> have to consider the histogram those averages are derived from
> before making a judgement on the state of the AG. It equates
> "average extent size" with "fragmented AG", when in reality there's
> a whole lot more to consider such as number of free extents, the
> size of the AG, the amount of free space being indexed, the nature
> of the workload and the allocations it requires, etc.
> 
> e.g. I'd consider the "AG greatly fragmented" case given in that KB
> article to be perfectly fine if the workload is random 4KB writes
> and hole punching to manage space in sparse files (perhaps, say,
> lots of raw VM image files and guests have -o discard enabled). In
> those cases, there's a huge number of viable allocation candidates
> in the free space that can be found quickly and efficiently as
> there's no possibility of large contiguous extents being formed for
> user data because the IO patterns are small random writes into
> sparse files...
> 
> Context is very important when trying to determine if free space
> fragmentation is an issue or not. Most of the time, it isn't an
> issue at all but people have generally been trained to think "all
> fragmentation is bad" rather than "only worry about fragmentation if
> there is a problem that is directly related to physical allocation
> patterns"...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: Mysterious ENOSPC
  2021-08-29 22:04                 ` Dave Chinner
  2021-08-30  4:21                   ` Darrick J. Wong
@ 2021-08-30  7:37                   ` Chris Dunlop
  2021-09-02  1:42                     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2021-08-30  7:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs

[-- Attachment #1: Type: text/plain, Size: 6503 bytes --]

On Mon, Aug 30, 2021 at 08:04:57AM +1000, Dave Chinner wrote:
> On Sat, Aug 28, 2021 at 01:58:24PM +1000, Chris Dunlop wrote:
>> On Sat, Aug 28, 2021 at 10:21:37AM +1000, Chris Dunlop wrote:
>>> On Sat, Aug 28, 2021 at 08:03:43AM +1000, Dave Chinner wrote:
>>>> commit fd43cf600cf61c66ae0a1021aca2f636115c7fcb
>>>> Author: Brian Foster <bfoster@redhat.com>
>>>> Date:   Wed Apr 28 15:06:05 2021 -0700
>>>>
>>>>   xfs: set aside allocation btree blocks from block reservation
>>>
>>> Oh wow. Yes, sounds like a candidate. Is there some easy(-ish?) way of
>>> seeing if this fs is likely to be suffering from this particular issue
>>> or is it a matter of installing an appropriate kernel and seeing if the
>>> problem goes away?
>>
>> Is this sufficient to tell us that this filesystem probably isn't suffering
>> from that issue?
>
> IIRC, it's the per-ag histograms that are more important here
> because we are running out of space in an AG because of
> overcommitting the per-ag space. If there is an AG that is much more
> fragmented than others, then it will be consuming much more in the way
> of freespace btree blocks than others...

Per-ag histograms attached.

Do the blocks used by the allocation btrees show up in the AG 
histograms? E.g. with an AG like this:

AG 18
    from      to extents  blocks    pct
       1       1    1961    1961   0.01
       2       3   17129   42602   0.11
       4       7   33374  183312   0.48
       8      15   68076  783020   2.06
      16      31  146868 3469398   9.14
      32      63  248690 10614558  27.96
      64     127   32088 2798748   7.37
     128     255    8654 1492521   3.93
     256     511    4227 1431586   3.77
     512    1023    2531 1824377   4.81
    1024    2047    2125 3076304   8.10
    2048    4095    1615 4691302  12.36
    4096    8191    1070 6062351  15.97
    8192   16383     139 1454627   3.83
   16384   32767       2   41359   0.11
total free extents 568549
total free blocks 37968026
average free extent size 66.7806

...it looks like it's significantly fragmented, but, if the allocation 
btrees aren't part of this, it seems there's still sufficient free 
space that it shouldn't be getting to ENOSPC?
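
One way to put a number on "significantly fragmented" for a histogram like
the one above, as a sketch: sum the blocks held in small free extents and
express that as a share of all free blocks in the AG. The 64-block cut-off
and the function name small_extent_share are arbitrary choices for
illustration, not an established threshold.

```shell
# Percentage of free blocks that sit in extents shorter than a given limit.
# Reads `freesp` histogram rows ("from to extents blocks pct") on stdin;
# the header and the "total ..." summary lines are skipped.
small_extent_share() {
    # $1 = limit in blocks; rows whose "to" bound is below it count as small
    awk -v limit="$1" '
        $1 ~ /^[0-9]+$/ && NF == 5 {
            total += $4
            if ($2 < limit) small += $4
        }
        END { if (total > 0) printf "%.2f\n", 100 * small / total }
    '
}

# The AG 18 histogram from above.
small_extent_share 64 <<'EOF'    # prints 39.76
    from      to extents  blocks    pct
       1       1    1961    1961   0.01
       2       3   17129   42602   0.11
       4       7   33374  183312   0.48
       8      15   68076  783020   2.06
      16      31  146868 3469398   9.14
      32      63  248690 10614558  27.96
      64     127   32088 2798748   7.37
     128     255    8654 1492521   3.93
     256     511    4227 1431586   3.77
     512    1023    2531 1824377   4.81
    1024    2047    2125 3076304   8.10
    2048    4095    1615 4691302  12.36
    4096    8191    1070 6062351  15.97
    8192   16383     139 1454627   3.83
   16384   32767       2   41359   0.11
total free extents 568549
total free blocks 37968026
EOF
```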

> FWIW, if you are using reflink heavily and you have rmap enabled (as
> you have), there's every chance that an AG has completely run out of
> space and so new rmap records for shared extents can't be allocated
> - that can give you spurious ENOSPC errors before the filesystem is
> 100% full, too.

This doesn't seem to be the case for this fs as we seem to have "free" 
space in all the AGs, IFF the allocation btrees aren't included in the 
per-AG reported free space.

> i.e. every shared extent in the filesystem has a rmap record
> pointing back to each owner of the shared extent. That means for an
> extent shared 1000 times, there are 1000 rmap records for that
> shared extent. If you share it again, a new rmap record needs to be
> inserted into the rmapbt, and if the AG is completely out of space
> this can fail w/ ENOSPC. Hence you can get ENOSPC errors attempting
> to share or unshare extents because there isn't space in the AG for
> the tracking metadata for the new extent record....

FYI, in this particular fs the reflinks have low owner counts: I think 
most of the extents are single owner, and the vast majority (and
perhaps all) of the multi-owner extents have only 2 owners. I don't
think there would be any with more than, say, 3 owners.

Out of interest: if a multi-reflinked extent is reduced down to one
owner, is that extent then removed from the reflink btree?

>> $ sudo xfs_db -r -c 'freesp -s' /dev/mapper/vg00-chroot
>>    from      to extents  blocks    pct
>>       1       1   74943   74943   0.00
>>       2       3   71266  179032   0.01
>>       4       7  155670  855072   0.04
>>       8      15  304838 3512336   0.17
>>      16      31  613606 14459417   0.72
>>      32      63 1043230 47413004   2.35
>>      64     127 1130921 106646418   5.29
>>     128     255 1043683 188291054   9.34
>>     256     511  576818 200011819   9.93
>>     512    1023  328790 230908212  11.46
>>    1024    2047  194784 276975084  13.75
>>    2048    4095  119242 341977975  16.97
>>    4096    8191   72903 406955899  20.20
>>    8192   16383    5991 67763286   3.36
>>   16384   32767    1431 31354803   1.56
>>   32768   65535     310 14366959   0.71
>>   65536  131071     122 10838153   0.54
>>  131072  262143      87 15901152   0.79
>>  262144  524287      44 17822179   0.88
>>  524288 1048575      16 12482310   0.62
>> 1048576 2097151      14 20897049   1.04
>> 4194304 8388607       1 5213142   0.26
>> total free extents 5738710
>> total free blocks 2014899298
>> average free extent size 351.107
>
> So 5.7M freespace records. Assume perfect packing and that's roughly
> 500 records to a btree block, so at least 10,000 freespace btree
> blocks in the filesystem. But we really need to see the per-ag
> histograms to be able to make any meaningful analysis of the free
> space layout in the filesystem....

See attached for per-ag histograms.

> Context is very important when trying to determine if free space
> fragmentation is an issue or not. Most of the time, it isn't an
> issue at all but people have generally been trained to think "all
> fragmentation is bad" rather than "only worry about fragmentation if
> there is a problem that is directly related to physical allocation
> patterns"...

In this case it's a typical backup application: it uploads regular 
incremental files and those are later merged into a full backup file, 
either by extending or overwriting or reflinking depending on whether the 
app decides to use reflinks or not. The uploads are sequential and mostly 
large-ish writes (132K+), then the merge is small to medium size randomish 
writes or reflinks (4K-???). So the smaller writes/reflinks are going to 
create a significant amount of fragmentation. The incremental files are 
removed entirely at some later time (no discard involved).

I guess if it's determined this pattern is critically suboptimal and 
causing this errant ENOSPC issue, and the changes in 5.13 don't help, 
there's nothing to stop me from occasionally doing a full (non-reflink) 
copy of the large full backup files into another file to get them nicely 
sequential. I'd lose any reflinks along the way of course, but they don't 
last a long time anyway (days to a few weeks) depending on how long the 
smaller incremental files are kept.


Cheers,

Chris

[-- Attachment #2: freesp-per-ag.txt --]
[-- Type: text/plain, Size: 24313 bytes --]


AG 0
   from      to extents  blocks    pct
      1       1    5215    5215   0.00
      2       3    2095    4778   0.00
      4       7    1870   10786   0.01
      8      15    4696   53963   0.05
     16      31    8157  192210   0.18
     32      63   14912  707014   0.65
     64     127   31799 3052928   2.79
    128     255   62759 11177040  10.22
    256     511   51477 17851082  16.33
    512    1023   31651 22157880  20.26
   1024    2047   15353 21439837  19.61
   2048    4095    6229 17536785  16.04
   4096    8191    2416 13086492  11.97
   8192   16383     144 1500294   1.37
  16384   32767      21  461837   0.42
  32768   65535       3  108715   0.10
total free extents 238797
total free blocks 109346856
average free extent size 457.907

AG 1
   from      to extents  blocks    pct
      1       1    2395    2395   0.00
      2       3     988    2299   0.00
      4       7    2101   12169   0.01
      8      15    4150   46624   0.04
     16      31    9433  228624   0.19
     32      63   16784  784775   0.67
     64     127   28022 2665137   2.26
    128     255   47302 8960792   7.60
    256     511   36831 13355167  11.33
    512    1023   29405 21185337  17.98
   1024    2047   18508 26248652  22.27
   2048    4095    8667 24290355  20.61
   4096    8191    3239 17611659  14.95
   8192   16383     187 1969888   1.67
  16384   32767      16  314583   0.27
  32768   65535       4  161187   0.14
total free extents 208032
total free blocks 117839643
average free extent size 566.45

AG 2
   from      to extents  blocks    pct
      1       1     792     792   0.00
      2       3    1391    3490   0.01
      4       7   10670   57044   0.09
      8      15   13407  156686   0.24
     16      31   20931  491161   0.74
     32      63   37588 1774854   2.69
     64     127   75778 7268636  11.01
    128     255   79386 13883934  21.04
    256     511   44863 15288638  23.17
    512    1023   17691 12086170  18.31
   1024    2047    5478 7497222  11.36
   2048    4095    1470 4030441   6.11
   4096    8191     400 2179114   3.30
   8192   16383      80  907597   1.38
  16384   32767      15  309075   0.47
  32768   65535       1   61508   0.09
total free extents 309941
total free blocks 65996362
average free extent size 212.932

AG 3
   from      to extents  blocks    pct
      1       1      29      29   0.00
      2       3     546    1392   0.00
      4       7    3695   19778   0.04
      8      15    5115   59644   0.11
     16      31    8650  203188   0.39
     32      63   19446  931805   1.78
     64     127   33918 3206707   6.12
    128     255   41472 7109466  13.56
    256     511   17943 6233330  11.89
    512    1023    9813 6800951  12.97
   1024    2047    5172 7319826  13.96
   2048    4095    3156 8975995  17.12
   4096    8191    1841 10198723  19.46
   8192   16383      94 1092849   2.08
  16384   32767      13  267832   0.51
total free extents 150903
total free blocks 52421515
average free extent size 347.386

AG 4
   from      to extents  blocks    pct
      1       1    3456    3456   0.00
      2       3     500    1139   0.00
      4       7     569    3705   0.00
      8      15    4786   55972   0.07
     16      31    9612  224801   0.28
     32      63   19371  919993   1.14
     64     127   34259 3236421   3.99
    128     255   62935 10890988  13.44
    256     511   41997 14096666  17.39
    512    1023   21973 14775110  18.23
   1024    2047    6946 9350039  11.54
   2048    4095    2680 8009016   9.88
   4096    8191    3185 17772132  21.93
   8192   16383     119 1383976   1.71
  16384   32767      14  283682   0.35
  32768   65535       1   37998   0.05
total free extents 212403
total free blocks 81045094
average free extent size 381.563

AG 5
   from      to extents  blocks    pct
      1       1    3724    3724   0.00
      2       3     700    1556   0.00
      4       7     117     604   0.00
      8      15    2299   27842   0.03
     16      31    5116  118536   0.14
     32      63   10153  479910   0.55
     64     127   17573 1659991   1.91
    128     255   48504 8454216   9.74
    256     511   33549 11365292  13.10
    512    1023   16890 11524528  13.28
   1024    2047    8227 11463511  13.21
   2048    4095    4975 14744962  16.99
   4096    8191    4363 24030370  27.69
   8192   16383     235 2623084   3.02
  16384   32767      14  281041   0.32
total free extents 156439
total free blocks 86779167
average free extent size 554.716

AG 6
   from      to extents  blocks    pct
      1       1    1674    1674   0.00
      2       3     355     813   0.00
      4       7    2715   16326   0.03
      8      15    7931   88289   0.19
     16      31   13045  305893   0.65
     32      63   24945 1181407   2.52
     64     127   41067 3842176   8.21
    128     255   32329 6311725  13.49
    256     511   18891 6828744  14.59
    512    1023    9908 6967594  14.89
   1024    2047    4350 6045470  12.92
   2048    4095    2193 6439871  13.76
   4096    8191    1461 7911939  16.90
   8192   16383      68  771093   1.65
  16384   32767       4   92454   0.20
total free extents 160936
total free blocks 46805468
average free extent size 290.833

AG 7
   from      to extents  blocks    pct
      1       1    2619    2619   0.01
      2       3    8411   22859   0.05
      4       7   24615  135628   0.29
      8      15   52117  602928   1.27
     16      31  113925 2689338   5.68
     32      63  130210 5632031  11.89
     64     127   70649 6543643  13.81
    128     255   34479 6648866  14.03
    256     511   12599 4479515   9.45
    512    1023    4811 3368174   7.11
   1024    2047    2103 3005162   6.34
   2048    4095    1634 5032217  10.62
   4096    8191    1350 7678961  16.21
   8192   16383     122 1317350   2.78
  16384   32767      10  221553   0.47
total free extents 459654
total free blocks 47380844
average free extent size 103.079

AG 8
   from      to extents  blocks    pct
      1       1    3356    3356   0.00
      2       3    1201    2823   0.00
      4       7    3239   17470   0.02
      8      15    5367   61388   0.08
     16      31   10163  239796   0.31
     32      63   16501  762075   0.97
     64     127   23574 2217722   2.84
    128     255   25811 4591149   5.87
    256     511   17518 6205779   7.94
    512    1023   11425 8119641  10.38
   1024    2047    8511 12248970  15.67
   2048    4095    6239 17951935  22.96
   4096    8191    3798 21258331  27.19
   8192   16383     298 3316518   4.24
  16384   32767      47 1032837   1.32
  32768   65535       4  159307   0.20
total free extents 137052
total free blocks 78189097
average free extent size 570.507

AG 9
   from      to extents  blocks    pct
      1       1    1658    1658   0.00
      2       3     671    1583   0.00
      4       7    3038   17804   0.02
      8      15    7393   84680   0.08
     16      31   14733  346835   0.34
     32      63   26665 1247101   1.23
     64     127   42681 4044159   3.99
    128     255   27529 5020861   4.95
    256     511   17799 6408209   6.32
    512    1023   13490 9703531   9.57
   1024    2047   10811 15690391  15.48
   2048    4095    8127 23294001  22.98
   4096    8191    4988 27939154  27.56
   8192   16383     408 4660404   4.60
  16384   32767      96 2050928   2.02
  32768   65535      18  852224   0.84
total free extents 180105
total free blocks 101363523
average free extent size 562.802

AG 10
   from      to extents  blocks    pct
      1       1     966     966   0.00
      2       3     746    2040   0.00
      4       7    6188   34186   0.04
      8      15   13586  157397   0.18
     16      31   33591  800147   0.93
     32      63   47124 2046100   2.38
     64     127   17498 1627896   1.90
    128     255   22847 4329426   5.04
    256     511   17337 6308302   7.35
    512    1023   12951 9341029  10.88
   1024    2047    9293 13400500  15.61
   2048    4095    6398 18424400  21.47
   4096    8191    3745 20831876  24.27
   8192   16383     316 3677143   4.28
  16384   32767     120 2682190   3.13
  32768   65535      25 1167323   1.36
  65536  131071      11  992269   1.16
total free extents 192742
total free blocks 85823190
average free extent size 445.275

AG 11
   from      to extents  blocks    pct
      1       1    1299    1299   0.00
      2       3     323     708   0.00
      4       7    3575   19776   0.03
      8      15    7735   89139   0.13
     16      31   17713  420384   0.62
     32      63   26712 1166375   1.73
     64     127   14584 1334644   1.97
    128     255   16415 3058179   4.52
    256     511   13009 4603114   6.81
    512    1023    9239 6592614   9.75
   1024    2047    7659 11116956  16.45
   2048    4095    5350 15410257  22.80
   4096    8191    3331 18748075  27.74
   8192   16383     208 2429051   3.59
  16384   32767      55 1244321   1.84
  32768   65535      12  505639   0.75
  65536  131071      10  855948   1.27
total free extents 127229
total free blocks 67596479
average free extent size 531.298

AG 12
   from      to extents  blocks    pct
      1       1    3005    3005   0.00
      2       3     714    1597   0.00
      4       7     187     945   0.00
      8      15    1305   16178   0.02
     16      31    2951   69749   0.08
     32      63    7235  345651   0.38
     64     127   12451 1169316   1.29
    128     255   17704 3131500   3.47
    256     511   17974 6108480   6.76
    512    1023   12145 8411406   9.31
   1024    2047   10674 15329545  16.97
   2048    4095    8003 22792418  25.23
   4096    8191    4873 27235851  30.14
   8192   16383     362 4212767   4.66
  16384   32767      64 1314405   1.45
  32768   65535       5  212528   0.24
total free extents 99652
total free blocks 90355341
average free extent size 906.709

AG 13
   from      to extents  blocks    pct
      1       1    2160    2160   0.00
      2       3     519    1163   0.00
      4       7     484    3058   0.00
      8      15    2409   27772   0.03
     16      31    4278  100415   0.11
     32      63    9579  457724   0.51
     64     127   16929 1587860   1.76
    128     255   17847 3225558   3.57
    256     511   16287 5626056   6.23
    512    1023   11821 8380892   9.28
   1024    2047   10335 14995088  16.60
   2048    4095    7862 22594531  25.01
   4096    8191    4758 26754739  29.61
   8192   16383     420 4805199   5.32
  16384   32767      71 1495249   1.65
  32768   65535       7  291647   0.32
total free extents 105766
total free blocks 90349111
average free extent size 854.236

AG 14
   from      to extents  blocks    pct
      1       1    1381    1381   0.00
      2       3     245     547   0.00
      4       7     283    1808   0.00
      8      15    2387   27720   0.03
     16      31    4178   97402   0.10
     32      63    8758  415855   0.42
     64     127   15409 1452075   1.46
    128     255   21170 4098561   4.12
    256     511   17815 6425768   6.46
    512    1023   12923 9287495   9.34
   1024    2047   11645 16908465  17.00
   2048    4095    8668 25031077  25.16
   4096    8191    5335 29855758  30.01
   8192   16383     380 4260079   4.28
  16384   32767      72 1495166   1.50
  32768   65535       3  112759   0.11
total free extents 110652
total free blocks 99471916
average free extent size 898.962

AG 15
   from      to extents  blocks    pct
      1       1     207     207   0.00
      2       3     519    1471   0.02
      4       7    1978   10867   0.13
      8      15    3736   42434   0.50
     16      31    6604  154719   1.83
     32      63   13689  653865   7.73
     64     127   24824 2356818  27.86
    128     255   21639 3771966  44.59
    256     511    1990  611208   7.23
    512    1023     157  105129   1.24
   1024    2047      74  107559   1.27
   2048    4095     153  377991   4.47
   4096    8191      27  163987   1.94
   8192   16383       9  101213   1.20
total free extents 75606
total free blocks 8459434
average free extent size 111.888

AG 16
   from      to extents  blocks    pct
      1       1    1140    1140   0.00
      2       3     758    1750   0.00
      4       7    1018    5639   0.01
      8      15    1759   19882   0.02
     16      31    3172   75580   0.08
     32      63    8026  384525   0.40
     64     127   13894 1294275   1.36
    128     255   16508 3126070   3.27
    256     511   13907 5090005   5.33
    512    1023   11460 8377061   8.77
   1024    2047    9811 14258785  14.93
   2048    4095    7805 22582010  23.65
   4096    8191    5407 30571003  32.02
   8192   16383     551 6374167   6.68
  16384   32767     156 3255515   3.41
  32768   65535       2   69263   0.07
total free extents 95374
total free blocks 95486670
average free extent size 1001.18

AG 17
   from      to extents  blocks    pct
      1       1    3555    3555   0.02
      2       3    3384    8530   0.04
      4       7    6883   37356   0.20
      8      15   11694  133428   0.70
     16      31   33349  776859   4.10
     32      63   69828 3159981  16.67
     64     127   78168 7397203  39.01
    128     255   24793 4100867  21.63
    256     511    3287 1064507   5.61
    512    1023     816  550030   2.90
   1024    2047     321  453787   2.39
   2048    4095     167  457272   2.41
   4096    8191      77  412315   2.17
   8192   16383      28  309618   1.63
  16384   32767       4   95252   0.50
total free extents 236354
total free blocks 18960560
average free extent size 80.221

AG 18
   from      to extents  blocks    pct
      1       1    1961    1961   0.01
      2       3   17129   42602   0.11
      4       7   33374  183312   0.48
      8      15   68076  783020   2.06
     16      31  146868 3469398   9.14
     32      63  248690 10614558  27.96
     64     127   32088 2798748   7.37
    128     255    8654 1492521   3.93
    256     511    4227 1431586   3.77
    512    1023    2531 1824377   4.81
   1024    2047    2125 3076304   8.10
   2048    4095    1615 4691302  12.36
   4096    8191    1070 6062351  15.97
   8192   16383     139 1454627   3.83
  16384   32767       2   41359   0.11
total free extents 568549
total free blocks 37968026
average free extent size 66.7806

AG 19
   from      to extents  blocks    pct
      1       1     107     107   0.00
      2       3    1571    4670   0.01
      4       7    4283   22880   0.06
      8      15    6402   73460   0.19
     16      31   13908  333208   0.88
     32      63   30597 1431708   3.78
     64     127   55680 5363294  14.16
    128     255   47437 8647119  22.83
    256     511   17362 5680534  15.00
    512    1023    5096 3578314   9.45
   1024    2047    2397 3364622   8.89
   2048    4095    1309 3712093   9.80
   4096    8191     715 3963154  10.47
   8192   16383     143 1501527   3.97
  16384   32767       9  191537   0.51
total free extents 187016
total free blocks 37868227
average free extent size 202.487

AG 20
   from      to extents  blocks    pct
      1       1     598     598   0.00
      2       3     827    2187   0.01
      4       7    3608   19728   0.05
      8      15    6881   79046   0.21
     16      31   13022  305537   0.80
     32      63   25486 1210959   3.17
     64     127   49017 4712712  12.34
    128     255   30156 5748421  15.05
    256     511   10155 3495836   9.15
    512    1023    5674 4149553  10.86
   1024    2047    3253 4652429  12.18
   2048    4095    1900 5476465  14.34
   4096    8191    1208 6732503  17.63
   8192   16383     105 1198304   3.14
  16384   32767      16  338049   0.89
  32768   65535       2   74999   0.20
total free extents 151908
total free blocks 38197326
average free extent size 251.45

AG 21
   from      to extents  blocks    pct
      1       1      20      20   0.00
      2       3      28      63   0.00
      4       7     115     649   0.01
      8      15     244    2809   0.03
     16      31     453   10626   0.11
     32      63    1265   61288   0.63
     64     127    2405  228896   2.35
    128     255    3165  577983   5.93
    256     511    1913  647654   6.64
    512    1023    1038  714076   7.32
   1024    2047     700 1028270  10.55
   2048    4095     698 2035971  20.88
   4096    8191     514 2981234  30.58
   8192   16383      74  834862   8.56
  16384   32767      15  325690   3.34
  32768   65535       7  299123   3.07
total free extents 12654
total free blocks 9749214
average free extent size 770.445

AG 22
   from      to extents  blocks    pct
      1       1     181     181   0.00
      2       3     181     485   0.00
      4       7     471    2569   0.00
      8      15    1133   12622   0.02
     16      31    1660   38976   0.06
     32      63    2968  138446   0.23
     64     127    5624  534816   0.88
    128     255    6238 1013851   1.67
    256     511    4898 1598605   2.63
    512    1023    3271 2265332   3.73
   1024    2047    2349 3367324   5.54
   2048    4095    1707 4927265   8.10
   4096    8191    1258 7233661  11.90
   8192   16383     416 4681490   7.70
  16384   32767     240 5417597   8.91
  32768   65535     142 6879800  11.31
  65536  131071      73 6599297  10.85
 131072  262143      55 9907812  16.30
 262144  524287      15 6182398  10.17
total free extents 32880
total free blocks 60802527
average free extent size 1849.23

AG 23
   from      to extents  blocks    pct
      1       1      90      90   0.00
      2       3      73     174   0.00
      4       7     260    1427   0.00
      8      15     639    7159   0.01
     16      31    1103   26041   0.03
     32      63    2083   99039   0.12
     64     127    4417  423805   0.52
    128     255    6331 1033005   1.28
    256     511    5124 1670066   2.07
    512    1023    3618 2499924   3.09
   1024    2047    2480 3492362   4.32
   2048    4095    1509 4278094   5.30
   4096    8191     996 5688231   7.04
   8192   16383     124 1403821   1.74
  16384   32767      35  805104   1.00
  32768   65535      29 1301420   1.61
  65536  131071      26 2259171   2.80
 131072  262143      31 5862140   7.26
 262144  524287      30 11971903  14.82
 524288 1048575      15 11844396  14.66
1048576 2097151      14 20897049  25.87
4194304 8388607       1 5213142   6.45
total free extents 29028
total free blocks 80777563
average free extent size 2782.75

AG 24
   from      to extents  blocks    pct
      1       1     670     670   0.00
      2       3    1385    3465   0.01
      4       7    4049   22284   0.04
      8      15    7957   91548   0.15
     16      31   18216  435852   0.72
     32      63   50655 2550157   4.20
     64     127   82684 7564495  12.46
    128     255   48304 8493950  14.00
    256     511   21102 7373519  12.15
    512    1023   10541 7401799  12.20
   1024    2047    5096 7179159  11.83
   2048    4095    2525 7157989  11.79
   4096    8191    1378 7677669  12.65
   8192   16383     119 1376478   2.27
  16384   32767     103 2424330   3.99
  32768   65535      15  672757   1.11
  65536  131071       2  131468   0.22
 131072  262143       1  131200   0.22
total free extents 254802
total free blocks 60688789
average free extent size 238.18

AG 25
   from      to extents  blocks    pct
      1       1     405     405   0.00
      2       3     165     378   0.00
      4       7     743    4356   0.01
      8      15    1868   21572   0.03
     16      31    3909   93678   0.12
     32      63    8919  425029   0.55
     64     127   15412 1455339   1.90
    128     255   24492 4427828   5.77
    256     511   19279 6661076   8.68
    512    1023   13237 9346109  12.17
   1024    2047    9003 12812201  16.69
   2048    4095    5778 16383579  21.34
   4096    8191    3442 19072421  24.84
   8192   16383     248 2911859   3.79
  16384   32767      99 2365870   3.08
  32768   65535      18  795680   1.04
total free extents 107017
total free blocks 76777380
average free extent size 717.432

AG 26
   from      to extents  blocks    pct
      1       1      99      99   0.00
      2       3     298     817   0.00
      4       7    1677    8992   0.01
      8      15    2826   32643   0.04
     16      31    5761  136023   0.15
     32      63   12636  615792   0.66
     64     127   18809 1774258   1.91
    128     255   28196 4707461   5.06
    256     511   24529 8159340   8.77
    512    1023   16408 11363082  12.21
   1024    2047   10999 15601527  16.76
   2048    4095    7504 21234282  22.81
   4096    8191    4545 25548241  27.45
   8192   16383     254 2849284   3.06
  16384   32767      34  733619   0.79
  32768   65535       6  307024   0.33
total free extents 134581
total free blocks 93072484
average free extent size 691.572

AG 27
   from      to extents  blocks    pct
      1       1   10879   10879   0.02
      2       3   11703   30182   0.06
      4       7   15365   82109   0.16
      8      15   21594  249932   0.50
     16      31   28409  655518   1.31
     32      63   43497 2049403   4.08
     64     127   72680 6904421  13.76
    128     255  102300 19405736  38.67
    256     511   26206 9273138  18.48
    512    1023    8191 5686275  11.33
   1024    2047    2047 2745613   5.47
   2048    4095     489 1365775   2.72
   4096    8191     180  993540   1.98
   8192   16383      30  320798   0.64
  16384   32767      16  368624   0.73
  32768   65535       1   34557   0.07
total free extents 343587
total free blocks 50176500
average free extent size 146.037

AG 28
   from      to extents  blocks    pct
      1       1    7552    7552   0.01
      2       3    3098    7292   0.01
      4       7    2771   15821   0.02
      8      15    7104   82312   0.10
     16      31   10221  237417   0.29
     32      63   20206  967751   1.16
     64     127   35069 3342138   4.02
    128     255   72234 13584381  16.35
    256     511   37385 13205348  15.89
    512    1023   19299 13455515  16.19
   1024    2047    8191 11466542  13.80
   2048    4095    3650 10478167  12.61
   4096    8191    2434 13579353  16.34
   8192   16383     162 1867524   2.25
  16384   32767      27  595097   0.72
  32768   65535       4  207899   0.25
total free extents 229407
total free blocks 83100109
average free extent size 362.239

AG 29
   from      to extents  blocks    pct
      1       1    7059    7059   0.04
      2       3    6270   15470   0.09
      4       7    9456   50893   0.30
      8      15   15264  175419   1.03
     16      31   26440  618832   3.64
     32      63   44154 2089707  12.28
     64     127   80897 7696115  45.23
    128     255   20391 3384375  19.89
    256     511    3974 1229352   7.22
    512    1023     626  410718   2.41
   1024    2047     141  193862   1.14
   2048    4095      59  170185   1.00
   4096    8191      85  473185   2.78
   8192   16383      21  239248   1.41
  16384   32767      10  208379   1.22
  32768   65535       1   53376   0.31
total free extents 214848
total free blocks 17016175
average free extent size 79.201

AG 30
   from      to extents  blocks    pct
      1       1    1672    1672   0.03
      2       3    1073    2577   0.05
      4       7    1202    6461   0.13
      8      15    1751   19741   0.39
     16      31    2830   65939   1.29
     32      63    4589  216879   4.25
     64     127    8443  801744  15.71
    128     255    5988 1023450  20.05
    256     511    2230  737877  14.46
    512    1023     714  495411   9.71
   1024    2047     377  536218  10.51
   2048    4095     212  611170  11.98
   4096    8191      85  478388   9.37
   8192   16383       7   86683   1.70
  16384   32767       1   19328   0.38
total free extents 31174
total free blocks 5103538
average free extent size 163.711

AG 31
   from      to extents  blocks    pct
      1       1    4225    4225   0.02
      2       3    3496    8376   0.05
      4       7    5808   32657   0.19
      8      15   12158  139830   0.80
     16      31   22809  534239   3.07
     32      63   42627 2017708  11.60
     64     127   79484 7548752  43.41
    128     255   22140 3540716  20.36
    256     511    4862 1476486   8.49
    512    1023     568  364917   2.10
   1024    2047     151  204945   1.18
   2048    4095      58  168429   0.97
   4096    8191      74  435946   2.51
   8192   16383      30  379043   2.18
  16384   32767      27  532823   3.06
total free extents 198517
total free blocks 17389092
average free extent size 87.595


* Re: Mysterious ENOSPC [was: XFS fallocate implementation incorrectly reports ENOSPC]
  2021-08-30  4:21                   ` Darrick J. Wong
@ 2021-08-30  7:40                     ` Chris Dunlop
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Dunlop @ 2021-08-30  7:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, Eric Sandeen, linux-xfs

On Sun, Aug 29, 2021 at 09:21:18PM -0700, Darrick J. Wong wrote:
> Since you've formatted with rmapbt enabled, you probably have a new
> enough xfsprogs that you can /also/ use this on a live fs:

Yep, I put on xfsprogs 5.12.0 to look into all of this.

> $ xfs_spaceman -c 'freesp -g'  /
...
> $ xfs_spaceman -c 'freesp -s -a 2'  /

Tks, that's useful.

Cheers,

Chris


* Re: Mysterious ENOSPC
  2021-08-30  7:37                   ` Mysterious ENOSPC Chris Dunlop
@ 2021-09-02  1:42                     ` Dave Chinner
  2021-09-17  6:07                       ` Chris Dunlop
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2021-09-02  1:42 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Eric Sandeen, linux-xfs

On Mon, Aug 30, 2021 at 05:37:20PM +1000, Chris Dunlop wrote:
> On Mon, Aug 30, 2021 at 08:04:57AM +1000, Dave Chinner wrote:
> > On Sat, Aug 28, 2021 at 01:58:24PM +1000, Chris Dunlop wrote:
> > > On Sat, Aug 28, 2021 at 10:21:37AM +1000, Chris Dunlop wrote:
> > > > On Sat, Aug 28, 2021 at 08:03:43AM +1000, Dave Chinner wrote:
> > > > > commit fd43cf600cf61c66ae0a1021aca2f636115c7fcb
> > > > > Author: Brian Foster <bfoster@redhat.com>
> > > > > Date:   Wed Apr 28 15:06:05 2021 -0700
> > > > > 
> > > > >   xfs: set aside allocation btree blocks from block reservation
> > > > 
> > > > Oh wow. Yes, sounds like a candidate. Is there some easy(-ish?) way of
> > > > seeing if this fs is likely to be suffering from this particular issue
> > > > or is it a matter of installing an appropriate kernel and seeing if the
> > > > problem goes away?
> > > 
> > > Is this sufficient to tell us that this filesystem probably isn't suffering
> > > from that issue?
> > 
> > IIRC, it's the per-ag histograms that are more important here
> > because we are running out of space in an AG because of
> > overcommitting the per-ag space. If there is an AG that is much more
> > fragmented than others, then it will be consuming much more in way
> > of freespace btree blocks than others...
> 
> Per-ag histograms attached.
> 
> Do the blocks used by the allocation btrees show up in the AG histograms?
> E.g. with an AG like this:
> 
> AG 18
>    from      to extents  blocks    pct
>       1       1    1961    1961   0.01
>       2       3   17129   42602   0.11
>       4       7   33374  183312   0.48
>       8      15   68076  783020   2.06
>      16      31  146868 3469398   9.14
>      32      63  248690 10614558  27.96
>      64     127   32088 2798748   7.37
>     128     255    8654 1492521   3.93
>     256     511    4227 1431586   3.77
>     512    1023    2531 1824377   4.81
>    1024    2047    2125 3076304   8.10
>    2048    4095    1615 4691302  12.36
>    4096    8191    1070 6062351  15.97
>    8192   16383     139 1454627   3.83
>   16384   32767       2   41359   0.11
> total free extents 568549
> total free blocks 37968026
> average free extent size 66.7806
> 
> ...it looks like it's significantly fragmented, but, if the allocation
> btrees aren't part of this, it seems there's still sufficient free space
> that it shouldn't be getting to ENOSPC?

Unless something asks for ~120GB of space to be allocated from the
AG, and then it will have only a small amount of free space and
could trigger such issues.
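
For a rough sense of scale, the arithmetic behind that remark can be sketched as below. The 4096-byte block size is an assumption (it is not stated in this message), and the 120GB request is the hypothetical figure from the paragraph above, not an observed allocation.

```python
# Sketch only: how much of AG 18 would remain after one very large allocation.
BLOCK_SIZE = 4096  # assumption: 4 KiB filesystem blocks


def free_bytes(free_blocks, block_size=BLOCK_SIZE):
    """Convert a 'total free blocks' figure from freesp into bytes."""
    return free_blocks * block_size


ag18_free = free_bytes(37968026)    # "total free blocks" reported for AG 18
request = 120 * 10**9               # hypothetical ~120GB allocation
remaining = ag18_free - request

print(f"AG 18 free:      {ag18_free / 1e9:.1f} GB")
print(f"after ~120GB:    {remaining / 1e9:.1f} GB")
```

Under those assumptions the AG goes from roughly 155 GB free to roughly 35 GB free, all of it in whatever fragmented extents are left over.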

As you said, this is difficult to reproduce, so the current state of
the FS is unlikely to be in the exact state that triggers the
problem. What I'm looking at is whether the underlying conditions
are present that could potentially lead to that sort of problem
occurring.

> > Context is very important when trying to determine if free space
> > fragmentation is an issue or not. Most of the time, it isn't an
> > issue at all but people have generally been trained to think "all
> > fragmentation is bad" rather than "only worry about fragmentation if
> > there is a problem that is directly related to physical allocation
> > patterns"...
> 
> In this case it's a typical backup application: it uploads regular
> incremental files and those are later merged into a full backup file, either
> by extending or overwriting or reflinking depending on whether the app
> decides to use reflinks or not. The uploads are sequential and mostly
> large-ish writes (132K+), then the merge is small to medium size randomish
> writes or reflinks (4K-???). So the smaller writes/reflinks are going to
> create a significant amount of fragmentation. The incremental files are
> removed entirely at some later time (no discard involved).

IOWs, sets of data with different layouts and temporal
characteristics. Yup, that will cause fragmentation over time and
slowly prevent recovery of large free spaces as files are deleted.
The AG histograms largely reflect this.

> I guess if it's determined this pattern is critically suboptimal and causing
> this errant ENOSPC issue, and the changes in 5.13 don't help, there's
> nothing to stop me from occasionally doing a full (non-reflink) copy of the
> large full backup files into another file to get them nicely sequential. I'd
> lose any reflinks along the way of course, but they don't last a long time
> anyway (days to a few weeks) depending on how long the smaller incremental
> files are kept.

IOWs, you suggest defragmenting the file data. You could do that
transparently with xfs_fsr, but defragmenting data doesn't actually
fix free space fragmentation - it actually makes it worse. This is
inherent in the defragmentation algorithm - small used spaces get
turned into small free spaces and large free spaces get turned into
large used spaces.

Defragmenting free space is a whole lot harder, and it involves
identifying where free space is interleaved with data and then
moving that data to other free space so the small free spaces are
reconnected into a large free space. Defragmenting data is easy,
defragmenting free space is much harder...

> AG 15
>    from      to extents  blocks    pct
>       1       1     207     207   0.00
>       2       3     519    1471   0.02
>       4       7    1978   10867   0.13
>       8      15    3736   42434   0.50
>      16      31    6604  154719   1.83
>      32      63   13689  653865   7.73
>      64     127   24824 2356818  27.86
>     128     255   21639 3771966  44.59
>     256     511    1990  611208   7.23
>     512    1023     157  105129   1.24
>    1024    2047      74  107559   1.27
>    2048    4095     153  377991   4.47
>    4096    8191      27  163987   1.94
>    8192   16383       9  101213   1.20
> total free extents 75606
> total free blocks 8459434
> average free extent size 111.888

This AG is a candidate - it's only got ~35GB of free space
in it and has significant free space fragmentation - at least 160
freespace btree blocks per btree in this AG.
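
As a rough cross-check of those figures, the AG 15 histogram can be summed directly. This is a sketch under stated assumptions, not the calculation used above: the 4096-byte block size, the 56-byte btree block header and the 8-byte free space record are assumptions about the on-disk format, and the leaf count is a lower bound that presumes every leaf block is completely full.

```python
import math

BLOCK_SIZE = 4096                        # assumption: 4 KiB filesystem blocks
RECS_PER_LEAF = (BLOCK_SIZE - 56) // 8   # assumption: 56-byte header, 8-byte records

# Data rows of the AG 15 histogram quoted above: from, to, extents, blocks, pct.
AG15 = """\
      1       1     207     207   0.00
      2       3     519    1471   0.02
      4       7    1978   10867   0.13
      8      15    3736   42434   0.50
     16      31    6604  154719   1.83
     32      63   13689  653865   7.73
     64     127   24824 2356818  27.86
    128     255   21639 3771966  44.59
    256     511    1990  611208   7.23
    512    1023     157  105129   1.24
   1024    2047      74  107559   1.27
   2048    4095     153  377991   4.47
   4096    8191      27  163987   1.94
   8192   16383       9  101213   1.20
"""


def summarize(hist_text):
    """Return (total free extents, total free blocks) for one histogram."""
    extents = blocks = 0
    for line in hist_text.splitlines():
        fields = line.split()
        # Skip headers, totals and blank lines; keep only numeric data rows.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        extents += int(fields[2])
        blocks += int(fields[3])
    return extents, blocks


extents, blocks = summarize(AG15)
avg_extent = blocks / extents
free_gb = blocks * BLOCK_SIZE / 1e9
min_leaves = math.ceil(extents / RECS_PER_LEAF)

print(f"free extents {extents}, free blocks {blocks}")
print(f"average free extent size {avg_extent:.3f} blocks")
print(f"free space {free_gb:.1f} GB")
print(f"at least {min_leaves} leaf blocks per free space btree")
```

The sums match the totals printed by freesp, and the fully packed lower bound lands in the same range as the per-btree block count mentioned above; partially filled leaves would push the real number higher.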

> AG 30
>    from      to extents  blocks    pct
>       1       1    1672    1672   0.03
>       2       3    1073    2577   0.05
>       4       7    1202    6461   0.13
>       8      15    1751   19741   0.39
>      16      31    2830   65939   1.29
>      32      63    4589  216879   4.25
>      64     127    8443  801744  15.71
>     128     255    5988 1023450  20.05
>     256     511    2230  737877  14.46
>     512    1023     714  495411   9.71
>    1024    2047     377  536218  10.51
>    2048    4095     212  611170  11.98
>    4096    8191      85  478388   9.37
>    8192   16383       7   86683   1.70
>   16384   32767       1   19328   0.38
> total free extents 31174
> total free blocks 5103538
> average free extent size 163.711

This one has the least free space, but fewer free space
extents. It's still a potential candidate for AG ENOSPC conditions
to be triggered, though.
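
One way to pick out such candidates mechanically is to rank AGs by free space and show the extent count alongside. The sketch below uses the "total free extents" and "total free blocks" lines for a subset of the AGs posted in this thread; the choice of subset and the helper name `rank_by_free_space` are illustrative only.

```python
# (total free extents, total free blocks) copied from the per-AG histograms.
AG_TOTALS = {
    15: (75606, 8459434),
    17: (236354, 18960560),
    18: (568549, 37968026),
    21: (12654, 9749214),
    29: (214848, 17016175),
    30: (31174, 5103538),
    31: (198517, 17389092),
}


def rank_by_free_space(totals):
    """Return (ag, extents, blocks, avg extent size) rows, least free space first."""
    rows = [(ag, ext, blk, blk / ext) for ag, (ext, blk) in totals.items()]
    return sorted(rows, key=lambda row: row[2])


ranked = rank_by_free_space(AG_TOTALS)
print(f"{'AG':>3} {'extents':>8} {'blocks':>10} {'avg':>9}")
for ag, ext, blk, avg in ranked:
    print(f"{ag:>3} {ext:>8} {blk:>10} {avg:>9.3f}")
```

Sorting on free blocks alone puts AG 30 and AG 15 at the top of the list; the extent count and average extent size columns then show how fragmented that remaining space is.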

Ok, now I've seen the filesystem layout, I can say that the
preconditions for per-ag ENOSPC conditions do actually exist. Hence
we now really need to know what operation is reporting ENOSPC. I
guess we'll just have to wait for that to occur again and hope your
scripts capture it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Mysterious ENOSPC
  2021-09-02  1:42                     ` Dave Chinner
@ 2021-09-17  6:07                       ` Chris Dunlop
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Dunlop @ 2021-09-17  6:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs

On Thu, Sep 02, 2021 at 11:42:06AM +1000, Dave Chinner wrote:
> On Mon, Aug 30, 2021 at 08:04:57AM +1000, Dave Chinner wrote:
>> FWIW, if you are using reflink heavily and you have rmap enabled (as
>> you have), there's every chance that an AG has completely run out of
>> space and so new rmap records for shared extents can't be allocated
>> - that can give you spurious ENOSPC errors before the filesystem is
>> 100% full, too.
>>
>> i.e. every shared extent in the filesystem has a rmap record
>> pointing back to each owner of the shared extent. That means for an
>> extent shared 1000 times, there are 1000 rmap records for that
>> shared extent. If you share it again, a new rmap record needs to be
>> inserted into the rmapbt, and if the AG is completely out of space
>> this can fail w/ ENOSPC. Hence you can get ENOSPC errors attempting
>> to share or unshare extents because there isn't space in the AG for
>> the tracking metadata for the new extent record....
...
> Ok, now I've seen the filesystem layout, I can say that the
> preconditions for per-ag ENOSPC conditions do actually exist. Hence
> we now really need to know what operation is reporting ENOSPC. I
> guess we'll just have to wait for that to occur again and hope your
> scripts capture it.

FYI, "something" seems to have changed without any particular prompting 
and there haven't been any ENOSPC events in the last 3 weeks whereas 
previously they were occurring 4-5 times a week. Sigh.

Cheers,

Chris


