* xfs_bmap_extents_to_btree allocation warnings
@ 2022-01-05  7:10 Krister Johansen
  2022-01-06  1:01 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Krister Johansen @ 2022-01-05  7:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner

Hi,
I've been running into occasional WARNs related to allocating a block to
hold the new btree that XFS is attempting to create when calling this
function.  The problem is sporadic -- once every 10-40 days and a
different system each time.

The process that's triggering the problem is dd punching a hole into
a file via direct I/O.  It's doing this as part of a watchdog process to
ensure that the system remains able to issue read and write requests.
The direct I/O is an attempt to avoid reading/writing cached data from
this process.

I'm hardly an expert; however, after some digging it appears that the
direct I/O path for this particular workload is more susceptible to the
problem because its tp->t_firstblock is always set to a block in an
existing AG, while the rest of the I/O on this filesystem goes through
the page cache and uses the delayed allocation mechanism by default.
(IOW, t_firstblock is NULLFSBLOCK most of the time.)

Looking at the version history and mailing list archives for this
particular bit of code seems to indicate that the function
hasn't had a ton of churn.  The XFS_ALLOCTYPE_START_BNO vs
XFS_ALLOCTYPE_NEAR_BNO bits seem not to have changed much since they
were authored in the 90s.

I haven't yet been able to get a kdump to look into this WARN in more
detail, but I was curious whether the rationale for using
XFS_ALLOCTYPE_NEAR_BNO still holds true for modern Linux-based XFS.

It seemed like one reason for keeping the bmap and the inode in the same
AG might be that with 32-bit block pointers in an inode there wouldn't
be space to store the AG and the block if the btree were allocated in a
different AG.  It also seemed like there were lock order concerns when
iterating over multiple AGs.  However, linux is using 64-bit block
pointers in the inode now and the XFS_ALLOCTYPE_START_BNO case in
xfs_alloc_vextent() seems to try to ensure that it never considers an AG
that's less than the agno for the fsbno passed in via args.
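
If I'm reading it right, the constraint boils down to something like the
toy userspace sketch below (the AG numbers and free-space counts are made
up, and this is not the kernel allocator -- it just models "scan forward
from the AG that t_firstblock already points into, never backwards"):

/* toy model only -- not xfs_alloc_vextent(); AG free counts are made up */
#include <stdio.h>

#define NUM_AGS	4

static unsigned long ag_free_blocks[NUM_AGS] = { 200, 3, 0, 500 };

/*
 * Scan forward from start_agno only, mirroring the rule that a
 * transaction which already holds the AGF lock for start_agno must
 * never go back and lock a lower-numbered AGF.
 */
static int pick_ag(int start_agno, unsigned long need)
{
	int agno;

	for (agno = start_agno; agno < NUM_AGS; agno++)
		if (ag_free_blocks[agno] >= need)
			return agno;
	return -1;	/* nothing suitable without violating lock order */
}

int main(void)
{
	/* t_firstblock already points into AG 1, so AG 0 is off limits */
	printf("chosen AG: %d\n", pick_ag(1, 8));
	return 0;
}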

Would something like this:

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4dccd4d90622..5d949ac1ecae 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -664,6 +664,13 @@ xfs_bmap_extents_to_btree(
 	if (error)
 		goto out_root_realloc;
 
+	if (args.fsbno == NULLFSBLOCK && args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		args.type = XFS_ALLOCTYPE_START_BNO;
+		error = xfs_alloc_vextent(&args);
+		if (error)
+			goto out_root_realloc;
+	}
+
 	if (WARN_ON_ONCE(args.fsbno == NULLFSBLOCK)) {
 		error = -ENOSPC;
 		goto out_root_realloc;

Or this:

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4dccd4d90622..94e4ecb75561 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -647,14 +647,10 @@ xfs_bmap_extents_to_btree(
 	args.tp = tp;
 	args.mp = mp;
 	xfs_rmap_ino_bmbt_owner(&args.oinfo, ip->i_ino, whichfork);
+	args.type = XFS_ALLOCTYPE_START_BNO;
 	if (tp->t_firstblock == NULLFSBLOCK) {
-		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
-	} else if (tp->t_flags & XFS_TRANS_LOWMODE) {
-		args.type = XFS_ALLOCTYPE_START_BNO;
-		args.fsbno = tp->t_firstblock;
 	} else {
-		args.type = XFS_ALLOCTYPE_NEAR_BNO;
 		args.fsbno = tp->t_firstblock;
 	}
 	args.minlen = args.maxlen = args.prod = 1;

be a reasonable way to address the WARN?  Or does this open a box of
problems that are obvious to the experienced, but just subtle enough to
elude the unfamiliar?

I ask because these filesystems are pretty busy on a day to day basis
and the path where t_firstblock is NULLFSBLOCK is never hitting this
problem.  The overall workload is a btree based database.  Lots of
random reads and writes to many files that all live in the same
directory.

While I don't have a full RCA, the circumstantial evidence seems to
suggest that letting the allocation be satisfied by more than one AG
will result in many fewer failures in xfs_bmap_extents_to_btree.
Assuming, of course, that it's actually a safe and sane thing to do.

Many thanks,

-K





Including additional diagnostic output that was requested by the FAQ in
case it's helpful:

Kernel version: various 5.4 LTS versions
xfsprogs version: 5.3.0
number of CPUs: 48

Storage layout: 4x 7TB NVMe drives in a 28T LVM stripe with 16k width

xfs-info:

meta-data=/dev/mapper/db-vol     isize=512    agcount=32, agsize=228849020 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=7323168640, imaxpct=5
         =                       sunit=4      swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=4 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

An example warning:

------------[ cut here ]------------
WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
Modules linked in: btrfs xor zstd_compress raid6_pq ufs msdos softdog binfmt_misc udp_diag tcp_diag inet_diag xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua p

CPU: 4 PID: 115756 Comm: dd Not tainted 5.4.139 #2
Hardware name: Amazon EC2 i3en.12xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
Code: 00 00 8b 85 00 ff ff ff 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 c7 45 9c 10 00 00 00 48 89 95 60 ff ff ff e9 5a fd ff ff <0f> 0b c7 85 00 ff ff ff e4 ff ff ff 8b 9d

RSP: 0018:ffffa948115fb740 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffff8ffec2274048 RCX: 00000000001858ab
RDX: 000000000017e467 RSI: 0000000000000000 RDI: ffff904ea1726000
RBP: ffffa948115fb870 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 000000000017e467 R12: ffff8ff22bcdd6a8
R13: ffff904ea9f85000 R14: ffff8ffec2274000 R15: ffff904e476a4380
FS:  00007f5625e89580(0000) GS:ffff904ebb300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055bd5952c000 CR3: 000000011d6f0004 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 ? xfs_alloc_update_counters.isra.0+0x3d/0x50 [xfs]
 ? xfs_trans_log_buf+0x30/0x80 [xfs]
 ? xfs_alloc_log_agf+0x73/0x100 [xfs]
 xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
 xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
 ? kmem_zone_alloc+0x85/0x140 [xfs]
 xfs_bmapi_write+0x3a9/0x5f0 [xfs]
 xfs_iomap_write_direct+0x293/0x3c0 [xfs]
 xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
 iomap_apply+0x68/0x160
 ? iomap_dio_bio_actor+0x3d0/0x3d0
 iomap_dio_rw+0x2c1/0x450
 ? iomap_dio_bio_actor+0x3d0/0x3d0
 xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
 ? xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
 xfs_file_write_iter+0x99/0xe0 [xfs]
 new_sync_write+0x125/0x1c0
 __vfs_write+0x29/0x40
 vfs_write+0xb9/0x1a0
 ksys_write+0x67/0xe0
 __x64_sys_write+0x1a/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5625da71e7
Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48

RSP: 002b:00007ffd3916c2a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000004000 RCX: 00007f5625da71e7
RDX: 0000000000004000 RSI: 000055bd59529000 RDI: 0000000000000001
RBP: 000055bd59529000 R08: 000055bd595280f0 R09: 000000000000007c
R10: 000055bd59529000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000004000 R15: 000055bd59529000
---[ end trace b26426a6b66a298e ]---
XFS (dm-1): Internal error xfs_trans_cancel at line 1053 of file fs/xfs/xfs_trans.c.  Caller xfs_iomap_write_direct+0x1fb/0x3c0 [xfs]

The xfs_db freesp report after the problem occurred.  (N.B. it was a few
hours before I was able to get to this machine to investigate)

xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
   from      to extents  blocks    pct
      1       1      48      48   0.00
      2       3     119     303   0.02
      4       7      46     250   0.01
      8      15      22     255   0.01
     16      31      17     374   0.02
     32      63      16     728   0.04
     64     127       9     997   0.05
    128     255     149   34271   1.83
    256     511       7    2241   0.12
    512    1023       4    2284   0.12
   1024    2047       1    1280   0.07
   2048    4095       1    3452   0.18
1048576 2097151       1 1825182  97.52
total free extents 440
total free blocks 1871665
average free extent size 4253.78

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: xfs_bmap_extents_to_btree allocation warnings
  2022-01-05  7:10 xfs_bmap_extents_to_btree allocation warnings Krister Johansen
@ 2022-01-06  1:01 ` Dave Chinner
  2022-01-06  8:52   ` Krister Johansen
  2022-01-08  5:40   ` Krister Johansen
  0 siblings, 2 replies; 7+ messages in thread
From: Dave Chinner @ 2022-01-06  1:01 UTC (permalink / raw)
  To: Krister Johansen; +Cc: linux-xfs, Darrick J. Wong, Christoph Hellwig

On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> Hi,
> I've been running into occasional WARNs related to allocating a block to
> hold the new btree that XFS is attempting to create when calling this
> function.  The problem is sporadic -- once every 10-40 days and a
> different system each time.

The warning is:

> WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> Call Trace:
>  xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
>  xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
>  xfs_bmapi_write+0x3a9/0x5f0 [xfs]
>  xfs_iomap_write_direct+0x293/0x3c0 [xfs]
>  xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
>  iomap_apply+0x68/0x160
>  iomap_dio_rw+0x2c1/0x450
>  xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
>  xfs_file_write_iter+0x99/0xe0 [xfs]
>  new_sync_write+0x125/0x1c0
>  __vfs_write+0x29/0x40
>  vfs_write+0xb9/0x1a0
>  ksys_write+0x67/0xe0
>  __x64_sys_write+0x1a/0x20
>  do_syscall_64+0x57/0x190
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Which is indicative of a multi-extent allocation (i.e. data extent +
BMBT indexing btree block allocations) not selecting an AG that has
enough space for both data and BMBT blocks to be allocated.
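
In toy numbers (just an illustration, not the real reservation code
that args.total feeds into):

/* toy illustration of why total must cover the whole allocation chain */
#include <stdbool.h>
#include <stdio.h>

/* accept an AG only if it can hold everything we might allocate in it */
static bool ag_ok(unsigned long ag_free, unsigned long total)
{
	return ag_free >= total;
}

int main(void)
{
	unsigned long ag_free = 5;	/* nearly full AG */
	unsigned long data = 4;		/* data extent blocks */
	unsigned long bmbt = 4;		/* worst-case BMBT blocks */

	/* total covers only the data extent: AG accepted, BMBT alloc fails later */
	printf("total=%lu: %s\n", data,
	       ag_ok(ag_free, data) ? "accepted" : "skipped");

	/* total covers the whole chain: AG skipped up front, clean ENOSPC instead */
	printf("total=%lu: %s\n", data + bmbt,
	       ag_ok(ag_free, data + bmbt) ? "accepted" : "skipped");
	return 0;
}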

That's likely because args.total was not set correctly before AG
selection was made. This exact sort of bug can be seen in the fix
for recent error injection debug code in commit 6e8bd39d7227 ("xfs:
Initialize xfs_alloc_arg->total correctly when allocating minlen
extents"). You're not running a CONFIG_XFS_DEBUG=y kernel are you?

In this case:

> The process that's triggering the problem is dd punching a hole into
> a file via direct I/O.  It's doing this as part of a watchdog process to
> ensure that the system remains able to issue read and write requests.
> The direct I/O is an attempt to avoid reading/writing cached data from
> this process.
> 
> I'm hardly an expert; however, after some digging it appears that the
> direct I/O path for this particular workload is more susceptible to the
> problem because its tp->t_firstblock is always set to a block in an
> existing AG,

Which is always true for BMBT block allocation - we've already
allocated the data extent in the transaction in
xfs_bmap_add_extent_hole_real(), and now we are doing the BMBT block
allocation needed to convert the extent list inline in the inode
literal area into external btree block format.

> while the rest of the I/O on this filesystem goes through
> the page cache and uses the delayed allocation mechanism by default.
> (IOW, t_firstblock is NULLFSBLOCK most of the time.)

Has nothing to do with delayed allocation - the physical allocation
that occurs in writeback will do exactly the same thing as direct IO
allocation....

> It seemed like one reason for keeping the bmap and the inode in the same
> AG might be that with 32-bit block pointers in an inode there wouldn't
> be space to store the AG and the block if the btree were allocated in a
> different AG.

The BMBT is a 64 bit btree so can index blocks anywhere in the
filesystem. 32 bit btree indexes are only used for internal
references within an AG, not for user data.

> It also seemed like there were lock order concerns when
> iterating over multiple AGs.

yes, that's what t_firstblock is used to avoid - AGFs must be locked
in ascending order only.

> However, linux is using 64-bit block
> pointers in the inode now and the XFS_ALLOCTYPE_START_BNO case in
> xfs_alloc_vextent() seems to try to ensure that it never considers an AG
> that's less than the agno for the fsbno passed in via args.

Because otherwise allocations ABBA deadlock on AG locking.

> Would something like this:
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 4dccd4d90622..5d949ac1ecae 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -664,6 +664,13 @@ xfs_bmap_extents_to_btree(
>  	if (error)
>  		goto out_root_realloc;
>  
> +	if (args.fsbno == NULLFSBLOCK && args.type == XFS_ALLOCTYPE_NEAR_BNO) {
> +		args.type = XFS_ALLOCTYPE_START_BNO;
> +		error = xfs_alloc_vextent(&args);
> +		if (error)
> +			goto out_root_realloc;
> +	}
> +
>  	if (WARN_ON_ONCE(args.fsbno == NULLFSBLOCK)) {
>  		error = -ENOSPC;
>  		goto out_root_realloc;

It might, but it doesn't fix the root cause which is that we
selected an AG without enough space in it for the entire chain of
allocations in the first place.

> Or this:
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 4dccd4d90622..94e4ecb75561 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -647,14 +647,10 @@ xfs_bmap_extents_to_btree(
>  	args.tp = tp;
>  	args.mp = mp;
>  	xfs_rmap_ino_bmbt_owner(&args.oinfo, ip->i_ino, whichfork);
> +	args.type = XFS_ALLOCTYPE_START_BNO;
>  	if (tp->t_firstblock == NULLFSBLOCK) {
> -		args.type = XFS_ALLOCTYPE_START_BNO;
>  		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
> -	} else if (tp->t_flags & XFS_TRANS_LOWMODE) {
> -		args.type = XFS_ALLOCTYPE_START_BNO;
> -		args.fsbno = tp->t_firstblock;
>  	} else {
> -		args.type = XFS_ALLOCTYPE_NEAR_BNO;
>  		args.fsbno = tp->t_firstblock;
>  	}
>  	args.minlen = args.maxlen = args.prod = 1;

Same again - hides the symptom, doesn't address the root cause.

> be a reasonable way to address the WARN?  Or does this open a box of
> problems that are obvious to the experienced, but just subtle enough to
> elude the unfamiliar?

No, yes, and yes.

> I ask because these filesystems are pretty busy on a day to day basis
> and the path where t_firstblock is NULLFSBLOCK is never hitting this
> problem.  The overall workload is a btree based database.  Lots of
> random reads and writes to many files that all live in the same
> directory.

Which indicates that you are likely hitting AG full conditions more
frequently than not.

> meta-data=/dev/mapper/db-vol     isize=512    agcount=32, agsize=228849020 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1
....
> The xfs_db freesp report after the problem occurred.  (N.B. it was a few
> hours before I was able to get to this machine to investigate)
> 
> xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
>    from      to extents  blocks    pct
>       1       1      48      48   0.00
>       2       3     119     303   0.02
>       4       7      46     250   0.01
>       8      15      22     255   0.01
>      16      31      17     374   0.02
>      32      63      16     728   0.04
>      64     127       9     997   0.05
>     128     255     149   34271   1.83
>     256     511       7    2241   0.12
>     512    1023       4    2284   0.12
>    1024    2047       1    1280   0.07
>    2048    4095       1    3452   0.18
> 1048576 2097151       1 1825182  97.52
> total free extents 440
> total free blocks 1871665
> average free extent size 4253.78

So 1,871,665 of 228,849,020 blocks free in the AG. That's 99.2%
full, so it's extremely likely you are hitting a full AG condition.

/me goes and looks at xfs_iomap_write_direct()....

.... and notices that it passes "0" as the total allocation block
count, which means it isn't reserving space in the AG for both the
data extent and the BMBT blocks...

... and several other xfs_bmapi_write() callers have the same
issue...

Ok, let me spend a bit more looking into this in more depth, but it
looks like the problem is at the xfs_bmapi_write() caller level, not
deep in the allocator itself.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: xfs_bmap_extents_to_btree allocation warnings
  2022-01-06  1:01 ` Dave Chinner
@ 2022-01-06  8:52   ` Krister Johansen
  2022-01-10  8:41     ` Dave Chinner
  2022-01-08  5:40   ` Krister Johansen
  1 sibling, 1 reply; 7+ messages in thread
From: Krister Johansen @ 2022-01-06  8:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Christoph Hellwig

Hi David,
Thanks for the response.

On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > Call Trace:
> >  xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
> >  xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
> >  xfs_bmapi_write+0x3a9/0x5f0 [xfs]
> >  xfs_iomap_write_direct+0x293/0x3c0 [xfs]
> >  xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
> >  iomap_apply+0x68/0x160
> >  iomap_dio_rw+0x2c1/0x450
> >  xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
> >  xfs_file_write_iter+0x99/0xe0 [xfs]
> >  new_sync_write+0x125/0x1c0
> >  __vfs_write+0x29/0x40
> >  vfs_write+0xb9/0x1a0
> >  ksys_write+0x67/0xe0
> >  __x64_sys_write+0x1a/0x20
> >  do_syscall_64+0x57/0x190
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Which is indicative of a multi-extent allocation (i.e. data extent +
> BMBT indexing btree block allocations) not selecting an AG that has
> enough space for both data and BMBT blocks to be allocated.
> 
> That's likely because args.total was not set correctly before AG
> selection was made. This exact sort of bug can be seen in the fix
> for recent error injection debug code in commit 6e8bd39d7227 ("xfs:
> Initialize xfs_alloc_arg->total correctly when allocating minlen
> extents"). You're not running a CONFIG_XFS_DEBUG=y kernel are you?

I am not running with CONFIG_XFS_DEBUG, no.

> > while the rest of the I/O on this filesystem goes through
> > the page cache and uses the delayed allocation mechanism by default.
> > (IOW, t_firstblock is NULLFSBLOCK most of the time.)
> 
> Has nothing to do with delayed allocation - the physical allocation
> that occurs in writeback will do exactly the same thing as direct IO
> allocation....

It must be similar, but not identical.  Though I think you're right that
I've confused myself about what's happening here.  The dd process that
is triggering this warning is an incredibly small fraction of all the
I/O that occurs on this filesystem, and yet it accounts for all of the
instances of the warning that I can find.

I looked in xfs_bmapi_convert_delalloc and saw that xfs_trans_alloc
allocated a new transaction but didn't touch t_firstblock.  What I
neglected to notice, though, was that xfs_bmap_btalloc and
xfs_bmbt_alloc_block could modify t_firstblock, and that this code path
could get there via the xfs_btree_insert calls in
xfs_bmap_add_extent_delay_real for the latter and the xfs_bmap_alloc
path for the former.

I missed the difference in the alloc_args accounting between the delayed
and real allocation paths and assumed that t_firstblock was the thing
making the difference between the regular and direct I/O cases.

> > However, linux is using 64-bit block
> > pointers in the inode now and the XFS_ALLOCTYPE_START_BNO case in
> > xfs_alloc_vextent() seems to try to ensure that it never considers an AG
> > that's less than the agno for the fsbno passed in via args.
> 
> Because otherwise allocations ABBA deadlock on AG locking.

Sorry, what I'm really trying to ask is: are there still cases in XFS
where using XFS_ALLOCTYPE_START_BNO can give you this kind of deadlock?
There's some deadlock avoidance in the START_BNO implementation, but
there are plenty of places where XFS_ALLOCTYPE_NEAR_BNO is still used if
t_firstblock is not NULLFSBLOCK.  I'm trying to work out if that code
still exists because of concerns about deadlocks, or if it's an attempt
to limit the number of AGs searched instead.  (Or some other policy
choice, even.)

> > be a reasonable way to address the WARN?  Or does this open a box of
> > problems that are obvious to the experienced, but just subtle enough to
> > elude the unfamiliar?
> 
> No, yes, and yes.

Thank you for humoring my questions nonetheless.

> > The xfs_db freesp report after the problem occurred.  (N.B. it was a few
> > hours before I was able to get to this machine to investigate)
> > 
> > xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
> >    from      to extents  blocks    pct
> >       1       1      48      48   0.00
> >       2       3     119     303   0.02
> >       4       7      46     250   0.01
> >       8      15      22     255   0.01
> >      16      31      17     374   0.02
> >      32      63      16     728   0.04
> >      64     127       9     997   0.05
> >     128     255     149   34271   1.83
> >     256     511       7    2241   0.12
> >     512    1023       4    2284   0.12
> >    1024    2047       1    1280   0.07
> >    2048    4095       1    3452   0.18
> > 1048576 2097151       1 1825182  97.52
> > total free extents 440
> > total free blocks 1871665
> > average free extent size 4253.78
> 
> So 1,871,665 of 228,849,020 blocks free in the AG. That's 99.2%
> full, so it's extremely likely you are hitting a full AG condition.
> 
> /me goes and looks at xfs_iomap_write_direct()....
> 
> .... and notices that it passes "0" as the total allocation block
> count, which means it isn't reserving space in the AG for both the
> data extent and the BMBT blocks...
> 
> ... and several other xfs_bmapi_write() callers have the same
> issue...
> 
> Ok, let me spend a bit more looking into this in more depth, but it
> looks like the problem is at the xfs_bmapi_write() caller level, not
> deep in the allocator itself.

At least on 5.4 xfs_bmapi_write is still passing resblks instead of
zero, which is computed in xfs_iomap_write_direct.

Related to your comment about alloc_args.total, I did a bit of tracing
of the xfs_alloc tracepoints on my system and found that total seems to
be getting set in both cases, but that a) it's actually a larger value
for directio; and b) in the buffered write case the code is requesting
more blocks at one time which causes a larger allocation to occur.  I'm
not certain, but wondered if this could be causing us to select an AG
with more space by luck.

directio:

              dd-102229  [005] .... 4969662.383215: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 14240 minlen 4 maxlen 4 mod 0 prod 1 minleft 2 total 8 alignment 1 minalignslop 3 len 4 type THIS_BNO otype THIS_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff
              dd-102229  [005] .... 4969662.383217: <stack trace>
 => xfs_alloc_ag_vextent_exact+0x2b4/0x2d0 [xfs]
 => xfs_alloc_ag_vextent+0x13b/0x150 [xfs]
 => xfs_alloc_vextent+0x2bc/0x550 [xfs]
 => xfs_bmap_btalloc+0x461/0x940 [xfs]
 => xfs_bmap_alloc+0x34/0x40 [xfs]
 => xfs_bmapi_allocate+0xdc/0x2d0 [xfs]
 => xfs_bmapi_write+0x3a9/0x5f0 [xfs]
 => xfs_iomap_write_direct+0x293/0x3c0 [xfs]
 => xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
 => iomap_apply+0x68/0x160
 => iomap_dio_rw+0x2c1/0x450
 => xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
 => xfs_file_write_iter+0x99/0xe0 [xfs]
 => new_sync_write+0x125/0x1c0
 => __vfs_write+0x29/0x40
 => vfs_write+0xb9/0x1a0
 => ksys_write+0x67/0xe0
 => __x64_sys_write+0x1a/0x20
 => do_syscall_64+0x57/0x190
 => entry_SYSCALL_64_after_hwframe+0x44/0xa9

buffered write + fsync:

              dd-109921  [010] .... 4972814.844429: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 21280 minlen 16 maxlen 16 mod 0 prod 1 minleft 2 total 4 alignment 1 minalignslop 3 len 16 type THIS_BNO otype THIS_BNO wasdel 1 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff
              dd-109921  [010] .... 4972814.844433: <stack trace>
 => xfs_alloc_ag_vextent_exact+0x2b4/0x2d0 [xfs]
 => xfs_alloc_ag_vextent+0x13b/0x150 [xfs]
 => xfs_alloc_vextent+0x2bc/0x550 [xfs]
 => xfs_bmap_btalloc+0x461/0x940 [xfs]
 => xfs_bmap_alloc+0x34/0x40 [xfs]
 => xfs_bmapi_allocate+0xdc/0x2d0 [xfs]
 => xfs_bmapi_convert_delalloc+0x26f/0x4b0 [xfs]
 => xfs_map_blocks+0x15a/0x3f0 [xfs]
 => xfs_do_writepage+0x118/0x420 [xfs]
 => write_cache_pages+0x1ae/0x4b0
 => xfs_vm_writepages+0x6a/0xa0 [xfs]
 => do_writepages+0x43/0xd0
 => __filemap_fdatawrite_range+0xd5/0x110
 => file_write_and_wait_range+0x74/0xc0
 => xfs_file_fsync+0x5d/0x230 [xfs]
 => vfs_fsync_range+0x49/0x80
 => do_fsync+0x3d/0x70
 => __x64_sys_fsync+0x14/0x20
 => do_syscall_64+0x57/0x190
 => entry_SYSCALL_64_after_hwframe+0x44/0xa9

I'm not sure if this is useful, but if there's something else you'd like
me to instrument that would be more helpful, let me know and I'll see
what I can pull together.

Thanks again,

-K

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: xfs_bmap_extents_to_btree allocation warnings
  2022-01-06  1:01 ` Dave Chinner
  2022-01-06  8:52   ` Krister Johansen
@ 2022-01-08  5:40   ` Krister Johansen
  2022-01-10  8:50     ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread
From: Krister Johansen @ 2022-01-08  5:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Christoph Hellwig, Brian Foster

Hi Dave,

On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > Hi,
> > I've been running into occasional WARNs related to allocating a block to
> > hold the new btree that XFS is attempting to create when calling this
> > function.  The problem is sporadic -- once every 10-40 days and a
> > different system each time.
> 
> The warning is:
> 
> > WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > Call Trace:
> >  xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
> >  xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
> >  xfs_bmapi_write+0x3a9/0x5f0 [xfs]
> >  xfs_iomap_write_direct+0x293/0x3c0 [xfs]
> >  xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
> >  iomap_apply+0x68/0x160
> >  iomap_dio_rw+0x2c1/0x450
> >  xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
> >  xfs_file_write_iter+0x99/0xe0 [xfs]
> >  new_sync_write+0x125/0x1c0
> >  __vfs_write+0x29/0x40
> >  vfs_write+0xb9/0x1a0
> >  ksys_write+0x67/0xe0
> >  __x64_sys_write+0x1a/0x20
> >  do_syscall_64+0x57/0x190
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 

<snip>
 
> So 1,871,665 of 228,849,020 blocks free in the AG. That's 99.2%
> full, so it's extremely likely you are hitting a full AG condition.
> 
> /me goes and looks at xfs_iomap_write_direct()....
> 
> .... and notices that it passes "0" as the total allocation block
> count, which means it isn't reserving space in the AG for both the
> data extent and the BMBT blocks...
> 
> ... and several other xfs_bmapi_write() callers have the same
> issue...
> 
> Ok, let me spend a bit more looking into this in more depth, but it
> looks like the problem is at the xfs_bmapi_write() caller level, not
> deep in the allocator itself.

I noodled on this a bit more and have another hypothesis.  Feel free to
tell me that this one is just as nuts (or more).  However, after
thinking through your comments about the accounting, and reviewing some
other patches and threads for similar problems:

https://lore.kernel.org/linux-xfs/20171127202434.43125-4-bfoster@redhat.com/
https://lore.kernel.org/linux-xfs/20171207185810.48757-1-bfoster@redhat.com/
https://lore.kernel.org/linux-xfs/20190327145000.10756-1-bfoster@redhat.com/

I wondered if perhaps the problem was related to other problems in
xfs_alloc_fix_freelist.  Taking inspiration from some of the fixes that
Brian made here, it looks like there's a possibility of the freelist
refill code grabbing blocks that were assumed to be available by
previous checks in that function.

For example, using some values from a successful trace of a directio
allocation:

              dd-102227  [027] .... 4969662.381037: xfs_alloc_near_first: dev 253:1 agno 0 agbno 5924 minlen 4 maxlen 4 mod 0 prod 1 minleft 1 total 8 alignment 4 minalignslop 0 len 4 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

              dd-102227  [027] .... 4969662.381047: xfs_alloc_near_first: dev 253:1 agno 0 agbno 5921 minlen 1 maxlen 1 mod 0 prod 1 minleft 0 total 0 alignment 1 minalignslop 0 len 1 type NEAR_BNO otype NEAR_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x0 firstblock 0x1724

[first is the bmap alloc, second is the extents_to_btree alloc]
 
if agflcount = min(pagf_flcount, min_free)
   agflcount = min(3, 8)

and available = pagf_freeblks + agflcount - reservation - min_free - minleft
    available = 14 + 3 - 0 - 8 - 1

    available = 8

which satisfies the total from the first allocation request; however, if
this code path needs to refill the freelists and the ag btree is full
because a lot of space is allocated and not much is free, then inserts
here may trigger rebalances.  Usage might look something like this:

   pagf_freeblks = 14
   allocate 5 blocks to fill freelist
   pagf_freeblks = 9
   fill of freelist triggers split that requires 4 nodes
   next iteration allocates 4 blocks to refill freelist
   pagf_freeblks = 5
   refill requires rebalance and another node
   next iteration allocates 1 block to refill freelist
   pagf_freeblks = 4
   freelist filled; return to caller

   caller consumes remaining 4 blocks for bmap allocation

   pagf_freeblks = 0

   no blocks available for xfs_bmap_extents_to_btree
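
In toy form, the same bookkeeping as plain userspace C (made-up numbers
that just restate the walk-through above, not the kernel accounting):

/* toy restatement of the hypothetical freelist-refill scenario above */
#include <stdio.h>

int main(void)
{
	unsigned long pagf_freeblks = 14;
	unsigned long refills[] = { 5, 4, 1 };	/* hypothetical refill rounds */
	unsigned long bmap_alloc = 4;		/* blocks taken by the bmap alloc */
	int i;

	for (i = 0; i < 3; i++) {
		pagf_freeblks -= refills[i];
		printf("after refill %d: pagf_freeblks = %lu\n",
		       i + 1, pagf_freeblks);
	}

	pagf_freeblks -= bmap_alloc;
	printf("left for xfs_bmap_extents_to_btree: %lu\n", pagf_freeblks);
	return 0;
}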

I'm not sure if this is possible, but I thought I'd mention it since
Brian's prior work here got me thinking about it.  If this does sound
plausible, what do you think about re-validating the space_available
conditions after refilling the freelist?  Something like:

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 353e53b..d235744 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2730,6 +2730,16 @@ xfs_alloc_fix_freelist(
 		}
 	}
 	xfs_trans_brelse(tp, agflbp);
+
+	/*
+	 * Freelist refill may have consumed blocks from pagf_freeblks.  Ensure
+	 * that this allocation still meets its requested constraints by
+	 * revalidating the min_freelist and space_available checks.
+	 */
+	need = xfs_alloc_min_freelist(mp, pag);
+	if (!xfs_alloc_space_available(args, need, flags))
+		goto out_agbp_relse;
+
 	args->agbp = agbp;
 	return 0;
 
perhaps?


Thanks again,

-K

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: xfs_bmap_extents_to_btree allocation warnings
  2022-01-06  8:52   ` Krister Johansen
@ 2022-01-10  8:41     ` Dave Chinner
  2022-01-10 23:40       ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2022-01-10  8:41 UTC (permalink / raw)
  To: Krister Johansen; +Cc: linux-xfs, Darrick J. Wong, Christoph Hellwig

On Thu, Jan 06, 2022 at 12:52:28AM -0800, Krister Johansen wrote:
> On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> > On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > > However, linux is using 64-bit block
> > > pointers in the inode now and the XFS_ALLOCTYPE_START_BNO case in
> > > xfs_alloc_vextent() seems to try to ensure that it never considers an AG
> > > that's less than the agno for the fsbno passed in via args.
> > 
> > Because otherwise allocations ABBA deadlock on AG locking.
> 
> Sorry, what I'm really trying to ask is: are there still cases in XFS
> where using XFS_ALLOCTYPE_START_BNO can give you this kind of deadlock?

I'd say yes - I can think of several scenarios where we make
multiple allocations per transaction (directory and attribute code)
and I think that they don't actually use t_firstblock correctly to
avoid AGF locking issues. i.e. I think that tp->t_firstblock should
actually be tracking the highest locked AGF in the transaction, not
the first AGF we locked...

> There's some deadlock avoidance in the START_BNO implementation, but
> there are plenty of places where XFS_ALLOCTYPE_NEAR_BNO is still used if
> t_firstblock is not NULLFSBLOCK.

Right, NEAR_BNO is used because we likely have physical locality
constraints (e.g. seek minimisation by placing related blocks as
close to each other as possible) and potentially other limits, like
the allocations are for per-AG data and therefore must be placed
within the specified AG...

> I'm trying to work out if that code
> still exists because of concerns about deadlocks, or if its an attempt
> to limit the number of AGs searched instead.  (Or some other policy
> choice, even.)

All of the above, and more... :(

> > > be a reasonable way to address the WARN?  Or does this open a box of
> > > problems that are obvious to the experienced, but just subtle enough to
> > > elude the unfamiliar?
> > 
> > No, yes, and yes.
> 
> Thank you for humoring my questions nonetheless.
> 
> > > The xfs_db freesp report after the problem occurred.  (N.B. it was a few
> > > hours before I was able to get to this machine to investigate)
> > > 
> > > xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
> > >    from      to extents  blocks    pct
> > >       1       1      48      48   0.00
> > >       2       3     119     303   0.02
> > >       4       7      46     250   0.01
> > >       8      15      22     255   0.01
> > >      16      31      17     374   0.02
> > >      32      63      16     728   0.04
> > >      64     127       9     997   0.05
> > >     128     255     149   34271   1.83
> > >     256     511       7    2241   0.12
> > >     512    1023       4    2284   0.12
> > >    1024    2047       1    1280   0.07
> > >    2048    4095       1    3452   0.18
> > > 1048576 2097151       1 1825182  97.52
> > > total free extents 440
> > > total free blocks 1871665
> > > average free extent size 4253.78
> > 
> > So 1,871,665 of 228,849,020 blocks free in the AG. That's 99.2%
> > full, so it's extremely likely you are hitting a full AG condition.
> > 
> > /me goes and looks at xfs_iomap_write_direct()....
> > 
> > .... and notices that it passes "0" as the total allocation block
> > count, which means it isn't reserving space in the AG for both the
> > data extent and the BMBT blocks...
> > 
> > ... and several other xfs_bmapi_write() callers have the same
> > issue...
> > 
> > Ok, let me spend a bit more looking into this in more depth, but it
> > looks like the problem is at the xfs_bmapi_write() caller level, not
> > deep in the allocator itself.
> 
> At least on 5.4 xfs_bmapi_write is still passing resblks instead of
> zero, which is computed in xfs_iomap_write_direct.

yup, I missed commit da781e64b28c ("xfs: don't set bmapi total block
req where minleft is") back in 2019 where that behaviour was
changed, and instead it changes xfs_bmapi_write() to implicitly
manage space for BMBT blocks via args->minleft whilst still
explicitly requiring the caller to reserve those blocks at
transaction allocation time.

Bit of a mess, really, because multi-allocation transactions are
still required to pass both the data blocks + the possible BMBT
blocks that might be needed to xfs_bmapi_write(). I suspect that for
this case the implicit args->minleft reservation is double
accounted...

> Related to your comment about alloc_args.total, I did a bit of tracing
> of the xfs_alloc tracepoints on my system and found that total seems to
> be getting set in both cases, but that a) it's actually a larger value
> for directio; and b) in the buffered write case the code is requesting
> more blocks at one time which causes a larger allocation to occur.  I'm
> not certain, but wondered if this could be causing us to select an AG
> with more space by luck.
> 
> directio:
> 
>               dd-102229  [005] .... 4969662.383215: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 14240 minlen 4 maxlen 4 mod 0 prod 1 minleft 2 total 8 alignment 1 minalignslop 3 len 4 type THIS_BNO otype THIS_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

That one is correct - 4 extra blocks on top of the data extent...

> buffered write + fsync:
> 
>               dd-109921  [010] .... 4972814.844429: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 21280 minlen 16 maxlen 16 mod 0 prod 1 minleft 2 total 4 alignment 1 minalignslop 3 len 16 type THIS_BNO otype THIS_BNO wasdel 1 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

But that one is clearly wrong - we're asking for 16 blocks for the
data extent, but a total block count for the whole allocation of only 4 blocks.
Even though we've reserved 16 blocks during the initial delayed
allocation, we've still got to have 16 + 4 blocks free in the AG for
the allocation to succeed. That's one of the bugs the commit I
mentioned above fixed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: xfs_bmap_extents_to_btree allocation warnings
  2022-01-08  5:40   ` Krister Johansen
@ 2022-01-10  8:50     ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2022-01-10  8:50 UTC (permalink / raw)
  To: Krister Johansen
  Cc: linux-xfs, Darrick J. Wong, Christoph Hellwig, Brian Foster

On Fri, Jan 07, 2022 at 09:40:14PM -0800, Krister Johansen wrote:
> On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> > On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> I wondered if perhaps the problem was related to other problems in
> xfs_alloc_fix_freelist.  Taking inspiration from some of the fixes that
> Brian made here, it looks like there's a possibility of the freelist
> refill code grabbing blocks that were assumed to be available by
> previous checks in that function.
> 
> For example, using some values from a successful trace of a directio
> allocation:
> 
>               dd-102227  [027] .... 4969662.381037: xfs_alloc_near_first: dev 253:1 agno 0 agbno 5924 minlen 4 maxlen 4 mod 0 prod 1 minleft 1 total 8 alignment 4 minalignslop 0 len 4 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff
> 
>               dd-102227  [027] .... 4969662.381047: xfs_alloc_near_first: dev 253:1 agno 0 agbno 5921 minlen 1 maxlen 1 mod 0 prod 1 minleft 0 total 0 alignment 1 minalignslop 0 len 1 type NEAR_BNO otype NEAR_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x0 firstblock 0x1724
> 
> [first is the bmap alloc, second is the extents_to_btree alloc]
>  
> if agflcount = min(pagf_flcount, min_free)
>    agflcount = min(3, 8)
> 
> and available = pagf_freeblks + agflcount - reservation - min_free - minleft
>     available = 14 + 3 - 0 - 8 - 1
> 
>     available = 8
> 
> which satisfies the total from the first allocation request; however, if
> this code path needs to refill the freelists and the ag btree is full
> because a lot of space is allocated and not much is free, then inserts
> here may trigger rebalances.  Usage might look something like this:
> 
>    pagf_freeblks = 14
>    allocate 5 blocks to fill freelist
>    pagf_freeblks = 9
>    fill of freelist triggers split that requires 4 nodes
>    next iteration allocates 4 blocks to refill freelist
>    pagf_freeblks = 5
>    refill requires rebalance and another node
>    next iteration allocates 1 block to refill freelist
>    pagf_freeblks = 4
>    freelist filled; return to caller
> 
>    caller consumes remaining 4 blocks for bmap allocation
> 
>    pagf_freeblks = 0
> 
>    no blocks available for xfs_bmap_extents_to_btree

No, can't happen.

When the AG's free space is nearly exhausted, there is only one block
in the AG freespace trees - the root block. Those root blocks can hold
~500 free space extents. Hence if there are only 14 free blocks left in
the AG, then by definition we have a single level tree and a split
cannot ever occur.
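
As a back-of-the-envelope check (toy userspace code; ~500 records per
block is the rough figure above, 440 is the free extent count from the
earlier freesp report):

/* toy height check - not the kernel btree code */
#include <stdio.h>

static int btree_height(unsigned long nrecs, unsigned long recs_per_block)
{
	int height = 1;
	unsigned long capacity = recs_per_block;

	while (nrecs > capacity) {
		capacity *= recs_per_block;
		height++;
	}
	return height;
}

int main(void)
{
	printf("440 free extents -> height %d\n", btree_height(440, 500));
	printf("14 free extents  -> height %d\n", btree_height(14, 500));
	return 0;
}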

Remember, the AGFL is for space management btree splits and merges,
not for BMBT splits/merges. BMBT blocks are user metadata and have
to be reserved up front from the AG, they are not allocated
from/accounted via the free lists that the AGF freespace and rmap
btrees use for their block management.


> I'm not sure if this is possible, but I thought I'd mention it since
> Brian's prior work here got me thinking about it.  If this does sound
> plausible, what do you think about re-validating the space_available
> conditions after refilling the freelist?  Something like:
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 353e53b..d235744 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2730,6 +2730,16 @@ xfs_alloc_fix_freelist(
>  		}
>  	}
>  	xfs_trans_brelse(tp, agflbp);
> +
> +	/*
> +	 * Freelist refill may have consumed blocks from pagf_freeblks.  Ensure
> +	 * that this allocation still meets its requested constraints by
> +	 * revalidating the min_freelist and space_available checks.
> +	 */
> +	need = xfs_alloc_min_freelist(mp, pag);
> +	if (!xfs_alloc_space_available(args, need, flags))
> +		goto out_agbp_relse;

No, that will just lead to a filesystem shutdown because at ENOSPC
we'll have a transaction containing a dirty AGF/AGFL and freespace
btrees. At that point, the transaction cancellation will see that it
can't cleanly back out, and it's all over...

This is also an AGF ABBA deadlock vector, because now we have
a dirty AGF locked in the transaction that we haven't tracked via
t_firstblock....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: xfs_bmap_extents_to_btree allocation warnings
  2022-01-10  8:41     ` Dave Chinner
@ 2022-01-10 23:40       ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2022-01-10 23:40 UTC (permalink / raw)
  To: Krister Johansen; +Cc: linux-xfs, Darrick J. Wong, Christoph Hellwig

On Mon, Jan 10, 2022 at 07:41:52PM +1100, Dave Chinner wrote:
> On Thu, Jan 06, 2022 at 12:52:28AM -0800, Krister Johansen wrote:
> > On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> > > On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > > So 1,871,665 of 228,849,020 blocks free in the AG. That's 99.2%
> > > full, so it's extremely likely you are hitting a full AG condition.
> > > 
> > > /me goes and looks at xfs_iomap_write_direct()....
> > > 
> > > .... and notices that it passes "0" as the total allocation block
> > > count, which means it isn't reserving space in the AG for both the
> > > data extent and the BMBT blocks...
> > > 
> > > ... and several other xfs_bmapi_write() callers have the same
> > > issue...
> > > 
> > > Ok, let me spend a bit more looking into this in more depth, but it
> > > looks like the problem is at the xfs_bmapi_write() caller level, not
> > > deep in the allocator itself.
> > 
> > At least on 5.4 xfs_bmapi_write is still passing resblks instead of
> > zero, which is computed in xfs_iomap_write_direct.
> 
> yup, I missed commit da781e64b28c ("xfs: don't set bmapi total block
> req where minleft is") back in 2019 where that behaviour was
> changed, and instead it changes xfs_bmapi_write() to implcitly
> manage space for BMBT blocks via args->minleft whilst still
> explicitly requiring the caller to reserve those blocks at
> transaction allocation time.

FWIW, I should have said here that you should really add that commit
to your current tree and see if that fixes the problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-01-10 23:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-05  7:10 xfs_bmap_extents_to_btree allocation warnings Krister Johansen
2022-01-06  1:01 ` Dave Chinner
2022-01-06  8:52   ` Krister Johansen
2022-01-10  8:41     ` Dave Chinner
2022-01-10 23:40       ` Dave Chinner
2022-01-08  5:40   ` Krister Johansen
2022-01-10  8:50     ` Dave Chinner
