All of lore.kernel.org
 help / color / mirror / Atom feed
* regressions in xfs/168?
@ 2021-05-19 21:02 Darrick J. Wong
  2021-05-19 22:20 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2021-05-19 21:02 UTC (permalink / raw)
  To: xfs; +Cc: Dave Chinner, hsiangkao

Hm.  Does anyone /else/ see failures with the new test xfs/168 (the fs
shrink tests) on a 1k blocksize?  It looks as though we shrink the AG so
small that we trip the assert at the end of xfs_ag_resv_init that checks
that the reservations for an AG don't exceed the free space in that AG,
but tripping that doesn't return any error code, so xfs_ag_shrink_space
commits the new fs size and presses on with even more shrinking until
we've depleted AG 1 so thoroughly that the fs won't mount anymore.

At a bare minimum we probably need to check the same thing the assert
does and bail out of the shrink; or maybe we just need to create a
function to adjust an AG's reservation to make that function less
complicated.

--D

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 flax-mtr00 5.13.0-rc2-xfsx #rc2 SMP PREEMPT Mon May 17 15:26:13 PDT 2021
MKFS_OPTIONS  -- -f -b size=1024, /dev/sdf
MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdf /opt

xfs/168
Message from syslogd@flax-mtr00 at May 19 13:50:05 ...
 kernel:[ 9688.703923] XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 332

Message from syslogd@flax-mtr00 at May 19 13:50:06 ...
 kernel:[ 9689.186021] XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 332

Message from syslogd@flax-mtr00 at May 19 13:50:07 ...
 kernel:[ 9690.313532] XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 332

Message from syslogd@flax-mtr00 at May 19 13:50:07 ...
 kernel:[ 9690.359752] XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 332

Message from syslogd@flax-mtr00 at May 19 13:50:07 ...
 kernel:[ 9690.406718] XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 332

Message from syslogd@flax-mtr00 at May 19 13:50:07 ...
 kernel:[ 9690.977567] XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 332
[failed, exit status 1]- output mismatch (see /var/tmp/fstests/xfs/168.out.bad)
    --- tests/xfs/168.out       2021-05-16 18:48:31.290361859 -0700
    +++ /var/tmp/fstests/xfs/168.out.bad        2021-05-19 13:50:09.520067445 -0700
    @@ -1,2 +1,3 @@
     QA output created by 168
    -Silence is golden
    +xfs_repair failed with shrinking 748457
    +(see /var/tmp/fstests/xfs/168.full for details)
    ...
    (Run 'diff -u /tmp/fstests/tests/xfs/168.out /var/tmp/fstests/xfs/168.out.bad'  to see the entire diff)
Ran: xfs/168
Failures: xfs/168
Failed 1 of 1 tests

Test xfs/168 FAILED with code 1 and bad golden output:
--- /tmp/fstests/tests/xfs/168.out      2021-05-16 18:48:31.290361859 -0700
+++ /var/tmp/fstests/xfs/168.out.bad    2021-05-19 13:50:09.520067445 -0700
@@ -1,2 +1,3 @@
 QA output created by 168
-Silence is golden
+xfs_repair failed with shrinking 748457
+(see /var/tmp/fstests/xfs/168.full for details)

--D

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: regressions in xfs/168?
  2021-05-19 21:02 regressions in xfs/168? Darrick J. Wong
@ 2021-05-19 22:20 ` Dave Chinner
  2021-05-20  0:08   ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2021-05-19 22:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs, hsiangkao

On Wed, May 19, 2021 at 02:02:05PM -0700, Darrick J. Wong wrote:
> Hm.  Does anyone /else/ see failures with the new test xfs/168 (the fs
> shrink tests) on a 1k blocksize?  It looks as though we shrink the AG so
> small that we trip the assert at the end of xfs_ag_resv_init that checks
> that the reservations for an AG don't exceed the free space in that AG,
> but tripping that doesn't return any error code, so xfs_ag_shrink_space
> commits the new fs size and presses on with even more shrinking until
> we've depleted AG 1 so thoroughly that the fs won't mount anymore.

Yup, now that I've got the latest fstests I see that failure, too.

[58972.431760] Call Trace:
[58972.432467]  xfs_ag_resv_init+0x1d3/0x240
[58972.433611]  xfs_ag_shrink_space+0x1bf/0x360
[58972.434801]  xfs_growfs_data+0x413/0x640
[58972.435894]  xfs_file_ioctl+0x32f/0xd30
[58972.439289]  __x64_sys_ioctl+0x8e/0xc0
[58972.440337]  do_syscall_64+0x3a/0x70
[58972.441347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[58972.442741] RIP: 0033:0x7f7021755d87

> At a bare minimum we probably need to check the same thing the assert
> does and bail out of the shrink; or maybe we just need to create a
> function to adjust an AG's reservation to make that function less
> complicated.

So if I'm reading xfs_ag_shrink_space() correctly, it doesn't
check what the new reservation will be and so it's purely looking at
whether the physical range can be freed or not? And when freeing
that physical range results in less free space in the AG than the
reservation requires, we pop an assert failure rather than failing
the reservation and undoing the shrink like the code is supposed to
do?

IOWs, the problem is the ASSERT firing on debug kernels, not the
actual shrink code that does handle this reservation ENOSPC error
case properly? i.e. we've got something like an uncaught overflow
in xfs_ag_resv_init() that is tripping the assert? (e.g. used >
ask)

So I'm not sure that the problem is the shrink code here - it should
undo a reservation failure just fine, but the reservation code is
failing before we get there on a debug kernel...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: regressions in xfs/168?
  2021-05-19 22:20 ` Dave Chinner
@ 2021-05-20  0:08   ` Darrick J. Wong
  2021-05-20  8:23     ` Gao Xiang
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2021-05-20  0:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, hsiangkao

On Thu, May 20, 2021 at 08:20:06AM +1000, Dave Chinner wrote:
> On Wed, May 19, 2021 at 02:02:05PM -0700, Darrick J. Wong wrote:
> > Hm.  Does anyone /else/ see failures with the new test xfs/168 (the fs
> > shrink tests) on a 1k blocksize?  It looks as though we shrink the AG so
> > small that we trip the assert at the end of xfs_ag_resv_init that checks
> > that the reservations for an AG don't exceed the free space in that AG,
> > but tripping that doesn't return any error code, so xfs_ag_shrink_space
> > commits the new fs size and presses on with even more shrinking until
> > we've depleted AG 1 so thoroughly that the fs won't mount anymore.
> 
> Yup, now that I've got the latest fstests I see that failure, too.
> 
> [58972.431760] Call Trace:
> [58972.432467]  xfs_ag_resv_init+0x1d3/0x240
> [58972.433611]  xfs_ag_shrink_space+0x1bf/0x360
> [58972.434801]  xfs_growfs_data+0x413/0x640
> [58972.435894]  xfs_file_ioctl+0x32f/0xd30
> [58972.439289]  __x64_sys_ioctl+0x8e/0xc0
> [58972.440337]  do_syscall_64+0x3a/0x70
> [58972.441347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [58972.442741] RIP: 0033:0x7f7021755d87
> 
> > At a bare minimum we probably need to check the same thing the assert
> > does and bail out of the shrink; or maybe we just need to create a
> > function to adjust an AG's reservation to make that function less
> > complicated.
> 
> So if I'm reading xfs_ag_shrink_space() correctly, it doesn't
> check what the new reservation will be and so it's purely looking at
> whether the physical range can be freed or not? And when freeing
> that physical range results in less free space in the AG than the
> reservation requires, we pop an assert failure rather than failing
> the reservation and undoing the shrink like the code is supposed to
> do?

Yes.  I've wondered for a while now if that assert in xfs_ag_resv_init
should get turned into an ENOSPC return so that callers can decide what
they want to do with it.

--D

> IOWs, the problem is the ASSERT firing on debug kernels, not the
> actual shrink code that does handle this reservation ENOSPC error
> case properly? i.e. we've got something like an uncaught overflow
> in xfs_ag_resv_init() that is tripping the assert? (e.g. used >
> ask)
> 
> So I'm not sure that the problem is the shrink code here - it should
> undo a reservation failure just fine, but the reservation code is
> failing before we get there on a debug kernel...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: regressions in xfs/168?
  2021-05-20  0:08   ` Darrick J. Wong
@ 2021-05-20  8:23     ` Gao Xiang
  2021-05-20 16:44       ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Gao Xiang @ 2021-05-20  8:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, xfs

Hi Darrick and Dave,

On Wed, May 19, 2021 at 05:08:02PM -0700, Darrick J. Wong wrote:
> On Thu, May 20, 2021 at 08:20:06AM +1000, Dave Chinner wrote:
> > On Wed, May 19, 2021 at 02:02:05PM -0700, Darrick J. Wong wrote:
> > > Hm.  Does anyone /else/ see failures with the new test xfs/168 (the fs
> > > shrink tests) on a 1k blocksize?  It looks as though we shrink the AG so
> > > small that we trip the assert at the end of xfs_ag_resv_init that checks
> > > that the reservations for an AG don't exceed the free space in that AG,
> > > but tripping that doesn't return any error code, so xfs_ag_shrink_space
> > > commits the new fs size and presses on with even more shrinking until
> > > we've depleted AG 1 so thoroughly that the fs won't mount anymore.
> > 
> > Yup, now that I've got the latest fstests I see that failure, too.
> > 
> > [58972.431760] Call Trace:
> > [58972.432467]  xfs_ag_resv_init+0x1d3/0x240
> > [58972.433611]  xfs_ag_shrink_space+0x1bf/0x360
> > [58972.434801]  xfs_growfs_data+0x413/0x640
> > [58972.435894]  xfs_file_ioctl+0x32f/0xd30
> > [58972.439289]  __x64_sys_ioctl+0x8e/0xc0
> > [58972.440337]  do_syscall_64+0x3a/0x70
> > [58972.441347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [58972.442741] RIP: 0033:0x7f7021755d87
> > 
> > > At a bare minimum we probably need to check the same thing the assert
> > > does and bail out of the shrink; or maybe we just need to create a
> > > function to adjust an AG's reservation to make that function less
> > > complicated.
> > 
> > So if I'm reading xfs_ag_shrink_space() correctly, it doesn't
> > check what the new reservation will be and so it's purely looking at
> > whether the physical range can be freed or not? And when freeing
> > that physical range results in less free space in the AG than the
> > reservation requires, we pop an assert failure rather than failing
> > the reservation and undoing the shrink like the code is supposed to
> > do?
> 
> Yes.  I've wondered for a while now if that assert in xfs_ag_resv_init
> should get turned into an ENOSPC return so that callers can decide what
> they want to do with it.

Thanks for the detailed analysis (sorry that I didn't check the 1k blocksize
case before), I'm now renting a department in a new city, no xfstests env
available for now.

But if I read/understand correctly, the following code might resolve the issue?

diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 6c5f8d10589c..1f918afd5e91 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -312,10 +312,12 @@ xfs_ag_resv_init(
 	if (error)
 		return error;
 
-	ASSERT(xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
-	       xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <=
-	       pag->pagf_freeblks + pag->pagf_flcount);
 #endif
+	if (xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
+	    xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved >
+	    pag->pagf_freeblks + pag->pagf_flcount)
+		return -ENOSPC;
+
 out:
 	return error;
 }

If that works, could you kindly send out it (or some better/sane solution),
many thanks in advance!

Thanks,
Gao Xiang

> 
> --D
> 
> > IOWs, the problem is the ASSERT firing on debug kernels, not the
> > actual shrink code that does handle this reservation ENOSPC error
> > case properly? i.e. we've got something like an uncaught overflow
> > in xfs_ag_resv_init() that is tripping the assert? (e.g. used >
> > ask)
> > 
> > So I'm not sure that the problem is the shrink code here - it should
> > undo a reservation failure just fine, but the reservation code is
> > failing before we get there on a debug kernel...
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: regressions in xfs/168?
  2021-05-20  8:23     ` Gao Xiang
@ 2021-05-20 16:44       ` Darrick J. Wong
  0 siblings, 0 replies; 5+ messages in thread
From: Darrick J. Wong @ 2021-05-20 16:44 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Dave Chinner, xfs

On Thu, May 20, 2021 at 04:23:22PM +0800, Gao Xiang wrote:
> Hi Darrick and Dave,
> 
> On Wed, May 19, 2021 at 05:08:02PM -0700, Darrick J. Wong wrote:
> > On Thu, May 20, 2021 at 08:20:06AM +1000, Dave Chinner wrote:
> > > On Wed, May 19, 2021 at 02:02:05PM -0700, Darrick J. Wong wrote:
> > > > Hm.  Does anyone /else/ see failures with the new test xfs/168 (the fs
> > > > shrink tests) on a 1k blocksize?  It looks as though we shrink the AG so
> > > > small that we trip the assert at the end of xfs_ag_resv_init that checks
> > > > that the reservations for an AG don't exceed the free space in that AG,
> > > > but tripping that doesn't return any error code, so xfs_ag_shrink_space
> > > > commits the new fs size and presses on with even more shrinking until
> > > > we've depleted AG 1 so thoroughly that the fs won't mount anymore.
> > > 
> > > Yup, now that I've got the latest fstests I see that failure, too.
> > > 
> > > [58972.431760] Call Trace:
> > > [58972.432467]  xfs_ag_resv_init+0x1d3/0x240
> > > [58972.433611]  xfs_ag_shrink_space+0x1bf/0x360
> > > [58972.434801]  xfs_growfs_data+0x413/0x640
> > > [58972.435894]  xfs_file_ioctl+0x32f/0xd30
> > > [58972.439289]  __x64_sys_ioctl+0x8e/0xc0
> > > [58972.440337]  do_syscall_64+0x3a/0x70
> > > [58972.441347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [58972.442741] RIP: 0033:0x7f7021755d87
> > > 
> > > > At a bare minimum we probably need to check the same thing the assert
> > > > does and bail out of the shrink; or maybe we just need to create a
> > > > function to adjust an AG's reservation to make that function less
> > > > complicated.
> > > 
> > > So if I'm reading xfs_ag_shrink_space() correctly, it doesn't
> > > check what the new reservation will be and so it's purely looking at
> > > whether the physical range can be freed or not? And when freeing
> > > that physical range results in less free space in the AG than the
> > > reservation requires, we pop an assert failure rather than failing
> > > the reservation and undoing the shrink like the code is supposed to
> > > do?
> > 
> > Yes.  I've wondered for a while now if that assert in xfs_ag_resv_init
> > should get turned into an ENOSPC return so that callers can decide what
> > they want to do with it.
> 
> Thanks for the detailed analysis (sorry that I didn't check the 1k blocksize
> case before), I'm now renting a department in a new city, no xfstests env
> available for now.
> 
> But if I read/understand correctly, the following code might resolve the issue?
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 6c5f8d10589c..1f918afd5e91 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -312,10 +312,12 @@ xfs_ag_resv_init(
>  	if (error)
>  		return error;
>  
> -	ASSERT(xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
> -	       xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <=
> -	       pag->pagf_freeblks + pag->pagf_flcount);
>  #endif
> +	if (xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
> +	    xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved >
> +	    pag->pagf_freeblks + pag->pagf_flcount)
> +		return -ENOSPC;
> +
>  out:
>  	return error;
>  }
> 
> If that works, could you kindly send out it (or some better/sane solution),
> many thanks in advance!

That does seem to fix the symptoms, though I'm gonna take a closer look
at the error handling elsewhere in that function.

--D

> 
> Thanks,
> Gao Xiang
> 
> > 
> > --D
> > 
> > > IOWs, the problem is the ASSERT firing on debug kernels, not the
> > > actual shrink code that does handle this reservation ENOSPC error
> > > case properly? i.e. we've got something like an uncaught overflow
> > > in xfs_ag_resv_init() that is tripping the assert? (e.g. used >
> > > ask)
> > > 
> > > So I'm not sure that the problem is the shrink code here - it should
> > > undo a reservation failure just fine, but the reservation code is
> > > failing before we get there on a debug kernel...
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-05-20 16:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-19 21:02 regressions in xfs/168? Darrick J. Wong
2021-05-19 22:20 ` Dave Chinner
2021-05-20  0:08   ` Darrick J. Wong
2021-05-20  8:23     ` Gao Xiang
2021-05-20 16:44       ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.