* [PATCH v1] xfs: fix sb write verify for lazysbcount
@ 2022-10-22  2:03 Long Li
  2022-10-22  2:14 ` Darrick J. Wong
  2022-10-25  9:15 ` [PATCH v2] " Long Li
  0 siblings, 2 replies; 17+ messages in thread
From: Long Li @ 2022-10-22  2:03 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Chandan Babu R, Eric Sandeen, Bill O'Donnell,
	linux-xfs, leo.lilong, yi.zhang, houtao1, guoxuenan

When lazysbcount is enabled, a multithreaded stress test of XFS reports
the following problems:

XFS (loop0): SB summary counter sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
	     +0x13b/0x460, xfs_sb block 0x0
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed

The cause of the problem is that during log recovery, incorrect icount
and ifree values are recovered from the log and fail the size check in
xfs_validate_sb_write().

With lazysbcount enabled, there is no additional lock protection for
reading m_ifree and m_icount in xfs_log_sb(). If another thread modifies
m_ifree between the reads of m_icount and m_ifree, m_ifree can become
larger than m_icount and be written to the log. After an unclean
shutdown, this is corrected by xfs_initialize_perag_data() rebuilding
the counters from the AGF block counts, but that correction happens
after log recovery. During log recovery, the incorrect ifree/icount may
be restored from the log and written to the superblock; since ifree and
icount have not yet been corrected at that point, the relationship
between ifree and icount cannot be expected to hold in
xfs_validate_sb_write().

So, don't check ifree against icount in xfs_validate_sb_write() when
lazysbcount is enabled.

Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/libxfs/xfs_sb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a20cade590e9..b4a4e57361e7 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -245,7 +245,7 @@ xfs_validate_sb_write(
 	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
 	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
 	     !xfs_verify_icount(mp, sbp->sb_icount) ||
-	     sbp->sb_ifree > sbp->sb_icount)) {
+	     (!xfs_has_lazysbcount(mp) && sbp->sb_ifree > sbp->sb_icount))) {
 		xfs_warn(mp, "SB summary counter sanity check failed");
 		return -EFSCORRUPTED;
 	}
-- 
2.31.1



* Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
  2022-10-22  2:03 [PATCH v1] xfs: fix sb write verify for lazysbcount Long Li
@ 2022-10-22  2:14 ` Darrick J. Wong
  2022-10-22 12:01   ` Long Li
  2022-10-25  9:15 ` [PATCH v2] " Long Li
  1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2022-10-22  2:14 UTC (permalink / raw)
  To: Long Li
  Cc: Dave Chinner, Chandan Babu R, Eric Sandeen, Bill O'Donnell,
	linux-xfs, yi.zhang, houtao1, guoxuenan

On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> When lazysbcount is enabled, multiple threads stress test the xfs report
> the following problems:
> 
> XFS (loop0): SB summary counter sanity check failed
> XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> 	     +0x13b/0x460, xfs_sb block 0x0
> XFS (loop0): Unmount and run xfs_repair
> XFS (loop0): First 128 bytes of corrupted metadata buffer:
> 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> XFS (loop0): log mount/recovery failed: error -117
> XFS (loop0): log mount failed
> 
> The cause of the problem is that during the log recovery process, incorrect
> icount and ifree are recovered from the log and fail to pass the size check

Are you saying that the log contained a transaction in which ifree >
icount?

> in xfs_validate_sb_write().
> 
> With lazysbcount is enabled, There is no additional lock protection for
> reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> the m_ifree between the read m_icount and the m_ifree, this will make the
> m_ifree larger than m_icount and written to the log. If we have an unclean
> shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> the counters from the AGF block counts, and the correction is later than
> log recovery. During log recovery, incorrect ifree/icount may be restored
> from the log and written to the super block, since ifree and icount have
> not been corrected at this time, the relationship between ifree and icount
> cannot be checked in xfs_validate_sb_write().
> 
> So, don't check the size between ifree and icount in xfs_validate_sb_write()
> when lazysbcount is enabled.

Um, doesn't that neuter a sanity check on all V5 filesystems?

--D

> Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/libxfs/xfs_sb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index a20cade590e9..b4a4e57361e7 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -245,7 +245,7 @@ xfs_validate_sb_write(
>  	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
>  	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
>  	     !xfs_verify_icount(mp, sbp->sb_icount) ||
> -	     sbp->sb_ifree > sbp->sb_icount)) {
> +	     (!xfs_has_lazysbcount(mp) && sbp->sb_ifree > sbp->sb_icount))) {
>  		xfs_warn(mp, "SB summary counter sanity check failed");
>  		return -EFSCORRUPTED;
>  	}
> -- 
> 2.31.1
> 


* Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
  2022-10-22  2:14 ` Darrick J. Wong
@ 2022-10-22 12:01   ` Long Li
  2022-10-22 21:16     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2022-10-22 12:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Chandan Babu R, Eric Sandeen, Bill O'Donnell,
	linux-xfs, leo.lilong, yi.zhang, houtao1, guoxuenan

On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > When lazysbcount is enabled, multiple threads stress test the xfs report
> > the following problems:
> > 
> > XFS (loop0): SB summary counter sanity check failed
> > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> > 	     +0x13b/0x460, xfs_sb block 0x0
> > XFS (loop0): Unmount and run xfs_repair
> > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > XFS (loop0): log mount/recovery failed: error -117
> > XFS (loop0): log mount failed
> > 
> > The cause of the problem is that during the log recovery process, incorrect
> > icount and ifree are recovered from the log and fail to pass the size check
> 
> Are you saying that the log contained a transaction in which ifree >
> icount?

Yes, this situation is possible. For example, consider the following sequence:

 CPU0				    CPU1
 xfs_log_sb			    xfs_trans_unreserve_and_mod_sb
 ----------			    ------------------------------
 percpu_counter_sum(&mp->m_icount)
				    percpu_counter_add(&mp->m_icount, idelta)
				    percpu_counter_add_batch(&mp->m_icount,
						  idelta, XFS_ICOUNT_BATCH)
 percpu_counter_sum(&mp->m_ifree)

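To make the race window concrete, here is a minimal userspace sketch
(hypothetical demo code, not the kernel implementation): each counter is
updated atomically, but the pair is sampled without a lock, so the
sampled ifree can exceed the sampled icount even though the real state
never violates ifree <= icount. Build with "cc -pthread race.c":

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long icount = 64;	/* allocated inodes, 32 of them free */
static atomic_long ifree = 32;

/* Stands in for xfs_trans_unreserve_and_mod_sb(): two separate,
 * individually-atomic counter updates. */
static void *modifier(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		atomic_fetch_add(&icount, 64);	/* new chunk, all free */
		atomic_fetch_add(&ifree, 64);
		atomic_fetch_sub(&ifree, 64);	/* remove it again */
		atomic_fetch_sub(&icount, 64);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, modifier, NULL);
	for (int i = 0; i < 1000000; i++) {
		/* Stands in for xfs_log_sb(): read icount, then ifree,
		 * with no lock held across the two reads. */
		long ic = atomic_load(&icount);
		long fr = atomic_load(&ifree);

		if (fr > ic)
			printf("torn snapshot: ifree %ld > icount %ld\n",
			       fr, ic);
	}
	pthread_join(t, NULL);
	return 0;
}
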
> 
> > in xfs_validate_sb_write().
> > 
> > With lazysbcount is enabled, There is no additional lock protection for
> > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > the m_ifree between the read m_icount and the m_ifree, this will make the
> > m_ifree larger than m_icount and written to the log. If we have an unclean
> > shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> > the counters from the AGF block counts, and the correction is later than
> > log recovery. During log recovery, incorrect ifree/icount may be restored
> > from the log and written to the super block, since ifree and icount have
> > not been corrected at this time, the relationship between ifree and icount
> > cannot be checked in xfs_validate_sb_write().
> > 
> > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > when lazysbcount is enabled.
> 
> Um, doesn't that neuter a sanity check on all V5 filesystems?

Yes, that modification doesn't look like the best way; all sb write
checks would be affected. We could instead check whether the mount was
clean to reduce the scope of the change, but this requires setting
XFS_OPSTATE_CLEAN after re-initialising the incore superblock counters,
like this:

--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -242,10 +242,12 @@ xfs_validate_sb_write(
 	 * secondary superblocks, so allow this usage to continue because
 	 * we never read counters from such superblocks.
 	 */
+	bool check = !(xfs_has_lazysbcount(mp) && !xfs_is_clean(mp));
+
 	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
 	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
 	     !xfs_verify_icount(mp, sbp->sb_icount) ||
-	     sbp->sb_ifree > sbp->sb_icount)) {
+	     (check && sbp->sb_ifree > sbp->sb_icount))) {
 		xfs_warn(mp, "SB summary counter sanity check failed");
 		return -EFSCORRUPTED;
 	}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index e8bb3c2e847e..0637e5d01e72 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -535,6 +535,8 @@ xfs_check_summary_counts(
 			return error;
 	}
 
+	set_bit(XFS_OPSTATE_CLEAN, &mp->m_opstate);
+
 	return 0;
 }
> 
> --D
> 
> > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/libxfs/xfs_sb.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > index a20cade590e9..b4a4e57361e7 100644
> > --- a/fs/xfs/libxfs/xfs_sb.c
> > +++ b/fs/xfs/libxfs/xfs_sb.c
> > @@ -245,7 +245,7 @@ xfs_validate_sb_write(
> >  	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
> >  	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
> >  	     !xfs_verify_icount(mp, sbp->sb_icount) ||
> > -	     sbp->sb_ifree > sbp->sb_icount)) {
> > +	     (!xfs_has_lazysbcount(mp) && sbp->sb_ifree > sbp->sb_icount))) {
> >  		xfs_warn(mp, "SB summary counter sanity check failed");
> >  		return -EFSCORRUPTED;
> >  	}
> > -- 
> > 2.31.1
> > 


* Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
  2022-10-22 12:01   ` Long Li
@ 2022-10-22 21:16     ` Dave Chinner
  2022-10-24  4:07       ` Darrick J. Wong
  2022-10-24  5:06       ` Long Li
  0 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2022-10-22 21:16 UTC (permalink / raw)
  To: Long Li
  Cc: Darrick J. Wong, Dave Chinner, Chandan Babu R, Eric Sandeen,
	Bill O'Donnell, linux-xfs, yi.zhang, houtao1, guoxuenan

On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > the following problems:

We've had lazy sb counters for 15 years and just about every XFS
filesystem in production uses them, so providing us with some idea
of the scope of the problem and how to reproduce it would be greatly
appreciated.

What stress test are you running? What filesystem config does it
manifest on (other than lazysbcount=1)?  How long does the stress
test run for, and where/why does log recovery get run in this stress
test?

> > > XFS (loop0): SB summary counter sanity check failed
> > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> > > 	     +0x13b/0x460, xfs_sb block 0x0
> > > XFS (loop0): Unmount and run xfs_repair
> > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > XFS (loop0): log mount/recovery failed: error -117
> > > XFS (loop0): log mount failed
> > > 
> > > The cause of the problem is that during the log recovery process, incorrect
> > > icount and ifree are recovered from the log and fail to pass the size check
> > 
> > Are you saying that the log contained a transaction in which ifree >
> > icount?
> 
> Yes, this situation is possible. For example consider the following sequence:
> 
>  CPU0				    CPU1
>  xfs_log_sb			    xfs_trans_unreserve_and_mod_sb
>  ----------			    ------------------------------
>  percpu_counter_sum(&mp->m_icount)
> 				    percpu_counter_add(&mp->m_icount, idelta)
> 				    percpu_counter_add_batch(&mp->m_icount,
> 						  idelta, XFS_ICOUNT_BATCH)
>  percpu_counter_sum(&mp->m_ifree)

What caused the xfs_log_sb() to be called? Very few things
actually log the superblock this way at runtime - it's generally
only logged directly like this when a feature bit changes during a
transaction (rare) or at a synchronisation point when everything
else is idle and there's no chance of a race like this occurring...

I can see a couple of routes to this occurring via feature bit
modification, but I don't see them being easy to hit or something
that would exist for very long in the journal. Hence I'm wondering
if there should be runtime protection for xfs_log_sb() to avoid
these problems....

> > > in xfs_validate_sb_write().
> > > 
> > > With lazysbcount is enabled, There is no additional lock protection for
> > > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > > the m_ifree between the read m_icount and the m_ifree, this will make the
> > > m_ifree larger than m_icount and written to the log. If we have an unclean
> > > shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> > > the counters from the AGF block counts, and the correction is later than
> > > log recovery. During log recovery, incorrect ifree/icount may be restored
> > > from the log and written to the super block, since ifree and icount have
> > > not been corrected at this time, the relationship between ifree and icount
> > > cannot be checked in xfs_validate_sb_write().
> > > 
> > > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > > when lazysbcount is enabled.
> > 
> > Um, doesn't that neuter a sanity check on all V5 filesystems?
>
> Yes, such modifications don't look like the best way, all sb write checks 
> will be affect. Maybe it can increase the judgment of clean mount and reduce
> the scope of influence, but this requires setting the XFS_OPSTATE_CLEAN after
> re-initialise incore superblock counters, like this:

I'm not sure that silencing the warning and continuing log recovery
with an invalid accounting state is the best way to proceed. We
haven't yet recovered unlinked inodes at the time this warning
fires. If ifree really is larger than icount, then we've got a
real problem at this point in log recovery.

Hence I suspect that we should be looking at fixing the runtime race
condition that leads to the problem, not trying to work around
inconsistency in the recovery code.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
  2022-10-22 21:16     ` Dave Chinner
@ 2022-10-24  4:07       ` Darrick J. Wong
  2022-10-24  5:43         ` Dave Chinner
  2022-10-24  5:06       ` Long Li
  1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2022-10-24  4:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Long Li, Dave Chinner, Chandan Babu R, Eric Sandeen,
	Bill O'Donnell, linux-xfs, yi.zhang, houtao1, guoxuenan

On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > the following problems:
> 
> We've had lazy sb counters for 15 years and just about every XFS
> filesystem in production uses them, so providing us with some idea
> of the scope of the problem and how to reproduce it would be greatly
> appreciated.
> 
> What stress test are you running? What filesystem config does it
> manifest on (other than lazysbcount=1)?  How long does the stress
> test run for, and where/why does log recovery get run in this stress
> test?
> 
> > > > XFS (loop0): SB summary counter sanity check failed
> > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> > > > 	     +0x13b/0x460, xfs_sb block 0x0
> > > > XFS (loop0): Unmount and run xfs_repair
> > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > XFS (loop0): log mount/recovery failed: error -117
> > > > XFS (loop0): log mount failed
> > > > 
> > > > The cause of the problem is that during the log recovery process, incorrect
> > > > icount and ifree are recovered from the log and fail to pass the size check
> > > 
> > > Are you saying that the log contained a transaction in which ifree >
> > > icount?
> > 
> > Yes, this situation is possible. For example consider the following sequence:
> > 
> >  CPU0				    CPU1
> >  xfs_log_sb			    xfs_trans_unreserve_and_mod_sb
> >  ----------			    ------------------------------
> >  percpu_counter_sum(&mp->m_icount)
> > 				    percpu_counter_add(&mp->m_icount, idelta)
> > 				    percpu_counter_add_batch(&mp->m_icount,
> > 						  idelta, XFS_ICOUNT_BATCH)
> >  percpu_counter_sum(&mp->m_ifree)
> 
> What caused the xfs_log_sb() to be called? Very few things
> actually log the superblock this way at runtime - it's generally
> only logged directly like this when a feature bit changes during a
> transaction (rare) or at a synchronisation point when everything
> else is idle and there's no chance of a race like this occurring...
> 
> I can see a couple of routes to this occurring via feature bit
> modification, but I don't see them being easy to hit or something
> that would exist for very long in the journal. Hence I'm wondering
> if there should be runtime protection for xfs_log_sb() to avoid
> these problems....

Maybe.  Or perhaps we sample m_i{count,free} until they come up with a
value that will pass the verifier, and only then log the new values to
the primary super xfs_buf?

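Something like this, perhaps (a hypothetical sketch, not a tested
change):

	/* Hypothetical: resample until the pair is self-consistent. */
	uint64_t icount, ifree;

	do {
		icount = percpu_counter_sum(&mp->m_icount);
		ifree = percpu_counter_sum(&mp->m_ifree);
	} while (ifree > icount);

	mp->m_sb.sb_icount = icount;
	mp->m_sb.sb_ifree = ifree;
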
> > > > in xfs_validate_sb_write().
> > > > 
> > > > With lazysbcount is enabled, There is no additional lock protection for
> > > > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > > > the m_ifree between the read m_icount and the m_ifree, this will make the
> > > > m_ifree larger than m_icount and written to the log. If we have an unclean
> > > > shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> > > > the counters from the AGF block counts, and the correction is later than
> > > > log recovery. During log recovery, incorrect ifree/icount may be restored
> > > > from the log and written to the super block, since ifree and icount have
> > > > not been corrected at this time, the relationship between ifree and icount
> > > > cannot be checked in xfs_validate_sb_write().
> > > > 
> > > > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > > > when lazysbcount is enabled.
> > > 
> > > Um, doesn't that neuter a sanity check on all V5 filesystems?
> >
> > Yes, such modifications don't look like the best way, all sb write checks 
> > will be affect. Maybe it can increase the judgment of clean mount and reduce
> > the scope of influence, but this requires setting the XFS_OPSTATE_CLEAN after
> > re-initialise incore superblock counters, like this:
> 
> I'm not sure that silencing the warning and continuing log recovery
> with an invalid accounting state is the best way to proceed. We
> haven't yet recovered unlinked inodes at the time this warning
> fires. If ifree really is larger than icount, then we've got a
> real problem at this point in log recovery.
> 
> Hence I suspect that we should be looking at fixing the runtime race
> condition that leads to the problem, not trying to work around
> inconsistency in the recovery code.

Agreed.  Higher level code shouldn't be storing garbage ifree/icount to
the primary superblock xfs_buf, period.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
  2022-10-22 21:16     ` Dave Chinner
  2022-10-24  4:07       ` Darrick J. Wong
@ 2022-10-24  5:06       ` Long Li
  1 sibling, 0 replies; 17+ messages in thread
From: Long Li @ 2022-10-24  5:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Dave Chinner, Chandan Babu R, Eric Sandeen,
	Bill O'Donnell, linux-xfs, yi.zhang, houtao1, guoxuenan

On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > the following problems:
> 
> We've had lazy sb counters for 15 years and just about every XFS
> filesystem in production uses them, so providing us with some idea
> of the scope of the problem and how to reproduce it would be greatly
> appreciated.

Ok, here is my simplified script that can reproduce the bug:

#!/bin/bash

device=/dev/sda
testdir=/mnt/test
round=0

function fail()
{
	echo "$*"
	exit 1
}

while [ $round -lt 10000 ]
do
	echo "******* round $round ********"
	mkfs.xfs -f $device
	mkdir -p $testdir
	mount $device $testdir || fail "mount failed!"
	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
	sleep 4
	killall -w fsstress
	umount $testdir
	xfs_repair -e $device > /dev/null
	if [ $? -eq 2 ];then
		echo "ERR CODE 2: Dirty log exception encountered during repair."
		exit 1
	fi
	sleep 1
	round=$(($round+1))
done
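
(The script assumes $device is a scratch device whose contents can be
destroyed, and that fsstress from the xfstests/LTP suites is on $PATH.)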

> 
> What stress test are you running? What filesystem config does it
> manifest on (other than lazysbcount=1)?  How long does the stress
> test run for, and where/why does log recovery get run in this stress
> test?

I tested with the fsstress tool and used the default mount options.
More configuration details are as follows:

CONFIG_XFS_FS=y
CONFIG_XFS_SUPPORT_V4=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_XFS_DEBUG=y
CONFIG_XFS_ASSERT_FATAL=y

xfs_db> info
meta-data=/dev/sda               isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

With the above test script, it can take several days to reproduce the bug.
Adding a delay in xfs_log_sb() reproduces the problem much more quickly:

 	if (xfs_has_lazysbcount(mp)) {
 		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
+		udelay(2000);
 		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
 		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
 	}
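
The delay just widens the window between the two percpu_counter_sum()
calls in xfs_log_sb(), making the racy interleaving far more likely to
be hit.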

> 
> > > > XFS (loop0): SB summary counter sanity check failed
> > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> > > > 	     +0x13b/0x460, xfs_sb block 0x0
> > > > XFS (loop0): Unmount and run xfs_repair
> > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > XFS (loop0): log mount/recovery failed: error -117
> > > > XFS (loop0): log mount failed
> > > > 
> > > > The cause of the problem is that during the log recovery process, incorrect
> > > > icount and ifree are recovered from the log and fail to pass the size check
> > > 
> > > Are you saying that the log contained a transaction in which ifree >
> > > icount?
> > 
> > Yes, this situation is possible. For example consider the following sequence:
> > 
> >  CPU0				    CPU1
> >  xfs_log_sb			    xfs_trans_unreserve_and_mod_sb
> >  ----------			    ------------------------------
> >  percpu_counter_sum(&mp->m_icount)
> > 				    percpu_counter_add(&mp->m_icount, idelta)
> > 				    percpu_counter_add_batch(&mp->m_icount,
> > 						  idelta, XFS_ICOUNT_BATCH)
> >  percpu_counter_sum(&mp->m_ifree)
> 
> What caused the xfs_log_sb() to be called? Very few things
> actually log the superblock this way at runtime - it's generally
> only logged directly like this when a feature bit changes during a
> transaction (rare) or at a synchronisation point when everything
> else is idle and there's no chance of a race like this occurring...
> 
> I can see a couple of routes to this occurring via feature bit
> modification, but I don't see them being easy to hit or something
> that would exist for very long in the journal. Hence I'm wondering
> if there should be runtime protection for xfs_log_sb() to avoid
> these problems....
> 
> > > > in xfs_validate_sb_write().
> > > > 
> > > > With lazysbcount is enabled, There is no additional lock protection for
> > > > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > > > the m_ifree between the read m_icount and the m_ifree, this will make the
> > > > m_ifree larger than m_icount and written to the log. If we have an unclean
> > > > shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> > > > the counters from the AGF block counts, and the correction is later than
> > > > log recovery. During log recovery, incorrect ifree/icount may be restored
> > > > from the log and written to the super block, since ifree and icount have
> > > > not been corrected at this time, the relationship between ifree and icount
> > > > cannot be checked in xfs_validate_sb_write().
> > > > 
> > > > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > > > when lazysbcount is enabled.
> > > 
> > > Um, doesn't that neuter a sanity check on all V5 filesystems?
> >
> > Yes, such modifications don't look like the best way, all sb write checks 
> > will be affect. Maybe it can increase the judgment of clean mount and reduce
> > the scope of influence, but this requires setting the XFS_OPSTATE_CLEAN after
> > re-initialise incore superblock counters, like this:
> 
> I'm not sure that silencing the warning and continuing log recovery
> with an invalid accounting state is the best way to proceed. We
> haven't yet recovered unlinked inodes at the time this warning
> fires. If ifree really is larger than icount, then we've got a
> real problem at this point in log recovery.
> 
> Hence I suspect that we should be looking at fixing the runtime race
> condition that leads to the problem, not trying to work around
> inconsistency in the recovery code.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
  2022-10-24  4:07       ` Darrick J. Wong
@ 2022-10-24  5:43         ` Dave Chinner
  2022-10-24 12:28           ` Long Li
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2022-10-24  5:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Long Li, Dave Chinner, Chandan Babu R, Eric Sandeen,
	Bill O'Donnell, linux-xfs, yi.zhang, houtao1, guoxuenan

On Sun, Oct 23, 2022 at 09:07:42PM -0700, Darrick J. Wong wrote:
> On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> > On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > > the following problems:
> > 
> > We've had lazy sb counters for 15 years and just about every XFS
> > filesystem in production uses them, so providing us with some idea
> > of the scope of the problem and how to reproduce it would be greatly
> > appreciated.
> > 
> > What stress test are you running? What filesystem config does it
> > manifest on (other than lazysbcount=1)?  How long does the stress
> > test run for, and where/why does log recovery get run in this stress
> > test?
> > 
> > > > > XFS (loop0): SB summary counter sanity check failed
> > > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> > > > > 	     +0x13b/0x460, xfs_sb block 0x0
> > > > > XFS (loop0): Unmount and run xfs_repair
> > > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > > XFS (loop0): log mount/recovery failed: error -117
> > > > > XFS (loop0): log mount failed
> > > > > 
> > > > > The cause of the problem is that during the log recovery process, incorrect
> > > > > icount and ifree are recovered from the log and fail to pass the size check
> > > > 
> > > > Are you saying that the log contained a transaction in which ifree >
> > > > icount?
> > > 
> > > Yes, this situation is possible. For example consider the following sequence:
> > > 
> > >  CPU0				    CPU1
> > >  xfs_log_sb			    xfs_trans_unreserve_and_mod_sb
> > >  ----------			    ------------------------------
> > >  percpu_counter_sum(&mp->m_icount)
> > > 				    percpu_counter_add(&mp->m_icount, idelta)
> > > 				    percpu_counter_add_batch(&mp->m_icount,
> > > 						  idelta, XFS_ICOUNT_BATCH)
> > >  percpu_counter_sum(&mp->m_ifree)
> > 
> > What caused the xfs_log_sb() to be called? Very few things
> > actually log the superblock this way at runtime - it's generally
> > only logged directly like this when a feature bit changes during a
> > transaction (rare) or at a synchronisation point when everything
> > else is idle and there's no chance of a race like this occurring...
> > 
> > I can see a couple of routes to this occurring via feature bit
> > modification, but I don't see them being easy to hit or something
> > that would exist for very long in the journal. Hence I'm wondering
> > if there should be runtime protection for xfs_log_sb() to avoid
> > these problems....
> 
> Maybe.  Or perhaps we sample m_i{count,free} until they come up with a
> value that will pass the verifier, and only then log the new values to
> the primary super xfs_buf?

I suspect the simplest thing to do is this:

	mp->m_sb.sb_ifree = min_t(uint64_t, percpu_counter_sum(&mp->m_ifree),
				mp->m_sb.sb_icount);

That way ifree will never be logged as being greater than icount.
Neither icount nor ifree will be accurate if we are racing with other
updates, but it will guarantee that what we write to the journal
won't trigger corruption warnings.
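
(min_t(uint64_t, x, y) is the kernel's type-checked min: both
expressions are evaluated as uint64_t and the smaller value is returned,
so sb_ifree is clamped to the sb_icount value sampled just before it.)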

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
  2022-10-24  5:43         ` Dave Chinner
@ 2022-10-24 12:28           ` Long Li
  0 siblings, 0 replies; 17+ messages in thread
From: Long Li @ 2022-10-24 12:28 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong
  Cc: Dave Chinner, Chandan Babu R, Eric Sandeen, Bill O'Donnell,
	linux-xfs, yi.zhang, houtao1, guoxuenan

On Mon, Oct 24, 2022 at 04:43:45PM +1100, Dave Chinner wrote:
> On Sun, Oct 23, 2022 at 09:07:42PM -0700, Darrick J. Wong wrote:
> > On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> > > On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > > > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > > > the following problems:
> > > 
> > > We've had lazy sb counters for 15 years and just about every XFS
> > > filesystem in production uses them, so providing us with some idea
> > > of the scope of the problem and how to reproduce it would be greatly
> > > appreciated.
> > > 
> > > What stress test are you running? What filesystem config does it
> > > manifest on (other than lazysbcount=1)?  How long does the stress
> > > test run for, and where/why does log recovery get run in this stress
> > > test?
> > > 
> > > > > > XFS (loop0): SB summary counter sanity check failed
> > > > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> > > > > > 	     +0x13b/0x460, xfs_sb block 0x0
> > > > > > XFS (loop0): Unmount and run xfs_repair
> > > > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > > > XFS (loop0): log mount/recovery failed: error -117
> > > > > > XFS (loop0): log mount failed
> > > > > > 
> > > > > > The cause of the problem is that during the log recovery process, incorrect
> > > > > > icount and ifree are recovered from the log and fail to pass the size check
> > > > > 
> > > > > Are you saying that the log contained a transaction in which ifree >
> > > > > icount?
> > > > 
> > > > Yes, this situation is possible. For example consider the following sequence:
> > > > 
> > > >  CPU0				    CPU1
> > > >  xfs_log_sb			    xfs_trans_unreserve_and_mod_sb
> > > >  ----------			    ------------------------------
> > > >  percpu_counter_sum(&mp->m_icount)
> > > > 				    percpu_counter_add(&mp->m_icount, idelta)
> > > > 				    percpu_counter_add_batch(&mp->m_icount,
> > > > 						  idelta, XFS_ICOUNT_BATCH)
> > > >  percpu_counter_sum(&mp->m_ifree)
> > > 
> > > What caused the xfs_log_sb() to be called? Very few things
> > > actually log the superblock this way at runtime - it's generally
> > > only logged directly like this when a feature bit changes during a
> > > transaction (rare) or at a synchronisation point when everything
> > > else is idle and there's no chance of a race like this occurring...
> > > 
> > > I can see a couple of routes to this occurring via feature bit
> > > modification, but I don't see them being easy to hit or something
> > > that would exist for very long in the journal. Hence I'm wondering
> > > if there should be runtime protection for xfs_log_sb() to avoid
> > > these problems....
> > 
> > Maybe.  Or perhaps we sample m_i{count,free} until they come up with a
> > value that will pass the verifier, and only then log the new values to
> > the primary super xfs_buf?
> 
> I suspect the simplest thing to do is this:
> 
> 	mp->m_sb.sb_ifree = min_t(uint64_t, percpu_counter_sum(&mp->m_ifree),
> 				mp->m_sb.sb_icount);
> 
> That way ifree will never be logged as being greater than icount.
> Neither icount or ifree will be accurate if we are racing with other
> updates, but it will guarantee that what we write to the journal
> won't trigger corruption warnings.
> 

I agree, this is the simplest and cleanest fix, with no wider impact:
it addresses the problem at the point where it occurs rather than
working around it after the fact. I will resend the patch with the
reproducer script attached. Thanks for your advice.

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


* [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-22  2:03 [PATCH v1] xfs: fix sb write verify for lazysbcount Long Li
  2022-10-22  2:14 ` Darrick J. Wong
@ 2022-10-25  9:15 ` Long Li
  2022-10-25 18:16   ` Darrick J. Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Long Li @ 2022-10-25  9:15 UTC (permalink / raw)
  To: leo.lilong
  Cc: billodo, chandan.babu, dchinner, djwong, guoxuenan, houtao1,
	linux-xfs, sandeen, yi.zhang

When lazysbcount is enabled, an fsstress plus loop mount/unmount test
reports the following problems:

XFS (loop0): SB summary counter sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
	xfs_sb block 0x0
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed

This makes the filesystem unmountable. The cause of the problem is
that during log recovery, incorrect counters (ifree > icount) are
recovered from the log and fail the boundary check in
xfs_validate_sb_write(). The following script can reproduce the
problem, but it may take a long time.

device=/dev/sda
testdir=/mnt/test
round=0

function fail()
{
	echo "$*"
	exit 1
}

mkdir -p $testdir
while [ $round -lt 10000 ]
do
	echo "******* round $round ********"
	mkfs.xfs -f $device
	mount $device $testdir || fail "mount failed!"
	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
	sleep 4
	killall -w fsstress
	umount $testdir
	xfs_repair -e $device > /dev/null
	if [ $? -eq 2 ];then
		echo "ERR CODE 2: Dirty log exception during repair."
		exit 1
	fi
	round=$(($round+1))
done

With lazysbcount enabled, there is no additional lock protection for
reading m_ifree and m_icount in xfs_log_sb(). If another CPU modifies
m_ifree concurrently, m_ifree can become greater than m_icount and be
written to the log. For example, consider the following sequence:

 CPU0				 CPU1
 xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
 ----------			 ------------------------------
 percpu_counter_sum(&mp->m_icount)
				 percpu_counter_add(&mp->m_icount, idelta)
				 percpu_counter_add_batch(&mp->m_icount,
						idelta, XFS_ICOUNT_BATCH)
 percpu_counter_sum(&mp->m_ifree)

After an unclean shutdown, this is corrected by
xfs_initialize_perag_data() rebuilding the counters from the AGF block
counts, but that correction happens after log recovery. During log
recovery, the incorrect ifree/icount may be restored from the log and
written to the superblock; since ifree and icount have not yet been
corrected at that point, the sb write check fails because ifree >
icount.

Guarantee that ifree will never be logged as being greater than
icount. Neither icount nor ifree will be accurate if we are racing
with other updates, but this guarantees that what we write to the
journal won't trigger corruption warnings.

Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
v2:
- Add a script that can reproduce the problem
- Guarantee that ifree will never be logged as being greater than icount

 fs/xfs/libxfs/xfs_sb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a20cade590e9..1eeecf2eb2a7 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -972,7 +972,9 @@ xfs_log_sb(
 	 */
 	if (xfs_has_lazysbcount(mp)) {
 		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
-		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
+		mp->m_sb.sb_ifree = min_t(uint64_t,
+				percpu_counter_sum(&mp->m_ifree),
+				mp->m_sb.sb_icount);
 		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
 	}
 
-- 
2.31.1



* Re: [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-25  9:15 ` [PATCH v2] " Long Li
@ 2022-10-25 18:16   ` Darrick J. Wong
  2022-10-26  9:13     ` Long Li
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2022-10-25 18:16 UTC (permalink / raw)
  To: Long Li
  Cc: billodo, chandan.babu, dchinner, guoxuenan, houtao1, linux-xfs,
	sandeen, yi.zhang

On Tue, Oct 25, 2022 at 05:15:27PM +0800, Long Li wrote:
> When lazysbcount is enabled, fsstress and loop mount/unmount test report
> the following problems:
> 
> XFS (loop0): SB summary counter sanity check failed
> XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
> 	xfs_sb block 0x0
> XFS (loop0): Unmount and run xfs_repair
> XFS (loop0): First 128 bytes of corrupted metadata buffer:
> 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> XFS (loop0): log mount/recovery failed: error -117
> XFS (loop0): log mount failed
> 
> This will make the file system unmountable, the cause of the problem is
> that during the log recovery process, incorrect count (ifree > icount)
> are recovered from the log and fail to pass the boundary check in
> xfs_validate_sb_write(). The following script can reproduce the problem,
> but it may take a long time.
> 
> device=/dev/sda
> testdir=/mnt/test
> round=0
> 
> function fail()
> {
> 	echo "$*"
> 	exit 1
> }
> 
> mkdir -p $testdir
> while [ $round -lt 10000 ]
> do
> 	echo "******* round $round ********"
> 	mkfs.xfs -f $device
> 	mount $device $testdir || fail "mount failed!"
> 	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &

What is the backtrace of the xfs_log_sb caller?  I speculate that it's
something along the lines of adding a superblock feature?  attr2 would
be my guess since this is fsstress.

So the other racing thread would be a thread that just freed an inode
cluster, committed the transaction, and now it's committing idelta and
ifreedelta into the incore percpu counters via:

	if (idelta)
		percpu_counter_add_batch(&mp->m_icount, idelta,
					 XFS_ICOUNT_BATCH);

	if (ifreedelta)
		percpu_counter_add(&mp->m_ifree, ifreedelta);

> 	sleep 4
> 	killall -w fsstress
> 	umount $testdir
> 	xfs_repair -e $device > /dev/null
> 	if [ $? -eq 2 ];then
> 		echo "ERR CODE 2: Dirty log exception during repair."
> 		exit 1
> 	fi
> 	round=$(($round+1))
> done
> 
> With lazysbcount is enabled, There is no additional lock protection for
> reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the
> m_ifree, this will make the m_ifree greater than m_icount and written to
> the log. For example consider the following sequence:
> 
>  CPU0				 CPU1
>  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
>  ----------			 ------------------------------
>  percpu_counter_sum(&mp->m_icount)
> 				 percpu_counter_add(&mp->m_icount, idelta)

This callsite does not exist ^^^^^^^^^^^ in the codebase, AFAICT.

> 				 percpu_counter_add_batch(&mp->m_icount,
> 						idelta, XFS_ICOUNT_BATCH)
>  percpu_counter_sum(&mp->m_ifree)

I think what's happening here is more like:

1. CPU1 adds a negative idelta to m_icount.
2. CPU0 sums m_icount.
3. CPU0 sums m_ifree.
4. CPU1 adds a negative ifreedelta to m_ifree.

Now CPU0 has an ifree > icount, which it writes into the primary
superblock buffer.  Eventually the AIL writes the buffer to disk, only
the write verifier trips over icount < ifree and shuts down the fs.
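
For example (hypothetical numbers): start with icount 128 and ifree
120, and let CPU1 free a 64-inode cluster. Step 1 drops icount to 64;
CPU0 then samples icount 64 (step 2) and ifree 120 (step 3), so the
snapshot says ifree > icount even though the real counters never
violated the invariant.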

> If we have an unclean shutdown, this will be corrected by
> xfs_initialize_perag_data() rebuilding the counters from the AGF block
> counts, and the correction is later than log recovery. During log recovery,
> incorrect ifree/icount may be restored from the log and written sb, since
> ifree and icount have not been corrected at this time, sb write check
> will fail due to ifree > icount.
> 
> Guaranteed that ifree will never be logged as being greater than icount.
> Neither icount or ifree will be accurate if we are racing with other
> updates, but it will guarantee that what we write to the journal
> won't trigger corruption warnings.
> 
> Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
> v2:
> - Add scripts that could reproduce the problem
> - Guaranteed that ifree will never be logged as being greater than icount
> 
>  fs/xfs/libxfs/xfs_sb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index a20cade590e9..1eeecf2eb2a7 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -972,7 +972,9 @@ xfs_log_sb(
>  	 */
>  	if (xfs_has_lazysbcount(mp)) {
>  		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
> -		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
> +		mp->m_sb.sb_ifree = min_t(uint64_t,
> +				percpu_counter_sum(&mp->m_ifree),
> +				mp->m_sb.sb_icount);

This part looks plausible, but I think xfs_unmountfs really ought to
check that m_ifree < m_icount after it's quiesced the rest of the
filesystem and freed the reserve block pool.  If ifree is still larger
than icount, someone has corrupted the incore counters, so we should not
write a clean unmount record.
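
Something like this, perhaps (a hypothetical sketch, not a tested
patch):

	/* Hypothetical: don't write a clean unmount record if the
	 * quiesced incore counters are still inconsistent. */
	if (percpu_counter_sum(&mp->m_ifree) >
	    percpu_counter_sum(&mp->m_icount)) {
		xfs_alert(mp, "ifree exceeds icount at unmount");
		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
	}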

--D

>  		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
>  	}
>  
> -- 
> 2.31.1
> 


* Re: [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-25 18:16   ` Darrick J. Wong
@ 2022-10-26  9:13     ` Long Li
  2022-10-26 18:52       ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2022-10-26  9:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: billodo, chandan.babu, dchinner, guoxuenan, houtao1, linux-xfs,
	sandeen, yi.zhang

On Tue, Oct 25, 2022 at 11:16:39AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 25, 2022 at 05:15:27PM +0800, Long Li wrote:
> > When lazysbcount is enabled, fsstress and loop mount/unmount test report
> > the following problems:
> > 
> > XFS (loop0): SB summary counter sanity check failed
> > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
> > 	xfs_sb block 0x0
> > XFS (loop0): Unmount and run xfs_repair
> > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > XFS (loop0): log mount/recovery failed: error -117
> > XFS (loop0): log mount failed
> > 
> > This will make the file system unmountable, the cause of the problem is
> > that during the log recovery process, incorrect count (ifree > icount)
> > are recovered from the log and fail to pass the boundary check in
> > xfs_validate_sb_write(). The following script can reproduce the problem,
> > but it may take a long time.
> > 
> > device=/dev/sda
> > testdir=/mnt/test
> > round=0
> > 
> > function fail()
> > {
> > 	echo "$*"
> > 	exit 1
> > }
> > 
> > mkdir -p $testdir
> > while [ $round -lt 10000 ]
> > do
> > 	echo "******* round $round ********"
> > 	mkfs.xfs -f $device
> > 	mount $device $testdir || fail "mount failed!"
> > 	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
> 
> What is the backtrace of the xfs_log_sb caller?  I speculate that it's
> something along the lines of adding a superblock feature?  attr2 would
> be my guess since this is fsstress.

Here is the call trace I reproduced:
 Call Trace:
  <TASK>
  dump_stack_lvl+0x4d/0x66
  xfs_log_sb.cold+0x2f/0x1af
  xfs_bmap_add_attrfork+0x687/0xb40
  ? get_reg+0x91/0x190
  ? xfs_bmap_add_attrfork+0x0/0xb40
  ? unwind_next_frame+0x115d/0x1b70
  ? xfs_attr_calc_size+0x13c/0x2e0
  xfs_attr_set+0xb51/0x1d50
  ? __kernel_text_address+0xe/0x30
  ? xfs_attr_set+0x0/0x1d50
  ? __kernel_text_address+0xe/0x30
  ? unwind_get_return_address+0x5f/0xa0
  ? stack_trace_consume_entry+0x0/0x160
  ? arch_stack_walk+0x98/0xf0
  xfs_attr_change+0x22d/0x380
  xfs_xattr_set+0xeb/0x160
  ? xfs_xattr_set+0x0/0x160
  ? vmemdup_user+0x27/0xa0
  ? setxattr_copy+0x103/0x1a0
  ? setxattr+0xd1/0x160
  ? path_setxattr+0x168/0x190
  ? __x64_sys_setxattr+0xc5/0x160
  ? xattr_resolve_name+0x23d/0x360
  ? xfs_xattr_set+0x0/0x160
  __vfs_setxattr+0x100/0x160
  ? __vfs_setxattr+0x0/0x160
  __vfs_setxattr_noperm+0x104/0x320
  __vfs_setxattr_locked+0x1ba/0x260

> 
> So the other racing thread would be a thread that just freed an inode
> cluster, committed the transaction, and now it's committing idelta and
> ifreedelta into the incore percpu counters via:
> 
> 	if (idelta)
> 		percpu_counter_add_batch(&mp->m_icount, idelta,
> 					 XFS_ICOUNT_BATCH);
> 
> 	if (ifreedelta)
> 		percpu_counter_add(&mp->m_ifree, ifreedelta);
> 
> > 	sleep 4
> > 	killall -w fsstress
> > 	umount $testdir
> > 	xfs_repair -e $device > /dev/null
> > 	if [ $? -eq 2 ];then
> > 		echo "ERR CODE 2: Dirty log exception during repair."
> > 		exit 1
> > 	fi
> > 	round=$(($round+1))
> > done
> > 
> > When lazysbcount is enabled, there is no additional lock protection for
> > reading m_ifree and m_icount in xfs_log_sb(); if another CPU modifies
> > m_ifree between the two sums, the summed m_ifree can end up greater
> > than the summed m_icount and be written to the log. For example,
> > consider the following sequence:
> > 
> >  CPU0				 CPU1
> >  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
> >  ----------			 ------------------------------
> >  percpu_counter_sum(&mp->m_icount)
> > 				 percpu_counter_add(&mp->m_icount, idelta)
> 
> This callsite does not exist ^^^^^^^^^^^ in the codebase, AFAICT.
> 
> > 				 percpu_counter_add_batch(&mp->m_icount,
> > 						idelta, XFS_ICOUNT_BATCH)
> >  percpu_counter_sum(&mp->m_ifree)

Sorry, the code I copied was wrong; it should be:

 CPU0				 CPU1
 xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
 ----------			 ------------------------------
 percpu_counter_sum(&mp->m_icount)
				 percpu_counter_add_batch(&mp->m_icount,
						idelta, XFS_ICOUNT_BATCH)
				 percpu_counter_add(&mp->m_ifree, ifreedelta);
 percpu_counter_sum(&mp->m_ifree)

> 
> I think what's happening here is more like:
> 
> 1. CPU1 adds a negative idelta to m_icount.
> 2. CPU0 sums m_icount.
> 3. CPU0 sums m_ifree.
> 4. CPU1 adds a negative ideltafree to m_ifree.

I tried to reproduce the situation you described, but haven't been
able to yet. Only the following sequence has been reproduced:

1. CPU0 sums m_icount.
2. CPU1 adds a positive idelta (e.g. 32) to m_icount.
3. CPU1 adds a positive ideltafree (e.g. 32) to m_ifree.
4. CPU0 sums m_ifree. 
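
To make that window concrete, here is a minimal userspace sketch (my own
illustration, not XFS code; the starting values and deltas are made up)
of the same pattern: one thread bumps both counters by equal amounts in
order, mimicking xfs_trans_unreserve_and_mod_sb(), while another thread
sums icount and then ifree with no lock around the pair, mimicking
xfs_log_sb():

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Made-up starting state: 96 inodes allocated, all of them free. */
static atomic_int_least64_t icount = 96;
static atomic_int_least64_t ifree = 96;

/* Mimic inode chunk allocation: icount += 32, then ifree += 32. */
static void *allocator(void *arg)
{
	for (int i = 0; i < 1000000; i++) {
		atomic_fetch_add(&icount, 32);
		atomic_fetch_add(&ifree, 32);
	}
	return NULL;
}

/* Mimic xfs_log_sb(): sum icount first, then ifree, with no lock. */
static void *logger(void *arg)
{
	for (int i = 0; i < 1000000; i++) {
		int64_t ic = atomic_load(&icount);
		int64_t ifr = atomic_load(&ifree);

		if (ifr > ic)
			printf("raced: ifree %lld > icount %lld\n",
					(long long)ifr, (long long)ic);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, l;

	pthread_create(&a, NULL, allocator, NULL);
	pthread_create(&l, NULL, logger, NULL);
	pthread_join(a, NULL);
	pthread_join(l, NULL);
	return 0;
}

Note that ifree <= icount holds at every instant in the allocator, yet
the unlocked pair of sums in the logger still observes ifree > icount,
which is exactly the sequence above.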

> 
> Now CPU0 has an ifree > icount, which it writes into the primary
> superblock buffer.  Eventually the AIL writes the buffer to disk, only
> the write verifier trips over icount < ifree and shuts down the fs.
> 
> > If we have an unclean shutdown, this will be corrected by
> > xfs_initialize_perag_data() rebuilding the counters from the AGF block
> > counts, but that correction happens after log recovery. During log
> > recovery, incorrect ifree/icount may be restored from the log and
> > written to the sb; since ifree and icount have not been corrected at
> > this point, the sb write check will fail due to ifree > icount.
> > 
> > Guarantee that ifree is never logged as being greater than icount.
> > Neither icount nor ifree will be accurate if we are racing with other
> > updates, but this guarantees that what we write to the journal won't
> > trigger corruption warnings.
> > 
> > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> > v2:
> > - Add scripts that could reproduce the problem
> > - Guarantee that ifree will never be logged as being greater than icount
> > 
> >  fs/xfs/libxfs/xfs_sb.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > index a20cade590e9..1eeecf2eb2a7 100644
> > --- a/fs/xfs/libxfs/xfs_sb.c
> > +++ b/fs/xfs/libxfs/xfs_sb.c
> > @@ -972,7 +972,9 @@ xfs_log_sb(
> >  	 */
> >  	if (xfs_has_lazysbcount(mp)) {
> >  		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
> > -		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
> > +		mp->m_sb.sb_ifree = min_t(uint64_t,
> > +				percpu_counter_sum(&mp->m_ifree),
> > +				mp->m_sb.sb_icount);
> 
> This part looks plausible, but I think xfs_unmountfs really ought to
> check that m_ifree < m_icount after it's quiesced the rest of the
> filesystem and freed the reserve block pool.  If ifree is still larger
> than icount, someone has corrupted the incore counters, so we should not
> write a clean unmount record.
> 
> --D
> 
> >  		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
> >  	}
> >  
> > -- 
> > 2.31.1
> > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-26  9:13     ` Long Li
@ 2022-10-26 18:52       ` Darrick J. Wong
  2022-10-27 13:25         ` Long Li
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2022-10-26 18:52 UTC (permalink / raw)
  To: Long Li
  Cc: billodo, chandan.babu, dchinner, guoxuenan, houtao1, linux-xfs,
	sandeen, yi.zhang

On Wed, Oct 26, 2022 at 05:13:44PM +0800, Long Li wrote:
> On Tue, Oct 25, 2022 at 11:16:39AM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 25, 2022 at 05:15:27PM +0800, Long Li wrote:
> > > When lazysbcount is enabled, fsstress and loop mount/unmount tests report
> > > the following problem:
> > > 
> > > XFS (loop0): SB summary counter sanity check failed
> > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
> > > 	xfs_sb block 0x0
> > > XFS (loop0): Unmount and run xfs_repair
> > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > XFS (loop0): log mount/recovery failed: error -117
> > > XFS (loop0): log mount failed
> > > 
> > > This makes the file system unmountable; the cause of the problem is
> > > that, during log recovery, incorrect counts (ifree > icount) are
> > > recovered from the log and fail to pass the boundary check in
> > > xfs_validate_sb_write(). The following script can reproduce the problem,
> > > but it may take a long time.
> > > 
> > > device=/dev/sda
> > > testdir=/mnt/test
> > > round=0
> > > 
> > > function fail()
> > > {
> > > 	echo "$*"
> > > 	exit 1
> > > }
> > > 
> > > mkdir -p $testdir
> > > while [ $round -lt 10000 ]
> > > do
> > > 	echo "******* round $round ********"
> > > 	mkfs.xfs -f $device
> > > 	mount $device $testdir || fail "mount failed!"
> > > 	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
> > 
> > What is the backtrace of the xfs_log_sb caller?  I speculate that it's
> > something along the lines of adding a superblock feature?  attr2 would
> > be my guess since this is fsstress.
> 
> The call trace that I reproduced:
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x4d/0x66
>   xfs_log_sb.cold+0x2f/0x1af
>   xfs_bmap_add_attrfork+0x687/0xb40
>   ? get_reg+0x91/0x190
>   ? xfs_bmap_add_attrfork+0x0/0xb40
>   ? unwind_next_frame+0x115d/0x1b70
>   ? xfs_attr_calc_size+0x13c/0x2e0
>   xfs_attr_set+0xb51/0x1d50
>   ? __kernel_text_address+0xe/0x30
>   ? xfs_attr_set+0x0/0x1d50
>   ? __kernel_text_address+0xe/0x30
>   ? unwind_get_return_address+0x5f/0xa0
>   ? stack_trace_consume_entry+0x0/0x160
>   ? arch_stack_walk+0x98/0xf0
>   xfs_attr_change+0x22d/0x380
>   xfs_xattr_set+0xeb/0x160
>   ? xfs_xattr_set+0x0/0x160
>   ? vmemdup_user+0x27/0xa0
>   ? setxattr_copy+0x103/0x1a0
>   ? setxattr+0xd1/0x160
>   ? path_setxattr+0x168/0x190
>   ? __x64_sys_setxattr+0xc5/0x160
>   ? xattr_resolve_name+0x23d/0x360
>   ? xfs_xattr_set+0x0/0x160
>   __vfs_setxattr+0x100/0x160
>   ? __vfs_setxattr+0x0/0x160
>   __vfs_setxattr_noperm+0x104/0x320
>   __vfs_setxattr_locked+0x1ba/0x260
> 
> > 
> > So the other racing thread would be a thread that just freed an inode
> > cluster, committed the transaction, and now it's committing idelta and
> > ifreedelta into the incore percpu counters via:
> > 
> > 	if (idelta)
> > 		percpu_counter_add_batch(&mp->m_icount, idelta,
> > 					 XFS_ICOUNT_BATCH);
> > 
> > 	if (ifreedelta)
> > 		percpu_counter_add(&mp->m_ifree, ifreedelta);
> > 
> > > 	sleep 4
> > > 	killall -w fsstress
> > > 	umount $testdir
> > > 	xfs_repair -e $device > /dev/null
> > > 	if [ $? -eq 2 ];then
> > > 		echo "ERR CODE 2: Dirty log exception during repair."
> > > 		exit 1
> > > 	fi
> > > 	round=$(($round+1))
> > > done
> > > 
> > > When lazysbcount is enabled, there is no additional lock protection for
> > > reading m_ifree and m_icount in xfs_log_sb(); if another CPU modifies
> > > m_ifree between the two sums, the summed m_ifree can end up greater
> > > than the summed m_icount and be written to the log. For example,
> > > consider the following sequence:
> > > 
> > >  CPU0				 CPU1
> > >  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
> > >  ----------			 ------------------------------
> > >  percpu_counter_sum(&mp->m_icount)
> > > 				 percpu_counter_add(&mp->m_icount, idelta)
> > 
> > This callsite does not exist ^^^^^^^^^^^ in the codebase, AFAICT.
> > 
> > > 				 percpu_counter_add_batch(&mp->m_icount,
> > > 						idelta, XFS_ICOUNT_BATCH)
> > >  percpu_counter_sum(&mp->m_ifree)
> 
> Sorry, the code I copied was wrong; it should be:
> 
>  CPU0				 CPU1
>  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
>  ----------			 ------------------------------
>  percpu_counter_sum(&mp->m_icount)
> 				 percpu_counter_add_batch(&mp->m_icount,
> 						idelta, XFS_ICOUNT_BATCH)
> 				 percpu_counter_add(&mp->m_ifree, ifreedelta);
>  percpu_counter_sum(&mp->m_ifree)
> 
> > 
> > I think what's happening here is more like:
> > 
> > 1. CPU1 adds a negative idelta to m_icount.
> > 2. CPU0 sums m_icount.
> > 3. CPU0 sums m_ifree.
> > 4. CPU1 adds a negative ideltafree to m_ifree.
> 
> I tried to reproduce the situation you described, but haven't been
> able to yet. Only the following sequence has been reproduced:
> 
> 1. CPU0 sums m_icount.
> 2. CPU1 adds a positive idelta (e.g. 32) to m_icount.
> 3. CPU1 adds a positive ideltafree (e.g. 32) to m_ifree.
> 4. CPU0 sums m_ifree. 

Aha, that was my second guess as to what was really going on.

Either way, we're racing with updates to two percpu counters.
Now that the source of the bug has been clarified...

> > Now CPU0 has an ifree > icount, which it writes into the primary
> > superblock buffer.  Eventually the AIL writes the buffer to disk, only
> > the write verifier trips over icount < ifree and shuts down the fs.
> > 
> > > If we have an unclean shutdown, this will be corrected by
> > > xfs_initialize_perag_data() rebuilding the counters from the AGF block
> > > counts, but that correction happens after log recovery. During log
> > > recovery, incorrect ifree/icount may be restored from the log and
> > > written to the sb; since ifree and icount have not been corrected at
> > > this point, the sb write check will fail due to ifree > icount.
> > > 
> > > Guarantee that ifree is never logged as being greater than icount.
> > > Neither icount nor ifree will be accurate if we are racing with other
> > > updates, but this guarantees that what we write to the journal won't
> > > trigger corruption warnings.
> > > 
> > > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > > ---
> > > v2:
> > > - Add scripts that could reproduce the problem
> > > - Guarantee that ifree will never be logged as being greater than icount
> > > 
> > >  fs/xfs/libxfs/xfs_sb.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > > index a20cade590e9..1eeecf2eb2a7 100644
> > > --- a/fs/xfs/libxfs/xfs_sb.c
> > > +++ b/fs/xfs/libxfs/xfs_sb.c
> > > @@ -972,7 +972,9 @@ xfs_log_sb(
> > >  	 */
> > >  	if (xfs_has_lazysbcount(mp)) {
> > >  		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
> > > -		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
> > > +		mp->m_sb.sb_ifree = min_t(uint64_t,
> > > +				percpu_counter_sum(&mp->m_ifree),
> > > +				mp->m_sb.sb_icount);
> > 
> > This part looks plausible, but I think xfs_unmountfs really ought to
> > check that m_ifree < m_icount after it's quiesced the rest of the
> > filesystem and freed the reserve block pool.  If ifree is still larger
> > than icount, someone has corrupted the incore counters, so we should not
> > write a clean unmount record.

...please update the patch to include this sanity check at unmount so
that I can get this bugfix moving towards upstream.

--D

> > 
> > --D
> > 
> > >  		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
> > >  	}
> > >  
> > > -- 
> > > 2.31.1
> > > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-26 18:52       ` Darrick J. Wong
@ 2022-10-27 13:25         ` Long Li
  2022-10-27 16:05           ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2022-10-27 13:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: billodo, chandan.babu, dchinner, guoxuenan, houtao1, linux-xfs,
	sandeen, yi.zhang

On Wed, Oct 26, 2022 at 11:52:29AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 26, 2022 at 05:13:44PM +0800, Long Li wrote:
> > On Tue, Oct 25, 2022 at 11:16:39AM -0700, Darrick J. Wong wrote:
> > > On Tue, Oct 25, 2022 at 05:15:27PM +0800, Long Li wrote:
> > > > When lazysbcount is enabled, fsstress and loop mount/unmount tests report
> > > > the following problem:
> > > > 
> > > > XFS (loop0): SB summary counter sanity check failed
> > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
> > > > 	xfs_sb block 0x0
> > > > XFS (loop0): Unmount and run xfs_repair
> > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > XFS (loop0): log mount/recovery failed: error -117
> > > > XFS (loop0): log mount failed
> > > > 
> > > > This makes the file system unmountable; the cause of the problem is
> > > > that, during log recovery, incorrect counts (ifree > icount) are
> > > > recovered from the log and fail to pass the boundary check in
> > > > xfs_validate_sb_write(). The following script can reproduce the problem,
> > > > but it may take a long time.
> > > > 
> > > > device=/dev/sda
> > > > testdir=/mnt/test
> > > > round=0
> > > > 
> > > > function fail()
> > > > {
> > > > 	echo "$*"
> > > > 	exit 1
> > > > }
> > > > 
> > > > mkdir -p $testdir
> > > > while [ $round -lt 10000 ]
> > > > do
> > > > 	echo "******* round $round ********"
> > > > 	mkfs.xfs -f $device
> > > > 	mount $device $testdir || fail "mount failed!"
> > > > 	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
> > > 
> > > What is the backtrace of the xfs_log_sb caller?  I speculate that it's
> > > something along the lines of adding a superblock feature?  attr2 would
> > > be my guess since this is fsstress.
> > 
> > The call trace that I reproduced:
> >  Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0x4d/0x66
> >   xfs_log_sb.cold+0x2f/0x1af
> >   xfs_bmap_add_attrfork+0x687/0xb40
> >   ? get_reg+0x91/0x190
> >   ? xfs_bmap_add_attrfork+0x0/0xb40
> >   ? unwind_next_frame+0x115d/0x1b70
> >   ? xfs_attr_calc_size+0x13c/0x2e0
> >   xfs_attr_set+0xb51/0x1d50
> >   ? __kernel_text_address+0xe/0x30
> >   ? xfs_attr_set+0x0/0x1d50
> >   ? __kernel_text_address+0xe/0x30
> >   ? unwind_get_return_address+0x5f/0xa0
> >   ? stack_trace_consume_entry+0x0/0x160
> >   ? arch_stack_walk+0x98/0xf0
> >   xfs_attr_change+0x22d/0x380
> >   xfs_xattr_set+0xeb/0x160
> >   ? xfs_xattr_set+0x0/0x160
> >   ? vmemdup_user+0x27/0xa0
> >   ? setxattr_copy+0x103/0x1a0
> >   ? setxattr+0xd1/0x160
> >   ? path_setxattr+0x168/0x190
> >   ? __x64_sys_setxattr+0xc5/0x160
> >   ? xattr_resolve_name+0x23d/0x360
> >   ? xfs_xattr_set+0x0/0x160
> >   __vfs_setxattr+0x100/0x160
> >   ? __vfs_setxattr+0x0/0x160
> >   __vfs_setxattr_noperm+0x104/0x320
> >   __vfs_setxattr_locked+0x1ba/0x260
> > 
> > > 
> > > So the other racing thread would be a thread that just freed an inode
> > > cluster, committed the transaction, and now it's committing idelta and
> > > ifreedelta into the incore percpu counters via:
> > > 
> > > 	if (idelta)
> > > 		percpu_counter_add_batch(&mp->m_icount, idelta,
> > > 					 XFS_ICOUNT_BATCH);
> > > 
> > > 	if (ifreedelta)
> > > 		percpu_counter_add(&mp->m_ifree, ifreedelta);
> > > 
> > > > 	sleep 4
> > > > 	killall -w fsstress
> > > > 	umount $testdir
> > > > 	xfs_repair -e $device > /dev/null
> > > > 	if [ $? -eq 2 ];then
> > > > 		echo "ERR CODE 2: Dirty log exception during repair."
> > > > 		exit 1
> > > > 	fi
> > > > 	round=$(($round+1))
> > > > done
> > > > 
> > > > When lazysbcount is enabled, there is no additional lock protection for
> > > > reading m_ifree and m_icount in xfs_log_sb(); if another CPU modifies
> > > > m_ifree between the two sums, the summed m_ifree can end up greater
> > > > than the summed m_icount and be written to the log. For example,
> > > > consider the following sequence:
> > > > 
> > > >  CPU0				 CPU1
> > > >  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
> > > >  ----------			 ------------------------------
> > > >  percpu_counter_sum(&mp->m_icount)
> > > > 				 percpu_counter_add(&mp->m_icount, idelta)
> > > 
> > > This callsite does not exist ^^^^^^^^^^^ in the codebase, AFAICT.
> > > 
> > > > 				 percpu_counter_add_batch(&mp->m_icount,
> > > > 						idelta, XFS_ICOUNT_BATCH)
> > > >  percpu_counter_sum(&mp->m_ifree)
> > 
> > Sorry, the code I copied was wrong; it should be:
> > 
> >  CPU0				 CPU1
> >  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
> >  ----------			 ------------------------------
> >  percpu_counter_sum(&mp->m_icount)
> > 				 percpu_counter_add_batch(&mp->m_icount,
> > 						idelta, XFS_ICOUNT_BATCH)
> > 				 percpu_counter_add(&mp->m_ifree, ifreedelta);
> >  percpu_counter_sum(&mp->m_ifree)
> > 
> > > 
> > > I think what's happening here is more like:
> > > 
> > > 1. CPU1 adds a negative idelta to m_icount.
> > > 2. CPU0 sums m_icount.
> > > 3. CPU0 sums m_ifree.
> > > 4. CPU1 adds a negative ideltafree to m_ifree.
> > 
> > I tried to reproduce the situation you described, but haven't been
> > able to yet. Only the following sequence has been reproduced:
> > 
> > 1. CPU0 sums m_icount.
> > 2. CPU1 adds a positive idelta (e.g. 32) to m_icount.
> > 3. CPU1 adds a positive ideltafree (e.g. 32) to m_ifree.
> > 4. CPU0 sums m_ifree. 
> 
> Aha, that was my second guess as to what was really going on.
> 
> Either way, we're racing with updates to two percpu counters.
> Now that the source of the bug has been clarified...
> 
> > > Now CPU0 has an ifree > icount, which it writes into the primary
> > > superblock buffer.  Eventually the AIL writes the buffer to disk, only
> > > the write verifier trips over icount < ifree and shuts down the fs.
> > > 
> > > > If we have an unclean shutdown, this will be corrected by
> > > > xfs_initialize_perag_data() rebuilding the counters from the AGF block
> > > > counts, but that correction happens after log recovery. During log
> > > > recovery, incorrect ifree/icount may be restored from the log and
> > > > written to the sb; since ifree and icount have not been corrected at
> > > > this point, the sb write check will fail due to ifree > icount.
> > > > 
> > > > Guarantee that ifree is never logged as being greater than icount.
> > > > Neither icount nor ifree will be accurate if we are racing with other
> > > > updates, but this guarantees that what we write to the journal won't
> > > > trigger corruption warnings.
> > > > 
> > > > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > > > ---
> > > > v2:
> > > > - Add scripts that could reproduce the problem
> > > > - Guarantee that ifree will never be logged as being greater than icount
> > > > 
> > > >  fs/xfs/libxfs/xfs_sb.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > > > index a20cade590e9..1eeecf2eb2a7 100644
> > > > --- a/fs/xfs/libxfs/xfs_sb.c
> > > > +++ b/fs/xfs/libxfs/xfs_sb.c
> > > > @@ -972,7 +972,9 @@ xfs_log_sb(
> > > >  	 */
> > > >  	if (xfs_has_lazysbcount(mp)) {
> > > >  		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
> > > > -		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
> > > > +		mp->m_sb.sb_ifree = min_t(uint64_t,
> > > > +				percpu_counter_sum(&mp->m_ifree),
> > > > +				mp->m_sb.sb_icount);
> > > 
> > > This part looks plausible, but I think xfs_unmountfs really ought to
> > > check that m_ifree < m_icount after it's quiesced the rest of the
> > > filesystem and freed the reserve block pool.  If ifree is still larger
> > > than icount, someone has corrupted the incore counters, so we should not
> > > write a clean unmount record.
> 
> ...please update the patch to include this sanity check at unmount so
> that I can get this bugfix moving towards upstream.

I have some questions about this. If we can guarantee that
m_ifree <= m_icount, why do we need to add a check at unmount?
On the other hand, if m_ifree > m_icount, an sb write will trigger
a shutdown in xfs_validate_sb_write() because the check does
not pass, and therefore a clean unmount record will not be written
at unmount. I also haven't found a place in the code suitable for
adding such a check.
> 
> --D
> 
> > > 
> > > --D
> > > 
> > > >  		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
> > > >  	}
> > > >  
> > > > -- 
> > > > 2.31.1
> > > > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-27 13:25         ` Long Li
@ 2022-10-27 16:05           ` Darrick J. Wong
  2022-10-29  7:16             ` Long Li
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2022-10-27 16:05 UTC (permalink / raw)
  To: Long Li
  Cc: billodo, chandan.babu, dchinner, guoxuenan, houtao1, linux-xfs,
	sandeen, yi.zhang

On Thu, Oct 27, 2022 at 09:25:04PM +0800, Long Li wrote:
> On Wed, Oct 26, 2022 at 11:52:29AM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 26, 2022 at 05:13:44PM +0800, Long Li wrote:
> > > On Tue, Oct 25, 2022 at 11:16:39AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Oct 25, 2022 at 05:15:27PM +0800, Long Li wrote:
> > > > > When lazysbcount is enabled, fsstress and loop mount/unmount tests report
> > > > > the following problem:
> > > > > 
> > > > > XFS (loop0): SB summary counter sanity check failed
> > > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
> > > > > 	xfs_sb block 0x0
> > > > > XFS (loop0): Unmount and run xfs_repair
> > > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > > XFS (loop0): log mount/recovery failed: error -117
> > > > > XFS (loop0): log mount failed
> > > > > 
> > > > > This makes the file system unmountable; the cause of the problem is
> > > > > that, during log recovery, incorrect counts (ifree > icount) are
> > > > > recovered from the log and fail to pass the boundary check in
> > > > > xfs_validate_sb_write(). The following script can reproduce the problem,
> > > > > but it may take a long time.
> > > > > 
> > > > > device=/dev/sda
> > > > > testdir=/mnt/test
> > > > > round=0
> > > > > 
> > > > > function fail()
> > > > > {
> > > > > 	echo "$*"
> > > > > 	exit 1
> > > > > }
> > > > > 
> > > > > mkdir -p $testdir
> > > > > while [ $round -lt 10000 ]
> > > > > do
> > > > > 	echo "******* round $round ********"
> > > > > 	mkfs.xfs -f $device
> > > > > 	mount $device $testdir || fail "mount failed!"
> > > > > 	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
> > > > 
> > > > What is the backtrace of the xfs_log_sb caller?  I speculate that it's
> > > > something along the lines of adding a superblock feature?  attr2 would
> > > > be my guess since this is fsstress.
> > > 
> > > The call trace that I reproduced:
> > >  Call Trace:
> > >   <TASK>
> > >   dump_stack_lvl+0x4d/0x66
> > >   xfs_log_sb.cold+0x2f/0x1af
> > >   xfs_bmap_add_attrfork+0x687/0xb40
> > >   ? get_reg+0x91/0x190
> > >   ? xfs_bmap_add_attrfork+0x0/0xb40
> > >   ? unwind_next_frame+0x115d/0x1b70
> > >   ? xfs_attr_calc_size+0x13c/0x2e0
> > >   xfs_attr_set+0xb51/0x1d50
> > >   ? __kernel_text_address+0xe/0x30
> > >   ? xfs_attr_set+0x0/0x1d50
> > >   ? __kernel_text_address+0xe/0x30
> > >   ? unwind_get_return_address+0x5f/0xa0
> > >   ? stack_trace_consume_entry+0x0/0x160
> > >   ? arch_stack_walk+0x98/0xf0
> > >   xfs_attr_change+0x22d/0x380
> > >   xfs_xattr_set+0xeb/0x160
> > >   ? xfs_xattr_set+0x0/0x160
> > >   ? vmemdup_user+0x27/0xa0
> > >   ? setxattr_copy+0x103/0x1a0
> > >   ? setxattr+0xd1/0x160
> > >   ? path_setxattr+0x168/0x190
> > >   ? __x64_sys_setxattr+0xc5/0x160
> > >   ? xattr_resolve_name+0x23d/0x360
> > >   ? xfs_xattr_set+0x0/0x160
> > >   __vfs_setxattr+0x100/0x160
> > >   ? __vfs_setxattr+0x0/0x160
> > >   __vfs_setxattr_noperm+0x104/0x320
> > >   __vfs_setxattr_locked+0x1ba/0x260
> > > 
> > > > 
> > > > So the other racing thread would be a thread that just freed an inode
> > > > cluster, committed the transaction, and now it's committing idelta and
> > > > ifreedelta into the incore percpu counters via:
> > > > 
> > > > 	if (idelta)
> > > > 		percpu_counter_add_batch(&mp->m_icount, idelta,
> > > > 					 XFS_ICOUNT_BATCH);
> > > > 
> > > > 	if (ifreedelta)
> > > > 		percpu_counter_add(&mp->m_ifree, ifreedelta);
> > > > 
> > > > > 	sleep 4
> > > > > 	killall -w fsstress
> > > > > 	umount $testdir
> > > > > 	xfs_repair -e $device > /dev/null
> > > > > 	if [ $? -eq 2 ];then
> > > > > 		echo "ERR CODE 2: Dirty log exception during repair."
> > > > > 		exit 1
> > > > > 	fi
> > > > > 	round=$(($round+1))
> > > > > done
> > > > > 
> > > > > When lazysbcount is enabled, there is no additional lock protection for
> > > > > reading m_ifree and m_icount in xfs_log_sb(); if another CPU modifies
> > > > > m_ifree between the two sums, the summed m_ifree can end up greater
> > > > > than the summed m_icount and be written to the log. For example,
> > > > > consider the following sequence:
> > > > > 
> > > > >  CPU0				 CPU1
> > > > >  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
> > > > >  ----------			 ------------------------------
> > > > >  percpu_counter_sum(&mp->m_icount)
> > > > > 				 percpu_counter_add(&mp->m_icount, idelta)
> > > > 
> > > > This callsite does not exist ^^^^^^^^^^^ in the codebase, AFAICT.
> > > > 
> > > > > 				 percpu_counter_add_batch(&mp->m_icount,
> > > > > 						idelta, XFS_ICOUNT_BATCH)
> > > > >  percpu_counter_sum(&mp->m_ifree)
> > > 
> > > Sorry, the code I copied was wrong; it should be:
> > > 
> > >  CPU0				 CPU1
> > >  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
> > >  ----------			 ------------------------------
> > >  percpu_counter_sum(&mp->m_icount)
> > > 				 percpu_counter_add_batch(&mp->m_icount,
> > > 						idelta, XFS_ICOUNT_BATCH)
> > > 				 percpu_counter_add(&mp->m_ifree, ifreedelta);
> > >  percpu_counter_sum(&mp->m_ifree)
> > > 
> > > > 
> > > > I think what's happening here is more like:
> > > > 
> > > > 1. CPU1 adds a negative idelta to m_icount.
> > > > 2. CPU0 sums m_icount.
> > > > 3. CPU0 sums m_ifree.
> > > > 4. CPU1 adds a negative ideltafree to m_ifree.
> > > 
> > > I tried to reproduce the situation you described, but haven't been
> > > able to yet. Only the following sequence has been reproduced:
> > > 
> > > 1. CPU0 sums m_icount.
> > > 2. CPU1 adds a positive idelta (e.g. 32) to m_icount.
> > > 3. CPU1 adds a positive ideltafree (e.g. 32) to m_ifree.
> > > 4. CPU0 sums m_ifree. 
> > 
> > Aha, that was my second guess as to what was really going on.
> > 
> > Either way, we're racing with updates to two percpu counters.
> > Now that the source of the bug has been clarified...
> > 
> > > > Now CPU0 has an ifree > icount, which it writes into the primary
> > > > superblock buffer.  Eventually the AIL writes the buffer to disk, only
> > > > the write verifier trips over icount < ifree and shuts down the fs.
> > > > 
> > > > > If we have an unclean shutdown, this will be corrected by
> > > > > xfs_initialize_perag_data() rebuilding the counters from the AGF block
> > > > > counts, but that correction happens after log recovery. During log
> > > > > recovery, incorrect ifree/icount may be restored from the log and
> > > > > written to the sb; since ifree and icount have not been corrected at
> > > > > this point, the sb write check will fail due to ifree > icount.
> > > > > 
> > > > > Guarantee that ifree is never logged as being greater than icount.
> > > > > Neither icount nor ifree will be accurate if we are racing with other
> > > > > updates, but this guarantees that what we write to the journal won't
> > > > > trigger corruption warnings.
> > > > > 
> > > > > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> > > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > > > > ---
> > > > > v2:
> > > > > - Add scripts that could reproduce the problem
> > > > > - Guarantee that ifree will never be logged as being greater than icount
> > > > > 
> > > > >  fs/xfs/libxfs/xfs_sb.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > > > > index a20cade590e9..1eeecf2eb2a7 100644
> > > > > --- a/fs/xfs/libxfs/xfs_sb.c
> > > > > +++ b/fs/xfs/libxfs/xfs_sb.c
> > > > > @@ -972,7 +972,9 @@ xfs_log_sb(
> > > > >  	 */
> > > > >  	if (xfs_has_lazysbcount(mp)) {
> > > > >  		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
> > > > > -		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
> > > > > +		mp->m_sb.sb_ifree = min_t(uint64_t,
> > > > > +				percpu_counter_sum(&mp->m_ifree),
> > > > > +				mp->m_sb.sb_icount);
> > > > 
> > > > This part looks plausible, but I think xfs_unmountfs really ought to
> > > > check that m_ifree < m_icount after it's quiesced the rest of the
> > > > filesystem and freed the reserve block pool.  If ifree is still larger
> > > > than icount, someone has corrupted the incore counters, so we should not
> > > > write a clean unmount record.
> > 
> > ...please update the patch to include this sanity check at unmount so
> > that I can get this bugfix moving towards upstream.
> 
> I have some questions about this. If we can guarantee that
> m_ifree <= m_icount, why do we need to add a check at unmount?

The code change above guarantees that sb_ifree <= sb_icount (the ondisk
superblock inode counters) when the filesystem wants to update the
primary ondisk superblock.  It does not make any guarantee about the
relationship between m_ifree and m_icount (the incore counters).  As you
point out, the percpu counters can temporarily violate that constraint
when there are other threads allocating or freeing inode chunks.

However, the constraint that m_ifree <= m_icount must be satisfied any
time that there /cannot/ be other threads allocating or freeing inode
chunks.  If the constraint is violated under these circumstances, we
have clear evidence of software bugs or memory corruption.  That is
grounds for shutting down the filesystem immediately.

> On the other hand, if m_ifree > m_icount, an sb write will trigger
> a shutdown in xfs_validate_sb_write() because the check does

It won't, because xfs_validate_sb_write checks sb_ifree and sb_icount,
not m_ifree and m_icount.  The min_t() in your proposed change protects
any m_ifree > m_icount from being written into the ondisk superblock.
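
To put made-up numbers on that, the proposed hunk behaves like:

	/* Illustration only -- either raw sum can be stale. */
	mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount); /* 1000 */
	mp->m_sb.sb_ifree = min_t(uint64_t,
			percpu_counter_sum(&mp->m_ifree),  /* raced: 1032 */
			mp->m_sb.sb_icount);               /* logged: 1000 */

Either sum may be stale, but the pair that lands in the ondisk
superblock stays self-consistent.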

> not pass, and therefore a clean unmount record will not be written
> at unmount. I also haven't found a place in the code suitable for
> adding such a check.

xfs_unmountfs just prior to unmounting the log.

--D

> > 
> > --D
> > 
> > > > 
> > > > --D
> > > > 
> > > > >  		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
> > > > >  	}
> > > > >  
> > > > > -- 
> > > > > 2.31.1
> > > > > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-27 16:05           ` Darrick J. Wong
@ 2022-10-29  7:16             ` Long Li
  2022-10-30 22:04               ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Long Li @ 2022-10-29  7:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: billodo, chandan.babu, dchinner, guoxuenan, houtao1, linux-xfs,
	sandeen, yi.zhang

On Thu, Oct 27, 2022 at 09:05:21AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 27, 2022 at 09:25:04PM +0800, Long Li wrote:
> > On Wed, Oct 26, 2022 at 11:52:29AM -0700, Darrick J. Wong wrote:
> > > On Wed, Oct 26, 2022 at 05:13:44PM +0800, Long Li wrote:
> > > > On Tue, Oct 25, 2022 at 11:16:39AM -0700, Darrick J. Wong wrote:
> > > > > On Tue, Oct 25, 2022 at 05:15:27PM +0800, Long Li wrote:
> > > > > > When lazysbcount is enabled, fsstress and loop mount/unmount tests report
> > > > > > the following problem:
> > > > > > 
> > > > > > XFS (loop0): SB summary counter sanity check failed
> > > > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
> > > > > > 	xfs_sb block 0x0
> > > > > > XFS (loop0): Unmount and run xfs_repair
> > > > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > > > XFS (loop0): log mount/recovery failed: error -117
> > > > > > XFS (loop0): log mount failed
> > > > > > 
> > > > > > This makes the file system unmountable; the cause of the problem is
> > > > > > that, during log recovery, incorrect counts (ifree > icount) are
> > > > > > recovered from the log and fail to pass the boundary check in
> > > > > > xfs_validate_sb_write(). The following script can reproduce the problem,
> > > > > > but it may take a long time.
> > > > > > 
> > > > > > device=/dev/sda
> > > > > > testdir=/mnt/test
> > > > > > round=0
> > > > > > 
> > > > > > function fail()
> > > > > > {
> > > > > > 	echo "$*"
> > > > > > 	exit 1
> > > > > > }
> > > > > > 
> > > > > > mkdir -p $testdir
> > > > > > while [ $round -lt 10000 ]
> > > > > > do
> > > > > > 	echo "******* round $round ********"
> > > > > > 	mkfs.xfs -f $device
> > > > > > 	mount $device $testdir || fail "mount failed!"
> > > > > > 	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
> > > > > 
> > > > > What is the backtrace of the xfs_log_sb caller?  I speculate that it's
> > > > > something along the lines of adding a superblock feature?  attr2 would
> > > > > be my guess since this is fsstress.
> > > > 
> > > > The call trace that I reproduced:
> > > >  Call Trace:
> > > >   <TASK>
> > > >   dump_stack_lvl+0x4d/0x66
> > > >   xfs_log_sb.cold+0x2f/0x1af
> > > >   xfs_bmap_add_attrfork+0x687/0xb40
> > > >   ? get_reg+0x91/0x190
> > > >   ? xfs_bmap_add_attrfork+0x0/0xb40
> > > >   ? unwind_next_frame+0x115d/0x1b70
> > > >   ? xfs_attr_calc_size+0x13c/0x2e0
> > > >   xfs_attr_set+0xb51/0x1d50
> > > >   ? __kernel_text_address+0xe/0x30
> > > >   ? xfs_attr_set+0x0/0x1d50
> > > >   ? __kernel_text_address+0xe/0x30
> > > >   ? unwind_get_return_address+0x5f/0xa0
> > > >   ? stack_trace_consume_entry+0x0/0x160
> > > >   ? arch_stack_walk+0x98/0xf0
> > > >   xfs_attr_change+0x22d/0x380
> > > >   xfs_xattr_set+0xeb/0x160
> > > >   ? xfs_xattr_set+0x0/0x160
> > > >   ? vmemdup_user+0x27/0xa0
> > > >   ? setxattr_copy+0x103/0x1a0
> > > >   ? setxattr+0xd1/0x160
> > > >   ? path_setxattr+0x168/0x190
> > > >   ? __x64_sys_setxattr+0xc5/0x160
> > > >   ? xattr_resolve_name+0x23d/0x360
> > > >   ? xfs_xattr_set+0x0/0x160
> > > >   __vfs_setxattr+0x100/0x160
> > > >   ? __vfs_setxattr+0x0/0x160
> > > >   __vfs_setxattr_noperm+0x104/0x320
> > > >   __vfs_setxattr_locked+0x1ba/0x260
> > > > 
> > > > > 
> > > > > So the other racing thread would be a thread that just freed an inode
> > > > > cluster, committed the transaction, and now it's committing idelta and
> > > > > ifreedelta into the incore percpu counters via:
> > > > > 
> > > > > 	if (idelta)
> > > > > 		percpu_counter_add_batch(&mp->m_icount, idelta,
> > > > > 					 XFS_ICOUNT_BATCH);
> > > > > 
> > > > > 	if (ifreedelta)
> > > > > 		percpu_counter_add(&mp->m_ifree, ifreedelta);
> > > > > 
> > > > > > 	sleep 4
> > > > > > 	killall -w fsstress
> > > > > > 	umount $testdir
> > > > > > 	xfs_repair -e $device > /dev/null
> > > > > > 	if [ $? -eq 2 ];then
> > > > > > 		echo "ERR CODE 2: Dirty log exception during repair."
> > > > > > 		exit 1
> > > > > > 	fi
> > > > > > 	round=$(($round+1))
> > > > > > done
> > > > > > 
> > > > > > When lazysbcount is enabled, there is no additional lock protection for
> > > > > > reading m_ifree and m_icount in xfs_log_sb(); if another CPU modifies
> > > > > > m_ifree between the two sums, the summed m_ifree can end up greater
> > > > > > than the summed m_icount and be written to the log. For example,
> > > > > > consider the following sequence:
> > > > > > 
> > > > > >  CPU0				 CPU1
> > > > > >  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
> > > > > >  ----------			 ------------------------------
> > > > > >  percpu_counter_sum(&mp->m_icount)
> > > > > > 				 percpu_counter_add(&mp->m_icount, idelta)
> > > > > 
> > > > > This callsite does not exist ^^^^^^^^^^^ in the codebase, AFAICT.
> > > > > 
> > > > > > 				 percpu_counter_add_batch(&mp->m_icount,
> > > > > > 						idelta, XFS_ICOUNT_BATCH)
> > > > > >  percpu_counter_sum(&mp->m_ifree)
> > > > 
> > > > Sorry, the code I copied was wrong; it should be:
> > > > 
> > > >  CPU0				 CPU1
> > > >  xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
> > > >  ----------			 ------------------------------
> > > >  percpu_counter_sum(&mp->m_icount)
> > > > 				 percpu_counter_add_batch(&mp->m_icount,
> > > > 						idelta, XFS_ICOUNT_BATCH)
> > > > 				 percpu_counter_add(&mp->m_ifree, ifreedelta);
> > > >  percpu_counter_sum(&mp->m_ifree)
> > > > 
> > > > > 
> > > > > I think what's happening here is more like:
> > > > > 
> > > > > 1. CPU1 adds a negative idelta to m_icount.
> > > > > 2. CPU0 sums m_icount.
> > > > > 3. CPU0 sums m_ifree.
> > > > > 4. CPU1 adds a negative ideltafree to m_ifree.
> > > > 
> > > > I tried to reproduce the situation you described, but haven't been
> > > > able to yet. Only the following sequence has been reproduced:
> > > > 
> > > > 1. CPU0 sums m_icount.
> > > > 2. CPU1 adds a positive idelta (e.g. 32) to m_icount.
> > > > 3. CPU1 adds a positive ideltafree (e.g. 32) to m_ifree.
> > > > 4. CPU0 sums m_ifree. 
> > > 
> > > Aha, that was my second guess as to what was really going on.
> > > 
> > > Either way, we're racing with updates to two percpu counters.
> > > Now that the source of the bug has been clarified...
> > > 
> > > > > Now CPU0 has an ifree > icount, which it writes into the primary
> > > > > superblock buffer.  Eventually the AIL writes the buffer to disk, only
> > > > > the write verifier trips over icount < ifree and shuts down the fs.
> > > > > 
> > > > > > If we have an unclean shutdown, this will be corrected by
> > > > > > xfs_initialize_perag_data() rebuilding the counters from the AGF block
> > > > > > counts, but that correction happens after log recovery. During log
> > > > > > recovery, incorrect ifree/icount may be restored from the log and
> > > > > > written to the sb; since ifree and icount have not been corrected at
> > > > > > this point, the sb write check will fail due to ifree > icount.
> > > > > > 
> > > > > > Guarantee that ifree is never logged as being greater than icount.
> > > > > > Neither icount nor ifree will be accurate if we are racing with other
> > > > > > updates, but this guarantees that what we write to the journal won't
> > > > > > trigger corruption warnings.
> > > > > > 
> > > > > > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> > > > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > > > > > ---
> > > > > > v2:
> > > > > > - Add scripts that could reproduce the problem
> > > > > > - Guarantee that ifree will never be logged as being greater than icount
> > > > > > 
> > > > > >  fs/xfs/libxfs/xfs_sb.c | 4 +++-
> > > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > > > > > index a20cade590e9..1eeecf2eb2a7 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_sb.c
> > > > > > +++ b/fs/xfs/libxfs/xfs_sb.c
> > > > > > @@ -972,7 +972,9 @@ xfs_log_sb(
> > > > > >  	 */
> > > > > >  	if (xfs_has_lazysbcount(mp)) {
> > > > > >  		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
> > > > > > -		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
> > > > > > +		mp->m_sb.sb_ifree = min_t(uint64_t,
> > > > > > +				percpu_counter_sum(&mp->m_ifree),
> > > > > > +				mp->m_sb.sb_icount);
> > > > > 
> > > > > This part looks plausible, but I think xfs_unmountfs really ought to
> > > > > check that m_ifree < m_icount after it's quiesced the rest of the
> > > > > filesystem and freed the reserve block pool.  If ifree is still larger
> > > > > than icount, someone has corrupted the incore counters, so we should not
> > > > > write a clean unmount record.
> > > 
> > > ...please update the patch to include this sanity check at unmount so
> > > that I can get this bugfix moving towards upstream.
> > 
> > I have some questions about this. If we can guarantee that
> > m_ifree <= m_icount, why do we need to add a check at unmount?
> 
> The code change above guarantees that sb_ifree <= sb_icount (the ondisk
> superblock inode counters) when the filesystem wants to update the
> primary ondisk superblock.  It does not make any guarantee about the
> relationship between m_ifree and m_icount (the incore counters).  As you
> point out, the percpu counters can temporarily violate that constraint
> when there are other threads allocating or freeing inode chunks.
> 
> However, the constraint that m_ifree <= m_icount must be satisfied any
> time that there /cannot/ be other threads allocating or freeing inode
> chunks.  If the constraint is violated under these circumstances, we
> have clear evidence of software bugs or memory corruption.  That is
> grounds for shutting down the filesystem immediately.
> 
> > On the other hand, if m_ifree > m_icount, an sb write will trigger
> > a shutdown in xfs_validate_sb_write() because the check does
> 
> It won't, because xfs_validate_sb_write checks sb_ifree and sb_icount,
> not m_ifree and m_icount.  The min_t() in your proposed change protects
> any m_ifree > m_icount from being written into the ondisk superblock.
> 
> > not pass, and therefore a clean unmount record will not be written
> > at unmount. I also haven't found a place in the code suitable for
> > adding such a check.
> 
> xfs_unmountfs just prior to unmounting the log.


I tried to add an extra check in xfs_log_unmount_write: when m_icount <
m_ifree, it will not write an unmount log record, after which the summary
counters will be recalculated at the next mount. If m_ifree is greater
than m_icount in memory, sb_i{count,free} (the ondisk superblock inode
counters) may be incorrect even after unmounting the filesystem. With
such a check added, this can be corrected at the next mount instead of
going undetected in subsequent mounts.

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f1f44c006ab3..e4903c15019e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1038,7 +1038,9 @@ xfs_log_unmount_write(
 	 * more details.
 	 */
 	if (XFS_TEST_ERROR(xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS), mp,
-			XFS_ERRTAG_FORCE_SUMMARY_RECALC)) {
+			XFS_ERRTAG_FORCE_SUMMARY_RECALC) ||
+			(percpu_counter_sum(&mp->m_icount) <
+			 percpu_counter_sum(&mp->m_ifree))) {
 		xfs_alert(mp, "%s: will fix summary counters at next mount",
 				__func__);
 		return;


> 
> --D
> 
> > > 
> > > --D
> > > 
> > > > > 
> > > > > --D
> > > > > 
> > > > > >  		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
> > > > > >  	}
> > > > > >  
> > > > > > -- 
> > > > > > 2.31.1
> > > > > > 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-29  7:16             ` Long Li
@ 2022-10-30 22:04               ` Dave Chinner
  2022-10-31 14:17                 ` Long Li
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2022-10-30 22:04 UTC (permalink / raw)
  To: Long Li
  Cc: Darrick J. Wong, billodo, chandan.babu, dchinner, guoxuenan,
	houtao1, linux-xfs, sandeen, yi.zhang

On Sat, Oct 29, 2022 at 03:16:01PM +0800, Long Li wrote:
> On Thu, Oct 27, 2022 at 09:05:21AM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 27, 2022 at 09:25:04PM +0800, Long Li wrote:
> > > not pass, and therefore a clean unmount record will not be written
> > > at unmount. I also haven't found a place in the code suitable for
> > > adding such a check.
> > 
> > xfs_unmountfs just prior to unmounting the log.
> 
> 
> I tried to add an extra check in xfs_log_unmount_write: when m_icount <
> m_ifree, it will not write an unmount log record, after which the summary
> counters will be recalculated at the next mount. If m_ifree is greater
> than m_icount in memory, sb_i{count,free} (the ondisk superblock inode
> counters) may be incorrect even after unmounting the filesystem. With
> such a check added, this can be corrected at the next mount instead of
> going undetected in subsequent mounts.
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f1f44c006ab3..e4903c15019e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1038,7 +1038,9 @@ xfs_log_unmount_write(
>  	 * more details.
>  	 */
>  	if (XFS_TEST_ERROR(xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS), mp,
> -			XFS_ERRTAG_FORCE_SUMMARY_RECALC)) {
> +			XFS_ERRTAG_FORCE_SUMMARY_RECALC) ||
> +			(percpu_counter_sum(&mp->m_icount) <
> +			 percpu_counter_sum(&mp->m_ifree))) {
>  		xfs_alert(mp, "%s: will fix summary counters at next mount",
>  				__func__);
>  		return;

The log code is not the layer at which the mount structures
should be verified. xfs_unmountfs() is where the mount is cleaned up
and all activity is flushed and waited on. This is where the mount
counters should be checked, before we unmount the log.

Indeed, if you check the mount counters prior to calling
xfs_log_unmount_write(), you could call this:

	xfs_alert(mp, "ifree/icount mismatch at unmount");
	xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS);

i.e. check the mount state at the correct level and propagate the
sickness into the mount state and the log code will just do the
right thing....
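
As a rough sketch (exact placement within xfs_unmountfs() and the
message wording are illustrative, not a tested patch):

	/*
	 * Once the filesystem is quiesced, nothing can be allocating
	 * or freeing inode chunks, so the incore counters must obey
	 * ifree <= icount.
	 */
	if (percpu_counter_sum(&mp->m_ifree) >
	    percpu_counter_sum(&mp->m_icount)) {
		xfs_alert(mp, "ifree/icount mismatch at unmount");
		xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS);
	}

The existing XFS_SICK_FS_COUNTERS check in xfs_log_unmount_write() will
then skip the clean unmount record on its own.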

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] xfs: fix sb write verify for lazysbcount
  2022-10-30 22:04               ` Dave Chinner
@ 2022-10-31 14:17                 ` Long Li
  0 siblings, 0 replies; 17+ messages in thread
From: Long Li @ 2022-10-31 14:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, billodo, chandan.babu, dchinner, guoxuenan,
	houtao1, linux-xfs, sandeen, yi.zhang

On Mon, Oct 31, 2022 at 09:04:41AM +1100, Dave Chinner wrote:
> On Sat, Oct 29, 2022 at 03:16:01PM +0800, Long Li wrote:
> > On Thu, Oct 27, 2022 at 09:05:21AM -0700, Darrick J. Wong wrote:
> > > On Thu, Oct 27, 2022 at 09:25:04PM +0800, Long Li wrote:
> > > > not pass, and therefore a clean unmount record will not be written
> > > > at unmount. I also haven't found a place in the code suitable for
> > > > adding such a check.
> > > 
> > > xfs_unmountfs just prior to unmounting the log.
> > 
> > 
> > I tried to add an extra check in xfs_log_unmount_write: when m_icount <
> > m_ifree, it will not write an unmount log record, after which the summary
> > counters will be recalculated at the next mount. If m_ifree is greater
> > than m_icount in memory, sb_i{count,free} (the ondisk superblock inode
> > counters) may be incorrect even after unmounting the filesystem. With
> > such a check added, this can be corrected at the next mount instead of
> > going undetected in subsequent mounts.
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index f1f44c006ab3..e4903c15019e 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -1038,7 +1038,9 @@ xfs_log_unmount_write(
> >  	 * more details.
> >  	 */
> >  	if (XFS_TEST_ERROR(xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS), mp,
> > -			XFS_ERRTAG_FORCE_SUMMARY_RECALC)) {
> > +			XFS_ERRTAG_FORCE_SUMMARY_RECALC) ||
> > +			(percpu_counter_sum(&mp->m_icount) <
> > +			 percpu_counter_sum(&mp->m_ifree))) {
> >  		xfs_alert(mp, "%s: will fix summary counters at next mount",
> >  				__func__);
> >  		return;
> 
> The log code is not the layer at which the mount structures
> should be verified. xfs_unmountfs() is where the mount is cleaned up
> and all activity is flushed and waited on. This is where the mount
> counters should be checked, before we unmount the log.
> 
> Indeed, if you check the mount counters prior to calling
> xfs_log_unmount_write(), you could call this:
> 
> 	xfs_alert(mp, "ifree/icount mismatch at unmount");
> 	xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS);
> 
> i.e. check the mount state at the correct level and propagate the
> sickness into the mount state and the log code will just do the
> right thing....
> 
> Cheers,
> 
> Dave.
 
Ok, I'll resend the patch with the above issue fixed. Thanks a lot.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-10-31 13:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-22  2:03 [PATCH v1] xfs: fix sb write verify for lazysbcount Long Li
2022-10-22  2:14 ` Darrick J. Wong
2022-10-22 12:01   ` Long Li
2022-10-22 21:16     ` Dave Chinner
2022-10-24  4:07       ` Darrick J. Wong
2022-10-24  5:43         ` Dave Chinner
2022-10-24 12:28           ` Long Li
2022-10-24  5:06       ` Long Li
2022-10-25  9:15 ` [PATCH v2] " Long Li
2022-10-25 18:16   ` Darrick J. Wong
2022-10-26  9:13     ` Long Li
2022-10-26 18:52       ` Darrick J. Wong
2022-10-27 13:25         ` Long Li
2022-10-27 16:05           ` Darrick J. Wong
2022-10-29  7:16             ` Long Li
2022-10-30 22:04               ` Dave Chinner
2022-10-31 14:17                 ` Long Li
