All of lore.kernel.org
* [PATCHSET 0/2] xfs: fix corruption of free rt extent count
@ 2022-04-07 20:46 Darrick J. Wong
  2022-04-07 20:46 ` [PATCH 1/2] xfs: recalculate free rt extents after log recovery Darrick J. Wong
  2022-04-07 20:47 ` [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations Darrick J. Wong
  0 siblings, 2 replies; 10+ messages in thread
From: Darrick J. Wong @ 2022-04-07 20:46 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david

Hi all,

I've been noticing sporadic failures with djwong-dev with xfs/141, which
is a looping log recovery test.  The primary symptoms have been that
online fsck reports incorrect free extent counts after some number of
recovery loops.  The root cause seems to be the use of sb_frextents in
the xfs_mount for incore reservations to transactions -- if someone
calls xfs_log_sb while there's a transaction with an rtx reservation
running, the artificially low value will then get logged to disk!
If that's the /last/ time anyone logs the superblock before the log
goes down, then recovery will replay the incorrect value into the live
superblock.  Effectively, we leak the rt extents.

So, the first thing to do is to fix log recovery to recompute frextents
from the rt bitmap so that we can catch and correct ondisk metadata; and
the second fix is to create a percpu counter to track both ondisk and
incore rtextent counts, similar to how m_fdblocks relates to
sb_fdblocks.

The next thing to do after this is to fix xfs_repair to check frextents
and the rt bitmap/summary files, since it doesn't check the ondisk
values against its own observations.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=frextents-fixes-5.18
---
 fs/xfs/xfs_fsops.c   |    5 +---
 fs/xfs/xfs_icache.c  |    9 ++++---
 fs/xfs/xfs_mount.c   |   38 +++++++++++++++++++++-------
 fs/xfs/xfs_mount.h   |    2 +
 fs/xfs/xfs_rtalloc.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_super.c   |   14 +++++++++-
 fs/xfs/xfs_trans.c   |   37 ++++++++++++++++++++++-----
 7 files changed, 146 insertions(+), 28 deletions(-)



* [PATCH 1/2] xfs: recalculate free rt extents after log recovery
  2022-04-07 20:46 [PATCHSET 0/2] xfs: fix corruption of free rt extent count Darrick J. Wong
@ 2022-04-07 20:46 ` Darrick J. Wong
  2022-04-07 21:56   ` Dave Chinner
  2022-04-07 20:47 ` [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations Darrick J. Wong
  1 sibling, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2022-04-07 20:46 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david

From: Darrick J. Wong <djwong@kernel.org>

I've been observing periodic corruption reports from xfs_scrub involving
the free rt extent counter (frextents) while running xfs/141.  That test
uses an error injection knob to induce a torn write to the log, and after
an arbitrary number of recovery mounts, frextents will count fewer free
rt extents than can be found in the rtbitmap.

The root cause of the problem is the misuse of sb_frextents in the
incore mount to reflect both incore reservations made by running
transactions and the actual count of free rt extents on disk.  The
following sequence can reproduce the undercount:

Thread 1			Thread 2
xfs_trans_alloc(rtextents=3)
xfs_mod_frextents(-3)
<blocks>
				xfs_attr_set()
				xfs_bmap_add_attrfork()
				xfs_sbversion_add_attr2()
				xfs_log_sb()
				xfs_sb_to_disk()
				xfs_trans_commit()
<log flushed to disk>
<log goes down>

Note that thread 1 subtracts 3 from sb_frextents even though it never
commits to using that space.  Thread 2 writes the undercounted value to
the ondisk superblock and logs it to the xattr transaction, which is
then flushed to disk.  At next mount, log recovery will find the logged
superblock and write that back into the filesystem.  At the end of log
recovery, we reread the superblock and install the recovered
undercounted frextents value into the incore superblock.  From that
point on, we've effectively leaked thread 1's transaction reservation.

The correct fix for this is to separate the incore reservation from the
ondisk usage, but that's a matter for the next patch.  Because the
kernel has been logging superblocks with undercounted frextents for a
very long time and we don't demand that sysadmins run xfs_repair after a
crash, fix the undercount by recomputing frextents after log recovery.

Gating this on log recovery is a reasonable balance (I think) between
correcting the problem and slowing down every mount attempt.  Note that
xfs_repair will fix undercounted frextents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_rtalloc.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 64 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index b8c79ee791af..c9ab4f333472 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -19,6 +19,7 @@
 #include "xfs_icache.h"
 #include "xfs_rtalloc.h"
 #include "xfs_sb.h"
+#include "xfs_log_priv.h"
 
 /*
  * Read and return the summary information for a given extent size,
@@ -1284,6 +1285,43 @@ xfs_rtmount_init(
 	return 0;
 }
 
+static inline int
+xfs_rtalloc_count_frextent(
+	struct xfs_trans		*tp,
+	const struct xfs_rtalloc_rec	*rec,
+	void				*priv)
+{
+	uint64_t			*valp = priv;
+
+	*valp += rec->ar_extcount;
+	return 0;
+}
+
+/* Reinitialize the number of free realtime extents from the realtime bitmap. */
+STATIC int
+xfs_rtalloc_reinit_frextents(
+	struct xfs_mount	*mp)
+{
+	struct xfs_trans	*tp;
+	uint64_t		val = 0;
+	int			error;
+
+	error = xfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL);
+	error = xfs_rtalloc_query_all(tp, xfs_rtalloc_count_frextent, &val);
+	xfs_iunlock(mp->m_rbmip, XFS_ILOCK_EXCL);
+	xfs_trans_cancel(tp);
+	if (error)
+		return error;
+
+	spin_lock(&mp->m_sb_lock);
+	mp->m_sb.sb_frextents = val;
+	spin_unlock(&mp->m_sb_lock);
+	return 0;
+}
 /*
  * Get the bitmap and summary inodes and the summary cache into the mount
  * structure at mount time.
@@ -1302,13 +1340,35 @@ xfs_rtmount_inodes(
 	ASSERT(mp->m_rbmip != NULL);
 
 	error = xfs_iget(mp, NULL, sbp->sb_rsumino, 0, 0, &mp->m_rsumip);
-	if (error) {
-		xfs_irele(mp->m_rbmip);
-		return error;
-	}
+	if (error)
+		goto out_rbm;
 	ASSERT(mp->m_rsumip != NULL);
+
+	/*
+	 * Older kernels misused sb_frextents to reflect both incore
+	 * reservations made by running transactions and the actual count of
+	 * free rt extents in the ondisk metadata.  Transactions committed
+	 * during runtime can therefore contain a superblock update that
+	 * undercounts the number of free rt extents tracked in the rt bitmap.
+	 * A clean unmount record will have the correct frextents value since
+	 * there can be no other transactions running at that point.
+	 *
+	 * If we're mounting the rt volume after recovering the log, recompute
+	 * frextents from the rtbitmap file to fix the inconsistency.
+	 */
+	if (xfs_has_realtime(mp) && xlog_recovery_needed(mp->m_log)) {
+		error = xfs_rtalloc_reinit_frextents(mp);
+		if (error)
+			goto out_rsum;
+	}
+
 	xfs_alloc_rsum_cache(mp, sbp->sb_rbmblocks);
 	return 0;
+out_rsum:
+	xfs_irele(mp->m_rsumip);
+out_rbm:
+	xfs_irele(mp->m_rbmip);
+	return error;
 }
 
 void


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations
  2022-04-07 20:46 [PATCHSET 0/2] xfs: fix corruption of free rt extent count Darrick J. Wong
  2022-04-07 20:46 ` [PATCH 1/2] xfs: recalculate free rt extents after log recovery Darrick J. Wong
@ 2022-04-07 20:47 ` Darrick J. Wong
  2022-04-07 23:17   ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2022-04-07 20:47 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david

From: Darrick J. Wong <djwong@kernel.org>

As mentioned in the previous commit, the kernel misuses sb_frextents in
the incore mount to reflect both incore reservations made by running
transactions and the actual count of free rt extents on disk.
This results in the superblock being written to the log with an
underestimate of the number of rt extents that are marked free in the
rtbitmap.

Teaching XFS to recompute frextents after log recovery avoids
operational problems in the current mount, but it doesn't solve the
problem of us writing undercounted frextents which are then recovered by
an older kernel that doesn't have that fix.

Create an incore percpu counter to mirror the ondisk frextents.  This
new counter will track transaction reservations and the only time we
will touch the incore super counter (i.e. the one that gets logged) is
when those transactions commit updates to the rt bitmap.  This is in
contrast to the lazysbcount counters (e.g. fdblocks), where we know that
log recovery will always fix any incorrect counter that we log.
As a bonus, we only take m_sb_lock at transaction commit time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c   |    5 +----
 fs/xfs/xfs_icache.c  |    9 ++++++---
 fs/xfs/xfs_mount.c   |   38 +++++++++++++++++++++++++++++---------
 fs/xfs/xfs_mount.h   |    2 ++
 fs/xfs/xfs_rtalloc.c |    1 +
 fs/xfs/xfs_super.c   |   14 ++++++++++++--
 fs/xfs/xfs_trans.c   |   37 +++++++++++++++++++++++++++++++------
 7 files changed, 82 insertions(+), 24 deletions(-)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 68f74549fa22..a0d7aa7fbbff 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -349,10 +349,7 @@ xfs_fs_counts(
 	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
 	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
 						xfs_fdblocks_unavailable(mp);
-
-	spin_lock(&mp->m_sb_lock);
-	cnt->freertx = mp->m_sb.sb_frextents;
-	spin_unlock(&mp->m_sb_lock);
+	cnt->freertx = percpu_counter_read_positive(&mp->m_frextents);
 }
 
 /*
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 6c7267451b82..f16ccc3b0e98 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1916,13 +1916,16 @@ xfs_inodegc_want_queue_rt_file(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	uint64_t		freertx;
 
 	if (!XFS_IS_REALTIME_INODE(ip))
 		return false;
 
-	freertx = READ_ONCE(mp->m_sb.sb_frextents);
-	return freertx < mp->m_low_rtexts[XFS_LOWSP_5_PCNT];
+	if (__percpu_counter_compare(&mp->m_frextents,
+				mp->m_low_rtexts[XFS_LOWSP_5_PCNT],
+				XFS_FDBLOCKS_BATCH) < 0)
+		return true;
+
+	return false;
 }
 #else
 # define xfs_inodegc_want_queue_rt_file(ip)	(false)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index c5f153c3693f..d5463728c305 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1183,17 +1183,37 @@ xfs_mod_frextents(
 	struct xfs_mount	*mp,
 	int64_t			delta)
 {
-	int64_t			lcounter;
-	int			ret = 0;
+	int			batch;
 
-	spin_lock(&mp->m_sb_lock);
-	lcounter = mp->m_sb.sb_frextents + delta;
-	if (lcounter < 0)
-		ret = -ENOSPC;
+	if (delta > 0) {
+		percpu_counter_add(&mp->m_frextents, delta);
+		return 0;
+	}
+
+	/*
+	 * Taking blocks away, need to be more accurate the closer we
+	 * are to zero.
+	 *
+	 * If the counter has a value of less than 2 * max batch size,
+	 * then make everything serialise as we are real close to
+	 * ENOSPC.
+	 */
+	if (__percpu_counter_compare(&mp->m_frextents, 2 * XFS_FDBLOCKS_BATCH,
+				     XFS_FDBLOCKS_BATCH) < 0)
+		batch = 1;
 	else
-		mp->m_sb.sb_frextents = lcounter;
-	spin_unlock(&mp->m_sb_lock);
-	return ret;
+		batch = XFS_FDBLOCKS_BATCH;
+
+	percpu_counter_add_batch(&mp->m_frextents, delta, batch);
+	if (__percpu_counter_compare(&mp->m_frextents, 0,
+				     XFS_FDBLOCKS_BATCH) >= 0) {
+		/* we had space! */
+		return 0;
+	}
+
+	/* oops, negative free space, put that back! */
+	percpu_counter_add(&mp->m_frextents, -delta);
+	return -ENOSPC;
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f6dc19de8322..b1833f71fb3e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -183,6 +183,8 @@ typedef struct xfs_mount {
 	struct percpu_counter	m_icount;	/* allocated inodes counter */
 	struct percpu_counter	m_ifree;	/* free inodes counter */
 	struct percpu_counter	m_fdblocks;	/* free block counter */
+	struct percpu_counter	m_frextents;	/* free rt extent counter */
+
 	/*
 	 * Count of data device blocks reserved for delayed allocations,
 	 * including indlen blocks.  Does not include allocated CoW staging
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index c9ab4f333472..3673be83391a 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1320,6 +1320,7 @@ xfs_rtalloc_reinit_frextents(
 	spin_lock(&mp->m_sb_lock);
 	mp->m_sb.sb_frextents = val;
 	spin_unlock(&mp->m_sb_lock);
+	percpu_counter_set(&mp->m_frextents, mp->m_sb.sb_frextents);
 	return 0;
 }
 /*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 54be9d64093e..cc95768eb8e1 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -843,9 +843,11 @@ xfs_fs_statfs(
 
 	if (XFS_IS_REALTIME_MOUNT(mp) &&
 	    (ip->i_diflags & (XFS_DIFLAG_RTINHERIT | XFS_DIFLAG_REALTIME))) {
+		s64	freertx;
+
 		statp->f_blocks = sbp->sb_rblocks;
-		statp->f_bavail = statp->f_bfree =
-			sbp->sb_frextents * sbp->sb_rextsize;
+		freertx = max_t(s64, 0, percpu_counter_sum(&mp->m_frextents));
+		statp->f_bavail = statp->f_bfree = freertx * sbp->sb_rextsize;
 	}
 
 	return 0;
@@ -1015,8 +1017,14 @@ xfs_init_percpu_counters(
 	if (error)
 		goto free_fdblocks;
 
+	error = percpu_counter_init(&mp->m_frextents, 0, GFP_KERNEL);
+	if (error)
+		goto free_delalloc;
+
 	return 0;
 
+free_delalloc:
+	percpu_counter_destroy(&mp->m_delalloc_blks);
 free_fdblocks:
 	percpu_counter_destroy(&mp->m_fdblocks);
 free_ifree:
@@ -1033,6 +1041,7 @@ xfs_reinit_percpu_counters(
 	percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
 	percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
 	percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
+	percpu_counter_set(&mp->m_frextents, mp->m_sb.sb_frextents);
 }
 
 static void
@@ -1045,6 +1054,7 @@ xfs_destroy_percpu_counters(
 	ASSERT(xfs_is_shutdown(mp) ||
 	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
 	percpu_counter_destroy(&mp->m_delalloc_blks);
+	percpu_counter_destroy(&mp->m_frextents);
 }
 
 static int
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 0ac717aad380..63a4d3a24340 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -498,10 +498,31 @@ xfs_trans_apply_sb_deltas(
 			be64_add_cpu(&sbp->sb_fdblocks, tp->t_res_fdblocks_delta);
 	}
 
-	if (tp->t_frextents_delta)
-		be64_add_cpu(&sbp->sb_frextents, tp->t_frextents_delta);
-	if (tp->t_res_frextents_delta)
-		be64_add_cpu(&sbp->sb_frextents, tp->t_res_frextents_delta);
+	/*
+	 * Updating frextents requires careful handling because it does not
+	 * behave like the lazysb counters: we cannot rely on log recovery
+	 * in older kernels to recompute the value from the rtbitmap.
+	 * This means that the ondisk frextents must be consistent with the
+	 * rtbitmap.
+	 *
+	 * Therefore, log the frextents change to the ondisk superblock and
+	 * update the incore superblock so that future calls to xfs_log_sb
+	 * write the correct value ondisk.
+	 *
+	 * Don't touch m_frextents because it includes incore reservations,
+	 * and those are handled by the unreserve function.
+	 */
+	if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
+		struct xfs_mount	*mp = tp->t_mountp;
+		int64_t			rtxdelta;
+
+		rtxdelta = tp->t_frextents_delta + tp->t_res_frextents_delta;
+
+		spin_lock(&mp->m_sb_lock);
+		be64_add_cpu(&sbp->sb_frextents, rtxdelta);
+		mp->m_sb.sb_frextents += rtxdelta;
+		spin_unlock(&mp->m_sb_lock);
+	}
 
 	if (tp->t_dblocks_delta) {
 		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
@@ -614,7 +635,12 @@ xfs_trans_unreserve_and_mod_sb(
 	if (ifreedelta)
 		percpu_counter_add(&mp->m_ifree, ifreedelta);
 
-	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
+	if (rtxdelta) {
+		error = xfs_mod_frextents(mp, rtxdelta);
+		ASSERT(!error);
+	}
+
+	if (!(tp->t_flags & XFS_TRANS_SB_DIRTY))
 		return;
 
 	/* apply remaining deltas */
@@ -622,7 +648,6 @@ xfs_trans_unreserve_and_mod_sb(
 	mp->m_sb.sb_fdblocks += tp->t_fdblocks_delta + tp->t_res_fdblocks_delta;
 	mp->m_sb.sb_icount += idelta;
 	mp->m_sb.sb_ifree += ifreedelta;
-	mp->m_sb.sb_frextents += rtxdelta;
 	mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
 	mp->m_sb.sb_agcount += tp->t_agcount_delta;
 	mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;



* Re: [PATCH 1/2] xfs: recalculate free rt extents after log recovery
  2022-04-07 20:46 ` [PATCH 1/2] xfs: recalculate free rt extents after log recovery Darrick J. Wong
@ 2022-04-07 21:56   ` Dave Chinner
  2022-04-07 23:39     ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-04-07 21:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 07, 2022 at 01:46:56PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> I've been observing periodic corruption reports from xfs_scrub involving
> the free rt extent counter (frextents) while running xfs/141.  That test
> uses an error injection knob to induce a torn write to the log, and after
> an arbitrary number of recovery mounts, frextents will count fewer free
> rt extents than can be found in the rtbitmap.
> 
> The root cause of the problem is the misuse of sb_frextents in the
> incore mount to reflect both incore reservations made by running
> transactions and the actual count of free rt extents on disk.  The
> following sequence can reproduce the undercount:
> 
> Thread 1			Thread 2
> xfs_trans_alloc(rtextents=3)
> xfs_mod_frextents(-3)
> <blocks>
> 				xfs_attr_set()
> 				xfs_bmap_add_attrfork()
> 				xfs_sbversion_add_attr2()
> 				xfs_log_sb()
> 				xfs_sb_to_disk()
> 				xfs_trans_commit()
> <log flushed to disk>
> <log goes down>
> 
> Note that thread 1 subtracts 3 from sb_frextents even though it never
> commits to using that space.  Thread 2 writes the undercounted value to
> the ondisk superblock and logs it to the xattr transaction, which is
> then flushed to disk.  At next mount, log recovery will find the logged
> superblock and write that back into the filesystem.  At the end of log
> recovery, we reread the superblock and install the recovered
> undercounted frextents value into the incore superblock.  From that
> point on, we've effectively leaked thread 1's transaction reservation.
> 
> The correct fix for this is to separate the incore reservation from the
> ondisk usage, but that's a matter for the next patch.  Because the
> kernel has been logging superblocks with undercounted frextents for a
> very long time and we don't demand that sysadmins run xfs_repair after a
> crash, fix the undercount by recomputing frextents after log recovery.
> 
> Gating this on log recovery is a reasonable balance (I think) between
> correcting the problem and slowing down every mount attempt.  Note that
> xfs_repair will fix undercounted frextents.

That seems reasonable to me. I agree with the method used to fix the
problem, but I have some concerns/questions over where/how it is
implemented.

> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_rtalloc.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 64 insertions(+), 4 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index b8c79ee791af..c9ab4f333472 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -19,6 +19,7 @@
>  #include "xfs_icache.h"
>  #include "xfs_rtalloc.h"
>  #include "xfs_sb.h"
> +#include "xfs_log_priv.h"
>  
>  /*
>   * Read and return the summary information for a given extent size,
> @@ -1284,6 +1285,43 @@ xfs_rtmount_init(
>  	return 0;
>  }
>  
> +static inline int
> +xfs_rtalloc_count_frextent(
> +	struct xfs_trans		*tp,
> +	const struct xfs_rtalloc_rec	*rec,
> +	void				*priv)
> +{
> +	uint64_t			*valp = priv;
> +
> +	*valp += rec->ar_extcount;
> +	return 0;
> +}
> +
> +/* Reinitialize the number of free realtime extents from the realtime bitmap. */
> +STATIC int
> +xfs_rtalloc_reinit_frextents(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_trans	*tp;
> +	uint64_t		val = 0;
> +	int			error;
> +
> +	error = xfs_trans_alloc_empty(mp, &tp);
> +	if (error)
> +		return error;

At this point in the mount, the transaction subsystem is not
really available for use. I understand that this is an empty
transaction and it doesn't reserve log space, but I don't like
creating an implicit requirement that xfs_trans_alloc_empty() never
touches log state for this code to work correctly into the future.

That is, the log isn't in a state where we can run normal
transactions because we can still have pending intent recovery
queued in the AIL. These can't be moved from the tail of the AIL
until xfs_log_mount_finish() is called to remove and process them.
They are dependent on there being enough log space remaining for the
transaction reservation they make to succeed and there may only be
just enough space for the first intent to make that reservation.
Hence if we consume any log space before we recover intents, we can
deadlock on log space when then trying to recover the intents. And
because the AIL can't push on intents, the same reservation deadlock
could occur with any transaction reservation we try to run before
xfs_log_mount_finish() has been run.

Yes, I know this is an empty transaction and so it doesn't push on
the log or consume any space. But I think it's a bad precedent even
to use transactions at all when we know we are in the middle of log
recovery and the transaction subsystem is not open for general use
yet.  Hence I think using transaction handles - even empty ones -
before xfs_log_mount_finish() is run is a pattern we want to avoid
if at all possible.

> +
> +	xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL);
> +	error = xfs_rtalloc_query_all(tp, xfs_rtalloc_count_frextent, &val);

Looking at the code here, AFAICT the only thing the tp is absolutely
required for here is to hold the mount pointer - all the internal
bitmap walk operations grab and release buffers using interfaces
that can handle a null tp being passed.

Hence it looks to me like a transaction context isn't necessary for
this read-only bitmap walk. Is that correct?  Converting
xfs_rtalloc_query_all() to take a mount instead of a trans would also
make this code become a lot simpler....

> +	xfs_iunlock(mp->m_rbmip, XFS_ILOCK_EXCL);
> +	xfs_trans_cancel(tp);
> +	if (error)
> +		return error;
> +
> +	spin_lock(&mp->m_sb_lock);
> +	mp->m_sb.sb_frextents = val;
> +	spin_unlock(&mp->m_sb_lock);
> +	return 0;
> +}
>  /*
>   * Get the bitmap and summary inodes and the summary cache into the mount
>   * structure at mount time.
> @@ -1302,13 +1340,35 @@ xfs_rtmount_inodes(
>  	ASSERT(mp->m_rbmip != NULL);
>  
>  	error = xfs_iget(mp, NULL, sbp->sb_rsumino, 0, 0, &mp->m_rsumip);
> -	if (error) {
> -		xfs_irele(mp->m_rbmip);
> -		return error;
> -	}
> +	if (error)
> +		goto out_rbm;

		goto out_rele_bitmap;

>  	ASSERT(mp->m_rsumip != NULL);
> +
> +	/*
> +	 * Older kernels misused sb_frextents to reflect both incore
> +	 * reservations made by running transactions and the actual count of
> +	 * free rt extents in the ondisk metadata.  Transactions committed
> +	 * during runtime can therefore contain a superblock update that
> +	 * undercounts the number of free rt extents tracked in the rt bitmap.
> +	 * A clean unmount record will have the correct frextents value since
> +	 * there can be no other transactions running at that point.
> +	 *
> +	 * If we're mounting the rt volume after recovering the log, recompute
> +	 * frextents from the rtbitmap file to fix the inconsistency.
> +	 */
> +	if (xfs_has_realtime(mp) && xlog_recovery_needed(mp->m_log)) {
> +		error = xfs_rtalloc_reinit_frextents(mp);
> +		if (error)
> +			goto out_rsum;

			goto out_rele_summary;
> +	}

This is pretty much doing the same thing that
xfs_check_summary_counts() does for the data device. That is done
before we mount the rt inodes, but it could be done after this, too.

Does it make any sense to move xfs_check_summary_counts() and then
centralise all the summary counter reinitialisation in the one
place under the same runtime recovery context?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations
  2022-04-07 20:47 ` [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations Darrick J. Wong
@ 2022-04-07 23:17   ` Dave Chinner
  2022-04-07 23:45     ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-04-07 23:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 07, 2022 at 01:47:02PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> As mentioned in the previous commit, the kernel misuses sb_frextents in
> the incore mount to reflect both incore reservations made by running
> transactions and the actual count of free rt extents on disk.
> This results in the superblock being written to the log with an
> underestimate of the number of rt extents that are marked free in the
> rtbitmap.
> 
> Teaching XFS to recompute frextents after log recovery avoids
> operational problems in the current mount, but it doesn't solve the
> problem of us writing undercounted frextents which are then recovered by
> an older kernel that doesn't have that fix.
> 
> Create an incore percpu counter to mirror the ondisk frextents.  This
> new counter will track transaction reservations and the only time we
> will touch the incore super counter (i.e. the one that gets logged) is
> when those transactions commit updates to the rt bitmap.  This is in
> contrast to the lazysbcount counters (e.g. fdblocks), where we know that
> log recovery will always fix any incorrect counter that we log.
> As a bonus, we only take m_sb_lock at transaction commit time.

Again, the concept looks fine as does most of the code. Some
comments on the implementation below.

> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index c5f153c3693f..d5463728c305 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1183,17 +1183,37 @@ xfs_mod_frextents(
>  	struct xfs_mount	*mp,
>  	int64_t			delta)
>  {
> -	int64_t			lcounter;
> -	int			ret = 0;
> +	int			batch;
>  
> -	spin_lock(&mp->m_sb_lock);
> -	lcounter = mp->m_sb.sb_frextents + delta;
> -	if (lcounter < 0)
> -		ret = -ENOSPC;
> +	if (delta > 0) {
> +		percpu_counter_add(&mp->m_frextents, delta);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Taking blocks away, need to be more accurate the closer we
> +	 * are to zero.
> +	 *
> +	 * If the counter has a value of less than 2 * max batch size,
> +	 * then make everything serialise as we are real close to
> +	 * ENOSPC.
> +	 */
> +	if (__percpu_counter_compare(&mp->m_frextents, 2 * XFS_FDBLOCKS_BATCH,
> +				     XFS_FDBLOCKS_BATCH) < 0)
> +		batch = 1;
>  	else
> -		mp->m_sb.sb_frextents = lcounter;
> -	spin_unlock(&mp->m_sb_lock);
> -	return ret;
> +		batch = XFS_FDBLOCKS_BATCH;
> +
> +	percpu_counter_add_batch(&mp->m_frextents, delta, batch);
> +	if (__percpu_counter_compare(&mp->m_frextents, 0,
> +				     XFS_FDBLOCKS_BATCH) >= 0) {
> +		/* we had space! */
> +		return 0;
> +	}
> +
> +	/* oops, negative free space, put that back! */
> +	percpu_counter_add(&mp->m_frextents, -delta);
> +	return -ENOSPC;
>  }

Ok, so this looks like a copy-pasta of xfs_mod_fdblocks() with the
reservation pool stuff stripped. I'd kinda prefer to factor
xfs_mod_fdblocks() so that we aren't blowing out the instruction
cache footprint on this hot path - we're frequently modifying
both RT and fd block counts in the same transaction, so having them
run the same code would be good.

Something like:

int
xfs_mod_blocks(
	struct xfs_mount	*mp,
	struct percpu_counter	*pccnt,
	int64_t			delta,
	bool			use_resv_pool,
	bool			rsvd)
{
	int64_t                 lcounter;
	long long               res_used;
	s32                     batch;
	uint64_t                set_aside = 0;

	if (delta > 0) {
	       /*
		* If the reserve pool is depleted, put blocks back into it
		* first. Most of the time the pool is full.
		*/
	       if (likely(!use_resv_pool || mp->m_resblks == mp->m_resblks_avail)) {
		       percpu_counter_add(pccnt, delta);
		       return 0;
	       }

	       spin_lock(&mp->m_sb_lock);
	       res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);

	       if (res_used > delta) {
		       mp->m_resblks_avail += delta;
	       } else {
		       delta -= res_used;
		       mp->m_resblks_avail = mp->m_resblks;
		       percpu_counter_add(pccnt, delta);
	       }
	       spin_unlock(&mp->m_sb_lock);
	       return 0;
	}

	/*
	* Taking blocks away, need to be more accurate the closer we
	* are to zero.
	*
	* If the counter has a value of less than 2 * max batch size,
	* then make everything serialise as we are real close to
	* ENOSPC.
	*/
	if (__percpu_counter_compare(pccnt, 2 * XFS_FDBLOCKS_BATCH,
				    XFS_FDBLOCKS_BATCH) < 0)
	       batch = 1;
	else
	       batch = XFS_FDBLOCKS_BATCH;

	/*
	* Set aside allocbt blocks because these blocks are tracked as free
	* space but not available for allocation. Technically this means that a
	* single reservation cannot consume all remaining free space, but the
	* ratio of allocbt blocks to usable free blocks should be rather small.
	* The tradeoff without this is that filesystems that maintain high
	* perag block reservations can over reserve physical block availability
	* and fail physical allocation, which leads to much more serious
	* problems (i.e. transaction abort, pagecache discards, etc.) than
	* slightly premature -ENOSPC.
	*/
	if (use_resv_pool)
		set_aside = xfs_fdblocks_unavailable(mp);
	percpu_counter_add_batch(pccnt, delta, batch);
	if (__percpu_counter_compare(pccnt, set_aside,
				    XFS_FDBLOCKS_BATCH) >= 0) {
	       /* we had space! */
	       return 0;
	}

	/*
	* lock up the sb for dipping into reserves before releasing the space
	* that took us to ENOSPC.
	*/
	spin_lock(&mp->m_sb_lock);
	percpu_counter_add(pccnt, -delta);
	if (!use_resv_pool || !rsvd)
	       goto fdblocks_enospc;

	lcounter = (long long)mp->m_resblks_avail + delta;
	if (lcounter >= 0) {
	       mp->m_resblks_avail = lcounter;
	       spin_unlock(&mp->m_sb_lock);
	       return 0;
	}
	xfs_warn_once(mp,
"Reserve blocks depleted! Consider increasing reserve pool size.");

fdblocks_enospc:
	spin_unlock(&mp->m_sb_lock);
	return -ENOSPC;
}

And in the relevant header file:

int xfs_mod_blocks(struct xfs_mount *mp, struct percpu_counter *pccnt,
		int64_t delta, bool use_resv_pool, bool rsvd);

static inline int
xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta, bool rsvd)
{
	return xfs_mod_blocks(mp, &mp->m_fdblocks, delta, true, rsvd);
}

static inline int
xfs_mod_frextents(struct xfs_mount *mp, int64_t delta)
{
	return xfs_mod_blocks(mp, &mp->m_frextents, delta, false, false);
}

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 54be9d64093e..cc95768eb8e1 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -843,9 +843,11 @@ xfs_fs_statfs(
>  
>  	if (XFS_IS_REALTIME_MOUNT(mp) &&
>  	    (ip->i_diflags & (XFS_DIFLAG_RTINHERIT | XFS_DIFLAG_REALTIME))) {
> +		s64	freertx;
> +
>  		statp->f_blocks = sbp->sb_rblocks;
> -		statp->f_bavail = statp->f_bfree =
> -			sbp->sb_frextents * sbp->sb_rextsize;
> +		freertx = max_t(s64, 0, percpu_counter_sum(&mp->m_frextents));

percpu_counter_sum_positive()

>  	if (error)
>  		goto free_fdblocks;
>  
> +	error = percpu_counter_init(&mp->m_frextents, 0, GFP_KERNEL);
> +	if (error)
> +		goto free_delalloc;
> +
>  	return 0;
>  
> +free_delalloc:
> +	percpu_counter_destroy(&mp->m_delalloc_blks);
>  free_fdblocks:
>  	percpu_counter_destroy(&mp->m_fdblocks);
>  free_ifree:
> @@ -1033,6 +1041,7 @@ xfs_reinit_percpu_counters(
>  	percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
>  	percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
>  	percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
> +	percpu_counter_set(&mp->m_frextents, mp->m_sb.sb_frextents);
>  }
>  
>  static void
> @@ -1045,6 +1054,7 @@ xfs_destroy_percpu_counters(
>  	ASSERT(xfs_is_shutdown(mp) ||
>  	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
>  	percpu_counter_destroy(&mp->m_delalloc_blks);
> +	percpu_counter_destroy(&mp->m_frextents);
>  }
>  
>  static int
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 0ac717aad380..63a4d3a24340 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -498,10 +498,31 @@ xfs_trans_apply_sb_deltas(
>  			be64_add_cpu(&sbp->sb_fdblocks, tp->t_res_fdblocks_delta);
>  	}
>  
> -	if (tp->t_frextents_delta)
> -		be64_add_cpu(&sbp->sb_frextents, tp->t_frextents_delta);
> -	if (tp->t_res_frextents_delta)
> -		be64_add_cpu(&sbp->sb_frextents, tp->t_res_frextents_delta);
> +	/*
> +	 * Updating frextents requires careful handling because it does not
> +	 * behave like the lazysb counters: we cannot rely on log recovery
> +	 * in older kernels to recompute the value from the rtbitmap.
> +	 * This means that the ondisk frextents must be consistent with the
> +	 * rtbitmap.
> +	 *
> +	 * Therefore, log the frextents change to the ondisk superblock and
> +	 * update the incore superblock so that future calls to xfs_log_sb
> +	 * write the correct value ondisk.
> +	 *
> +	 * Don't touch m_frextents because it includes incore reservations,
> +	 * and those are handled by the unreserve function.
> +	 */
> +	if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
> +		struct xfs_mount	*mp = tp->t_mountp;
> +		int64_t			rtxdelta;
> +
> +		rtxdelta = tp->t_frextents_delta + tp->t_res_frextents_delta;
> +
> +		spin_lock(&mp->m_sb_lock);
> +		be64_add_cpu(&sbp->sb_frextents, rtxdelta);
> +		mp->m_sb.sb_frextents += rtxdelta;
> +		spin_unlock(&mp->m_sb_lock);
> +	}

Hmmmm.  This wants a comment in xfs_log_sb() to explain why we
aren't updating mp->m_sb.sb_frextents from mp->m_frextents like we
do with all the other per-cpu counters tracking resource usage.

>  
>  	if (tp->t_dblocks_delta) {
>  		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> @@ -614,7 +635,12 @@ xfs_trans_unreserve_and_mod_sb(
>  	if (ifreedelta)
>  		percpu_counter_add(&mp->m_ifree, ifreedelta);
>  
> -	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
> +	if (rtxdelta) {
> +		error = xfs_mod_frextents(mp, rtxdelta);
> +		ASSERT(!error);
> +	}
> +
> +	if (!(tp->t_flags & XFS_TRANS_SB_DIRTY))
>  		return;
>  
>  	/* apply remaining deltas */
> @@ -622,7 +648,6 @@ xfs_trans_unreserve_and_mod_sb(
>  	mp->m_sb.sb_fdblocks += tp->t_fdblocks_delta + tp->t_res_fdblocks_delta;
>  	mp->m_sb.sb_icount += idelta;
>  	mp->m_sb.sb_ifree += ifreedelta;
> -	mp->m_sb.sb_frextents += rtxdelta;

This makes my head hurt trying to work out if this is necessary or
not. (the lazy sb stuff in these functions has always strained my
cognitive abilities, even though I wrote it in the first place!)

A comment explaining why we don't need to update
mp->m_sb.sb_frextents when XFS_TRANS_SB_DIRTY is set would be useful
in the above if (rtxdelta) update above.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] xfs: recalculate free rt extents after log recovery
  2022-04-07 21:56   ` Dave Chinner
@ 2022-04-07 23:39     ` Darrick J. Wong
  2022-04-08  0:06       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2022-04-07 23:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Apr 08, 2022 at 07:56:05AM +1000, Dave Chinner wrote:
> On Thu, Apr 07, 2022 at 01:46:56PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > I've been observing periodic corruption reports from xfs_scrub involving
> > the free rt extent counter (frextents) while running xfs/141.  That test
> > uses an error injection knob to induce a torn write to the log; after
> > an arbitrary number of recovery mounts, frextents will count fewer free
> > rt extents than can be found in the rtbitmap.
> > 
> > The root cause of the problem is the misuse of sb_frextents in the
> > incore mount to reflect both incore reservations made by running
> > transactions and the actual count of free rt extents on disk.  The
> > following sequence can reproduce the undercount:
> > 
> > Thread 1			Thread 2
> > xfs_trans_alloc(rtextents=3)
> > xfs_mod_frextents(-3)
> > <blocks>
> > 				xfs_attr_set()
> > 				xfs_bmap_attr_addfork()
> > 				xfs_add_attr2()
> > 				xfs_log_sb()
> > 				xfs_sb_to_disk()
> > 				xfs_trans_commit()
> > <log flushed to disk>
> > <log goes down>
> > 
> > Note that thread 1 subtracts 3 from sb_frextents even though it never
> > commits to using that space.  Thread 2 writes the undercounted value to
> > the ondisk superblock and logs it to the xattr transaction, which is
> > then flushed to disk.  At next mount, log recovery will find the logged
> > superblock and write that back into the filesystem.  At the end of log
> > recovery, we reread the superblock and install the recovered
> > undercounted frextents value into the incore superblock.  From that
> > point on, we've effectively leaked thread 1's transaction reservation.
> > 
> > The correct fix for this is to separate the incore reservation from the
> > ondisk usage, but that's a matter for the next patch.  Because the
> > kernel has been logging superblocks with undercounted frextents for a
> > very long time and we don't demand that sysadmins run xfs_repair after a
> > crash, fix the undercount by recomputing frextents after log recovery.
> > 
> > Gating this on log recovery is a reasonable balance (I think) between
> > correcting the problem and slowing down every mount attempt.  Note that
> > xfs_repair will fix undercounted frextents.
> 
> That seems reasonable to me. I agree with the method used to fix the
> problem, but I have some concerns/questions over where/how it is
> implemented.
> 
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_rtalloc.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 64 insertions(+), 4 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> > index b8c79ee791af..c9ab4f333472 100644
> > --- a/fs/xfs/xfs_rtalloc.c
> > +++ b/fs/xfs/xfs_rtalloc.c
> > @@ -19,6 +19,7 @@
> >  #include "xfs_icache.h"
> >  #include "xfs_rtalloc.h"
> >  #include "xfs_sb.h"
> > +#include "xfs_log_priv.h"
> >  
> >  /*
> >   * Read and return the summary information for a given extent size,
> > @@ -1284,6 +1285,43 @@ xfs_rtmount_init(
> >  	return 0;
> >  }
> >  
> > +static inline int
> > +xfs_rtalloc_count_frextent(
> > +	struct xfs_trans		*tp,
> > +	const struct xfs_rtalloc_rec	*rec,
> > +	void				*priv)
> > +{
> > +	uint64_t			*valp = priv;
> > +
> > +	*valp += rec->ar_extcount;
> > +	return 0;
> > +}
> > +
> > +/* Reinitialize the number of free realtime extents from the realtime bitmap. */
> > +STATIC int
> > +xfs_rtalloc_reinit_frextents(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_trans	*tp;
> > +	uint64_t		val = 0;
> > +	int			error;
> > +
> > +	error = xfs_trans_alloc_empty(mp, &tp);
> > +	if (error)
> > +		return error;
> 
> At this point in the mount, the transaction subsystem is not
> really available for use. I understand that this is an empty
> transaction and it doesn't reserve log space, but I don't like
> creating an implicit requirement that xfs_trans_alloc_empty() never
> touches log state for this code to work correctly into the future.

<nod> I'd wondered if it was really a good idea to be creating a
transaction at the "end" of log recovery.  Recovery itself does it, but
at least that's a carefully controlled box...

> That is, the log isn't in a state where we can run normal
> transactions because we can still have pending intent recovery
> queued in the AIL. These can't be moved from the tail of the AIL
> until xfs_log_mount_finish() is called to remove and process them.
> They are dependent on there being enough log space remaining for the
> transaction reservation they make to succeed and there may only be
> just enough space for the first intent to make that reservation.
> Hence if we consume any log space before we recover intents, we can
> deadlock on log space when then trying to recover the intents. And
> because the AIL can't push on intents, the same reservation deadlock
> could occur with any transaction reservation we try to run before
> xfs_log_mount_finish() has been run.
> 
> Yes, I know this is an empty transaction and so it doesn't push on
> the log or consume any space. But I think it's a bad precedent even
> to use transactions at all when we know we are in the middle of log
> recovery and the transaction subsystem is not open for general use
> yet.  Hence I think using transaction handles - even empty ones -
> before xfs_log_mount_finish() is run is a pattern we want to avoid
> if at all possible.

...yes...

> > +
> > +	xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL);
> > +	error = xfs_rtalloc_query_all(tp, xfs_rtalloc_count_frextent, &val);
> 
> Looking at the code here, AFAICT the only thing the tp is absolutely
> required for here is to hold the mount pointer - all the internal
> bitmap walk operations grab and release buffers using interfaces
> that can handle a null tp being passed.
> 
> Hence it looks to me like a transaction context isn't necessary for
> this read-only bitmap walk. Is that correct?  Converting
> xfs_rtalloc_query_all() to take a mount and a trans would also make
> this code becomes a lot simpler....

...and I was about to say "Yeah, we could make xfs_rtalloc_query_all
take the mount and an optional transaction", but it occurred to me:

The rtalloc queries need to read the rtbitmap file's data fork mappings
into memory to find the blocks.  If the rtbitmap is fragmented enough
that the data fork is in btree format and the bmbt contains an upward
cycle, loading the bmbt will livelock the mount attempt.

Granted, xfs_bmapi_read doesn't take a transaction, so this means that
we'd either have to change it to take one, or we'd have to change the
rtmount code to create an empty transaction and call iread_extents to
detect a bmbt cycle.

So yeah, I think the answer to your question is "yes", but then there's
this other issue... :(

> 
> > +	xfs_iunlock(mp->m_rbmip, XFS_ILOCK_EXCL);
> > +	xfs_trans_cancel(tp);
> > +	if (error)
> > +		return error;
> > +
> > +	spin_lock(&mp->m_sb_lock);
> > +	mp->m_sb.sb_frextents = val;
> > +	spin_unlock(&mp->m_sb_lock);
> > +	return 0;
> > +}
> >  /*
> >   * Get the bitmap and summary inodes and the summary cache into the mount
> >   * structure at mount time.
> > @@ -1302,13 +1340,35 @@ xfs_rtmount_inodes(
> >  	ASSERT(mp->m_rbmip != NULL);
> >  
> >  	error = xfs_iget(mp, NULL, sbp->sb_rsumino, 0, 0, &mp->m_rsumip);
> > -	if (error) {
> > -		xfs_irele(mp->m_rbmip);
> > -		return error;
> > -	}
> > +	if (error)
> > +		goto out_rbm;
> 
> 		goto out_rele_bitmap;
> 
> >  	ASSERT(mp->m_rsumip != NULL);
> > +
> > +	/*
> > +	 * Older kernels misused sb_frextents to reflect both incore
> > +	 * reservations made by running transactions and the actual count of
> > +	 * free rt extents in the ondisk metadata.  Transactions committed
> > +	 * during runtime can therefore contain a superblock update that
> > +	 * undercounts the number of free rt extents tracked in the rt bitmap.
> > +	 * A clean unmount record will have the correct frextents value since
> > +	 * there can be no other transactions running at that point.
> > +	 *
> > +	 * If we're mounting the rt volume after recovering the log, recompute
> > +	 * frextents from the rtbitmap file to fix the inconsistency.
> > +	 */
> > +	if (xfs_has_realtime(mp) && xlog_recovery_needed(mp->m_log)) {
> > +		error = xfs_rtalloc_reinit_frextents(mp);
> > +		if (error)
> > +			goto out_rsum;
> 
> 			goto out_rele_summary;
> > +	}
> 
> This is pretty much doing the same thing that
> xfs_check_summary_counts() does for the data device. That is done
> before we mount the rt inodes, but it could be done after this, too.
> 
> Does it make any sense to move xfs_check_summary_counts() and then
> centralise all the summary counter reinitialisation in the one
> place under the same runtime recovery context?

Yes, that would be a nice hoist for the end of the patchset.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations
  2022-04-07 23:17   ` Dave Chinner
@ 2022-04-07 23:45     ` Darrick J. Wong
  2022-04-08  0:12       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2022-04-07 23:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Apr 08, 2022 at 09:17:25AM +1000, Dave Chinner wrote:
> On Thu, Apr 07, 2022 at 01:47:02PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > As mentioned in the previous commit, the kernel misuses sb_frextents in
> > the incore mount to reflect both incore reservations made by running
> > transactions as well as the actual count of free rt extents on disk.
> > This results in the superblock being written to the log with an
> > underestimate of the number of rt extents that are marked free in the
> > rtbitmap.
> > 
> > Teaching XFS to recompute frextents after log recovery avoids
> > operational problems in the current mount, but it doesn't solve the
> > problem of us writing undercounted frextents which are then recovered by
> > an older kernel that doesn't have that fix.
> > 
> > Create an incore percpu counter to mirror the ondisk frextents.  This
> > new counter will track transaction reservations and the only time we
> > will touch the incore super counter (i.e the one that gets logged) is
> > when those transactions commit updates to the rt bitmap.  This is in
> > contrast to the lazysbcount counters (e.g. fdblocks), where we know that
> > log recovery will always fix any incorrect counter that we log.
> > As a bonus, we only take m_sb_lock at transaction commit time.
> 
> Again, the concept looks fine as does most of the code. Some
> comments on the implementation below.
> 
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index c5f153c3693f..d5463728c305 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -1183,17 +1183,37 @@ xfs_mod_frextents(
> >  	struct xfs_mount	*mp,
> >  	int64_t			delta)
> >  {
> > -	int64_t			lcounter;
> > -	int			ret = 0;
> > +	int			batch;
> >  
> > -	spin_lock(&mp->m_sb_lock);
> > -	lcounter = mp->m_sb.sb_frextents + delta;
> > -	if (lcounter < 0)
> > -		ret = -ENOSPC;
> > +	if (delta > 0) {
> > +		percpu_counter_add(&mp->m_frextents, delta);
> > +		return 0;
> > +	}
> > +
> > +	/*
> > +	 * Taking blocks away, need to be more accurate the closer we
> > +	 * are to zero.
> > +	 *
> > +	 * If the counter has a value of less than 2 * max batch size,
> > +	 * then make everything serialise as we are real close to
> > +	 * ENOSPC.
> > +	 */
> > +	if (__percpu_counter_compare(&mp->m_frextents, 2 * XFS_FDBLOCKS_BATCH,
> > +				     XFS_FDBLOCKS_BATCH) < 0)
> > +		batch = 1;
> >  	else
> > -		mp->m_sb.sb_frextents = lcounter;
> > -	spin_unlock(&mp->m_sb_lock);
> > -	return ret;
> > +		batch = XFS_FDBLOCKS_BATCH;
> > +
> > +	percpu_counter_add_batch(&mp->m_frextents, delta, batch);
> > +	if (__percpu_counter_compare(&mp->m_frextents, 0,
> > +				     XFS_FDBLOCKS_BATCH) >= 0) {
> > +		/* we had space! */
> > +		return 0;
> > +	}
> > +
> > +	/* oops, negative free space, put that back! */
> > +	percpu_counter_add(&mp->m_frextents, -delta);
> > +	return -ENOSPC;
> >  }
> 
> Ok, so this looks like a copy-pasta of xfs_mod_fdblocks() with the
> reservation pool stuff stripped. I'd kinda prefer to factor
> xfs_mod_fdblocks() so that we aren't blowing out the instruction
> cache footprint on this hot path - we're frequently modifying
> both RT and fd block counts in the same transaction, so having them
> run the same code would be good.
> 
> Something like:
> 
> int
> xfs_mod_blocks(
> 	struct xfs_mount	*mp,
> 	struct percpu_counter	*pccnt,
> 	int64_t			delta,
> 	bool			use_resv_pool,
> 	bool			rsvd)
> {
> 	int64_t                 lcounter;
> 	long long               res_used;
> 	s32                     batch;
> 	uint64_t                set_aside = 0;
> 
> 	if (delta > 0) {
> 	       /*
> 		* If the reserve pool is depleted, put blocks back into it
> 		* first. Most of the time the pool is full.
> 		*/
> 	       if (likely(!use_resv_pool || mp->m_resblks == mp->m_resblks_avail)) {
> 		       percpu_counter_add(pccnt, delta);
> 		       return 0;
> 	       }
> 
> 	       spin_lock(&mp->m_sb_lock);
> 	       res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);
> 
> 	       if (res_used > delta) {
> 		       mp->m_resblks_avail += delta;
> 	       } else {
> 		       delta -= res_used;
> 		       mp->m_resblks_avail = mp->m_resblks;
> 		       percpu_counter_add(pccnt, delta);
> 	       }
> 	       spin_unlock(&mp->m_sb_lock);
> 	       return 0;
> 	}
> 
> 	/*
> 	* Taking blocks away, need to be more accurate the closer we
> 	* are to zero.
> 	*
> 	* If the counter has a value of less than 2 * max batch size,
> 	* then make everything serialise as we are real close to
> 	* ENOSPC.
> 	*/
> 	if (__percpu_counter_compare(pccnt, 2 * XFS_FDBLOCKS_BATCH,
> 				    XFS_FDBLOCKS_BATCH) < 0)
> 	       batch = 1;
> 	else
> 	       batch = XFS_FDBLOCKS_BATCH;
> 
> 	/*
> 	* Set aside allocbt blocks because these blocks are tracked as free
> 	* space but not available for allocation. Technically this means that a
> 	* single reservation cannot consume all remaining free space, but the
> 	* ratio of allocbt blocks to usable free blocks should be rather small.
> 	* The tradeoff without this is that filesystems that maintain high
> 	* perag block reservations can over reserve physical block availability
> 	* and fail physical allocation, which leads to much more serious
> 	* problems (i.e. transaction abort, pagecache discards, etc.) than
> 	* slightly premature -ENOSPC.
> 	*/
> 	if (use_resv_pool)
> 		set_aside = xfs_fdblocks_unavailable(mp);
> 	percpu_counter_add_batch(pccnt, delta, batch);
> 	if (__percpu_counter_compare(pccnt, set_aside,
> 				    XFS_FDBLOCKS_BATCH) >= 0) {
> 	       /* we had space! */
> 	       return 0;
> 	}
> 
> 	/*
> 	* lock up the sb for dipping into reserves before releasing the space
> 	* that took us to ENOSPC.
> 	*/
> 	spin_lock(&mp->m_sb_lock);
> 	percpu_counter_add(pccnt, -delta);
> 	if (!use_resv_pool || !rsvd)
> 	       goto fdblocks_enospc;
> 
> 	lcounter = (long long)mp->m_resblks_avail + delta;
> 	if (lcounter >= 0) {
> 	       mp->m_resblks_avail = lcounter;
> 	       spin_unlock(&mp->m_sb_lock);
> 	       return 0;
> 	}
> 	xfs_warn_once(mp,
> "Reserve blocks depleted! Consider increasing reserve pool size.");
> 
> fdblocks_enospc:
> 	spin_unlock(&mp->m_sb_lock);
> 	return -ENOSPC;
> }
> 
> And in the relevant header file:
> 
> int xfs_mod_blocks(struct xfs_mount *mp, struct percpu_counter *pccnt,
> 		int64_t delta, bool use_resv_pool, bool rsvd);
> 
> static inline int
> xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta, bool rsvd)
> {
> 	return xfs_mod_blocks(mp, &mp->m_fdblocks, delta, true, rsvd);
> }
> 
> static inline int
> xfs_mod_frextents(struct xfs_mount *mp, int64_t delta)
> {
> 	return xfs_mod_blocks(mp, &mp->m_frextents, delta, false, false);
> }

Looks good to me.

> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 54be9d64093e..cc95768eb8e1 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -843,9 +843,11 @@ xfs_fs_statfs(
> >  
> >  	if (XFS_IS_REALTIME_MOUNT(mp) &&
> >  	    (ip->i_diflags & (XFS_DIFLAG_RTINHERIT | XFS_DIFLAG_REALTIME))) {
> > +		s64	freertx;
> > +
> >  		statp->f_blocks = sbp->sb_rblocks;
> > -		statp->f_bavail = statp->f_bfree =
> > -			sbp->sb_frextents * sbp->sb_rextsize;
> > +		freertx = max_t(s64, 0, percpu_counter_sum(&mp->m_frextents));
> 
> percpu_counter_sum_positive()

Ok.

> >  	if (error)
> >  		goto free_fdblocks;
> >  
> > +	error = percpu_counter_init(&mp->m_frextents, 0, GFP_KERNEL);
> > +	if (error)
> > +		goto free_delalloc;
> > +
> >  	return 0;
> >  
> > +free_delalloc:
> > +	percpu_counter_destroy(&mp->m_delalloc_blks);
> >  free_fdblocks:
> >  	percpu_counter_destroy(&mp->m_fdblocks);
> >  free_ifree:
> > @@ -1033,6 +1041,7 @@ xfs_reinit_percpu_counters(
> >  	percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
> >  	percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
> >  	percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
> > +	percpu_counter_set(&mp->m_frextents, mp->m_sb.sb_frextents);
> >  }
> >  
> >  static void
> > @@ -1045,6 +1054,7 @@ xfs_destroy_percpu_counters(
> >  	ASSERT(xfs_is_shutdown(mp) ||
> >  	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
> >  	percpu_counter_destroy(&mp->m_delalloc_blks);
> > +	percpu_counter_destroy(&mp->m_frextents);
> >  }
> >  
> >  static int
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 0ac717aad380..63a4d3a24340 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -498,10 +498,31 @@ xfs_trans_apply_sb_deltas(
> >  			be64_add_cpu(&sbp->sb_fdblocks, tp->t_res_fdblocks_delta);
> >  	}
> >  
> > -	if (tp->t_frextents_delta)
> > -		be64_add_cpu(&sbp->sb_frextents, tp->t_frextents_delta);
> > -	if (tp->t_res_frextents_delta)
> > -		be64_add_cpu(&sbp->sb_frextents, tp->t_res_frextents_delta);
> > +	/*
> > +	 * Updating frextents requires careful handling because it does not
> > +	 * behave like the lazysb counters because we cannot rely on log
> > +	 * recovery in older kenels to recompute the value from the rtbitmap.
> > +	 * This means that the ondisk frextents must be consistent with the
> > +	 * rtbitmap.
> > +	 *
> > +	 * Therefore, log the frextents change to the ondisk superblock and
> > +	 * update the incore superblock so that future calls to xfs_log_sb
> > +	 * write the correct value ondisk.
> > +	 *
> > +	 * Don't touch m_frextents because it includes incore reservations,
> > +	 * and those are handled by the unreserve function.
> > +	 */
> > +	if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
> > +		struct xfs_mount	*mp = tp->t_mountp;
> > +		int64_t			rtxdelta;
> > +
> > +		rtxdelta = tp->t_frextents_delta + tp->t_res_frextents_delta;
> > +
> > +		spin_lock(&mp->m_sb_lock);
> > +		be64_add_cpu(&sbp->sb_frextents, rtxdelta);
> > +		mp->m_sb.sb_frextents += rtxdelta;
> > +		spin_unlock(&mp->m_sb_lock);
> > +	}
> 
> Hmmmm.  This wants a comment in xfs_log_sb() to explain why we
> aren't updating mp->m_sb.sb_frextents from mp->m_frextents like we
> do with all the other per-cpu counters tracking resource usage.

Ok.  How about this for xfs_log_sb:

/*
 * Do not update sb_frextents here because it is not part of the lazy sb
 * counters (despite having a percpu counter) and therefore must be
 * consistent with the ondisk rtbitmap.
 */

> >  
> >  	if (tp->t_dblocks_delta) {
> >  		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> > @@ -614,7 +635,12 @@ xfs_trans_unreserve_and_mod_sb(
> >  	if (ifreedelta)
> >  		percpu_counter_add(&mp->m_ifree, ifreedelta);
> >  
> > -	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
> > +	if (rtxdelta) {
> > +		error = xfs_mod_frextents(mp, rtxdelta);
> > +		ASSERT(!error);
> > +	}
> > +
> > +	if (!(tp->t_flags & XFS_TRANS_SB_DIRTY))
> >  		return;
> >  
> >  	/* apply remaining deltas */
> > @@ -622,7 +648,6 @@ xfs_trans_unreserve_and_mod_sb(
> >  	mp->m_sb.sb_fdblocks += tp->t_fdblocks_delta + tp->t_res_fdblocks_delta;
> >  	mp->m_sb.sb_icount += idelta;
> >  	mp->m_sb.sb_ifree += ifreedelta;
> > -	mp->m_sb.sb_frextents += rtxdelta;
> 
> This makes my head hurt trying to work out if this is necessary or
> not. (the lazy sb stuff in these functions has always strained my
> cognitive abilities, even though I wrote it in the first place!)
> 
> A comment explaining why we don't need to update
> mp->m_sb.sb_frextents when XFS_TRANS_SB_DIRTY is set would be useful
> in the above if (rtxdelta) update above.

And how about this?

/*
 * Do not touch sb_frextents here because we are dealing with incore
 * reservation.  sb_frextents is not part of the lazy sb counters so it
 * must be consistent with the ondisk rtbitmap and must never include
 * incore reservations.
 */

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 1/2] xfs: recalculate free rt extents after log recovery
  2022-04-07 23:39     ` Darrick J. Wong
@ 2022-04-08  0:06       ` Dave Chinner
  2022-04-08 17:42         ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-04-08  0:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 07, 2022 at 04:39:41PM -0700, Darrick J. Wong wrote:
> On Fri, Apr 08, 2022 at 07:56:05AM +1000, Dave Chinner wrote:
> > On Thu, Apr 07, 2022 at 01:46:56PM -0700, Darrick J. Wong wrote:
> > > @@ -1284,6 +1285,43 @@ xfs_rtmount_init(
> > >  	return 0;
> > >  }
> > >  
> > > +static inline int
> > > +xfs_rtalloc_count_frextent(
> > > +	struct xfs_trans		*tp,
> > > +	const struct xfs_rtalloc_rec	*rec,
> > > +	void				*priv)
> > > +{
> > > +	uint64_t			*valp = priv;
> > > +
> > > +	*valp += rec->ar_extcount;
> > > +	return 0;
> > > +}
> > > +
> > > +/* Reinitialize the number of free realtime extents from the realtime bitmap. */
> > > +STATIC int
> > > +xfs_rtalloc_reinit_frextents(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	struct xfs_trans	*tp;
> > > +	uint64_t		val = 0;
> > > +	int			error;
> > > +
> > > +	error = xfs_trans_alloc_empty(mp, &tp);
> > > +	if (error)
> > > +		return error;
> > 
> > At this point in the mount, the transaction subsystem is not
> > really available for use. I understand that this is an empty
> > transaction and it doesn't reserve log space, but I don't like
> > creating an implicit requirement that xfs_trans_alloc_empty() never
> > touches log state for this code to work correctly into the future.
> 
> <nod> I'd wondered if it was really a good idea to be creating a
> transaction at the "end" of log recovery.  Recovery itself does it, but
> at least that's a carefully controlled box...
> 
> > That is, the log isn't in a state where we can run normal
> > transactions because we can still have pending intent recovery
> > queued in the AIL. These can't be moved from the tail of the AIL
> > until xfs_log_mount_finish() is called to remove and process them.
> > They are dependent on there being enough log space remaining for the
> > transaction reservation they make to succeed and there may only be
> > just enough space for the first intent to make that reservation.
> > Hence if we consume any log space before we recover intents, we can
> > deadlock on log space when then trying to recover the intents. And
> > because the AIL can't push on intents, the same reservation deadlock
> > could occur with any transaction reservation we try to run before
> > xfs_log_mount_finish() has been run.
> > 
> > Yes, I know this is an empty transaction and so it doesn't push on
> > the log or consume any space. But I think it's a bad precedent even
> > to use transactions at all when we know we are in the middle of log
> > recovery and the transaction subsystem is not open for general use
> > yet.  Hence I think using transaction handles - even empty ones -
> > before xfs_log_mount_finish() is run is a pattern we want to avoid
> > if at all possible.
> 
> ...yes...
> 
> > > +
> > > +	xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL);
> > > +	error = xfs_rtalloc_query_all(tp, xfs_rtalloc_count_frextent, &val);
> > 
> > Looking at the code here, AFAICT the only thing the tp is absolutely
> > required for here is to hold the mount pointer - all the internal
> > bitmap walk operations grab and release buffers using interfaces
> > that can handle a null tp being passed.
> > 
> > Hence it looks to me like a transaction context isn't necessary for
> > this read-only bitmap walk. Is that correct?  Converting
> > xfs_rtalloc_query_all() to take a mount and a trans would also make
> > this code becomes a lot simpler....
> 
> ...and I was about to say "Yeah, we could make xfs_rtalloc_query_all
> take the mount and an optional transaction", but it occurred to me:
> 
> The rtalloc queries need to read the rtbitmap file's data fork mappings
> into memory to find the blocks.  If the rtbitmap is fragmented enough
> that the data fork is in btree format and the bmbt contains an upward
> cycle, loading the bmbt will livelock the mount attempt.

"upward cycle", as in a corrupted btree?

> Granted, xfs_bmapi_read doesn't take a transaction, so this means that
> we'd either have to change it to take one, or we'd have to change the
> rtmount code to create an empty transaction and call iread_extents to
> detect a bmbt cycle.


> So yeah, I think the answer to your question is "yes", but then there's
> this other issue... :(

Yup, but keep in mind that:

xfs_fs_reserve_ag_blocks(tp == NULL)
  xfs_ag_resv_init
    xfs_finobt_calc_reserves
      xfs_inobt_count_blocks
        xfs_btree_count_blocks

Will do a finobt btree walk without a transaction pointer if
we don't have finobt block counts recorded in the AGI. So it
seems to me that we are already at risk of cycles in corrupted
btrees being able to cause mount hangs.

Hence from that perspective, I think worrying about a corrupted
rtbitmap BMBT btree is largely outside the scope of this patchset;
it's a generic issue with transaction-less btree walks, and one we
are already exposed to in the same phase of the mount/log recovery
process. Hence I think we should consider how to solve this problem
seperately rather that tie this patchset into twisty knots trying to
deal with here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations
  2022-04-07 23:45     ` Darrick J. Wong
@ 2022-04-08  0:12       ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2022-04-08  0:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 07, 2022 at 04:45:28PM -0700, Darrick J. Wong wrote:
> On Fri, Apr 08, 2022 at 09:17:25AM +1000, Dave Chinner wrote:
> > On Thu, Apr 07, 2022 at 01:47:02PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -498,10 +498,31 @@ xfs_trans_apply_sb_deltas(
> > >  			be64_add_cpu(&sbp->sb_fdblocks, tp->t_res_fdblocks_delta);
> > >  	}
> > >  
> > > -	if (tp->t_frextents_delta)
> > > -		be64_add_cpu(&sbp->sb_frextents, tp->t_frextents_delta);
> > > -	if (tp->t_res_frextents_delta)
> > > -		be64_add_cpu(&sbp->sb_frextents, tp->t_res_frextents_delta);
> > > +	/*
> > > +	 * Updating frextents requires careful handling because it does not
> > > +	 * behave like the lazysb counters: we cannot rely on log recovery
> > > +	 * in older kernels to recompute the value from the rtbitmap.
> > > +	 * This means that the ondisk frextents must be consistent with the
> > > +	 * rtbitmap.
> > > +	 *
> > > +	 * Therefore, log the frextents change to the ondisk superblock and
> > > +	 * update the incore superblock so that future calls to xfs_log_sb
> > > +	 * write the correct value ondisk.
> > > +	 *
> > > +	 * Don't touch m_frextents because it includes incore reservations,
> > > +	 * and those are handled by the unreserve function.
> > > +	 */
> > > +	if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
> > > +		struct xfs_mount	*mp = tp->t_mountp;
> > > +		int64_t			rtxdelta;
> > > +
> > > +		rtxdelta = tp->t_frextents_delta + tp->t_res_frextents_delta;
> > > +
> > > +		spin_lock(&mp->m_sb_lock);
> > > +		be64_add_cpu(&sbp->sb_frextents, rtxdelta);
> > > +		mp->m_sb.sb_frextents += rtxdelta;
> > > +		spin_unlock(&mp->m_sb_lock);
> > > +	}
> > 
> > Hmmmm.  This wants a comment in xfs_log_sb() to explain why we
> > aren't updating mp->m_sb.sb_frextents from mp->m_frextents like we
> > do with all the other per-cpu counters tracking resource usage.
> 
> Ok.  How about this for xfs_log_sb:
> 
> /*
>  * Do not update sb_frextents here because it is not part of the lazy sb
>  * counters (despite having a percpu counter) and therefore must be
>  * consistent with the ondisk rtbitmap.
>  */

Good! But I think we can do better. :)

/*
 * Do not update sb_frextents here because it is not part of the
 * lazy sb counters (despite having a percpu counter). It is always
 * kept consistent with the ondisk rtbitmap by
 * xfs_trans_apply_sb_deltas() and hence we don't have to
 * update it here.
 */

> > >  
> > >  	if (tp->t_dblocks_delta) {
> > >  		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> > > @@ -614,7 +635,12 @@ xfs_trans_unreserve_and_mod_sb(
> > >  	if (ifreedelta)
> > >  		percpu_counter_add(&mp->m_ifree, ifreedelta);
> > >  
> > > -	if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY))
> > > +	if (rtxdelta) {
> > > +		error = xfs_mod_frextents(mp, rtxdelta);
> > > +		ASSERT(!error);
> > > +	}
> > > +
> > > +	if (!(tp->t_flags & XFS_TRANS_SB_DIRTY))
> > >  		return;
> > >  
> > >  	/* apply remaining deltas */
> > > @@ -622,7 +648,6 @@ xfs_trans_unreserve_and_mod_sb(
> > >  	mp->m_sb.sb_fdblocks += tp->t_fdblocks_delta + tp->t_res_fdblocks_delta;
> > >  	mp->m_sb.sb_icount += idelta;
> > >  	mp->m_sb.sb_ifree += ifreedelta;
> > > -	mp->m_sb.sb_frextents += rtxdelta;
> > 
> > This makes my head hurt trying to work out if this is necessary or
> > not. (the lazy sb stuff in these functions has always strained my
> > cognitive abilities, even though I wrote it in the first place!)
> > 
> > A comment explaining why we don't need to update
> > mp->m_sb.sb_frextents when XFS_TRANS_SB_DIRTY is set would be useful
> > in the above if (rtxdelta) update above.
> 
> And how about this?
> 
> /*
>  * Do not touch sb_frextents here because we are dealing with incore
>  * reservation.  sb_frextents is not part of the lazy sb counters so it
>  * must be consistent with the ondisk rtbitmap and must never include
>  * incore reservations.
>  */

Yup, makes sense :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] xfs: recalculate free rt extents after log recovery
  2022-04-08  0:06       ` Dave Chinner
@ 2022-04-08 17:42         ` Darrick J. Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2022-04-08 17:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Apr 08, 2022 at 10:06:05AM +1000, Dave Chinner wrote:
> On Thu, Apr 07, 2022 at 04:39:41PM -0700, Darrick J. Wong wrote:
> > On Fri, Apr 08, 2022 at 07:56:05AM +1000, Dave Chinner wrote:
> > > On Thu, Apr 07, 2022 at 01:46:56PM -0700, Darrick J. Wong wrote:
> > > > @@ -1284,6 +1285,43 @@ xfs_rtmount_init(
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static inline int
> > > > +xfs_rtalloc_count_frextent(
> > > > +	struct xfs_trans		*tp,
> > > > +	const struct xfs_rtalloc_rec	*rec,
> > > > +	void				*priv)
> > > > +{
> > > > +	uint64_t			*valp = priv;
> > > > +
> > > > +	*valp += rec->ar_extcount;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/* Reinitialize the number of free realtime extents from the realtime bitmap. */
> > > > +STATIC int
> > > > +xfs_rtalloc_reinit_frextents(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	struct xfs_trans	*tp;
> > > > +	uint64_t		val = 0;
> > > > +	int			error;
> > > > +
> > > > +	error = xfs_trans_alloc_empty(mp, &tp);
> > > > +	if (error)
> > > > +		return error;
> > > 
> > > At this point in the mount, the transaction subsystem is not
> > > really available for use. I understand that this is an empty
> > > transaction and it doesn't reserve log space, but I don't like
> > > creating an implicit requirement that xfs_trans_alloc_empty() never
> > > touches log state for this code to work correctly into the future.
> > 
> > <nod> I'd wondered if it was really a good idea to be creating a
> > transaction at the "end" of log recovery.  Recovery itself does it, but
> > at least that's a carefully controlled box...
> > 
> > > That is, the log isn't in a state where we can run normal
> > > transactions because we can still have pending intent recovery
> > > queued in the AIL. These can't be moved from the tail of the AIL
> > > until xfs_log_mount_finish() is called to remove and process them.
> > > They are dependent on there being enough log space remaining for the
> > > transaction reservation they make to succeed and there may only be
> > > just enough space for the first intent to make that reservation.
> > > Hence if we consume any log space before we recover intents, we can
> > > deadlock on log space when then trying to recover the intents. And
> > > because the AIL can't push on intents, the same reservation deadlock
> > > could occur with any transaction reservation we try to run before
> > > xfs_log_mount_finish() has been run.
> > > 
> > > Yes, I know this is an empty transaction and so it doesn't push on
> > > the log or consume any space. But I think it's a bad precedent even
> > > to use transactions at all when we know we are in the middle of log
> > > recovery and the transaction subsystem is not open for general use
> > > yet.  Hence I think using transaction handles - even empty ones -
> > > before xfs_log_mount_finish() is run is a pattern we want to avoid
> > > if at all possible.
> > 
> > ...yes...
> > 
> > > > +
> > > > +	xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL);
> > > > +	error = xfs_rtalloc_query_all(tp, xfs_rtalloc_count_frextent, &val);
> > > 
> > > Looking at the code here, AFAICT the only thing the tp is absolutely
> > > required for here is to hold the mount pointer - all the internal
> > > bitmap walk operations grab and release buffers using interfaces
> > > that can handle a null tp being passed.
> > > 
> > > Hence it looks to me like a transaction context isn't necessary for
> > > this read-only bitmap walk. Is that correct?  Converting
> > > xfs_rtalloc_query_all() to take a mount and a trans would also make
> > > this code becomes a lot simpler....
> > 
> > ...and I was about to say "Yeah, we could make xfs_rtalloc_query_all
> > take the mount and an optional transaction", but it occurred to me:
> > 
> > The rtalloc queries need to read the rtbitmap file's data fork mappings
> > into memory to find the blocks.  If the rtbitmap is fragmented enough
> > that the data fork is in btree format and the bmbt contains an upward
> > cycle, loading the bmbt will livelock the mount attempt.
> 
> "upward cycle", as in a corrupted btree?

Yes.

> > Granted, xfs_bmapi_read doesn't take a transaction, so this means that
> > we'd either have to change it to take one, or we'd have to change the
> > rtmount code to create an empty transaction and call iread_extents to
> > detect a bmbt cycle.
> 
> > So yeah, I think the answer to your question is "yes", but then there's
> > this other issue... :(
> 
> Yup, but keep in mind that:
> 
> xfs_fs_reserve_ag_blocks(tp == NULL)
>   xfs_ag_resv_init
>     xfs_finobt_calc_reserves
>       xfs_inobt_count_blocks
>         xfs_btree_count_blocks
> 
> Will do a finobt btree walk without a transaction pointer if
> we don't have finobt block counts recorded in the AGI. So it
> seems to me that we are already at risk of cycles in corrupted
> btrees being able to cause mount hangs.
> 
> Hence from that perspective, I think worrying about a corrupted
> rtbitmap BMBT btree is largely outside the scope of this patchset;
> it's a generic issue with transaction-less btree walks, and one we
> are already exposed to in the same phase of the mount/log recovery
> process. Hence I think we should consider how to solve this problem
> separately rather than tie this patchset into twisty knots trying to
> deal with it here...

<nod> I've long wanted to do a full audit of all the places we walk
ondisk graph metadata without a transaction, but still haven't gotten
to it... :(

Anyway, I'll leave /that/ for a future series.  In the meantime, I'll
amend the rtalloc query functions to take mp and tp.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-04-08 17:42 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-07 20:46 [PATCHSET 0/2] xfs: fix corruption of free rt extent count Darrick J. Wong
2022-04-07 20:46 ` [PATCH 1/2] xfs: recalculate free rt extents after log recovery Darrick J. Wong
2022-04-07 21:56   ` Dave Chinner
2022-04-07 23:39     ` Darrick J. Wong
2022-04-08  0:06       ` Dave Chinner
2022-04-08 17:42         ` Darrick J. Wong
2022-04-07 20:47 ` [PATCH 2/2] xfs: use a separate frextents counter for rt extent reservations Darrick J. Wong
2022-04-07 23:17   ` Dave Chinner
2022-04-07 23:45     ` Darrick J. Wong
2022-04-08  0:12       ` Dave Chinner
