[PATCHSET RFC v24.6 0/4] xfs: online repair for fs summary counters with exclusive fsfreeze

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCHSET RFC v24.6 0/4] xfs: online repair for fs summary counters with exclusive fsfreeze
@ 2023-05-03  3:02 Darrick J. Wong
  2023-05-03  3:02 ` [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Darrick J. Wong @ 2023-05-03  3:02 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, mcgrof, ruansy.fnst, linux-fsdevel

Hi all,

A longstanding deficiency in the online fs summary counter scrubbing
code is that it hasn't any means to quiesce the incore percpu counters
while it's running.  There is no way to coordinate with other threads
are reserving or freeing free space simultaneously, which leads to false
error reports.  Right now, if the discrepancy is large, we just sort of
shrug and bail out with an incomplete flag, but this is lame.

For repair activity, we actually /do/ need to stabilize the counters to
get an accurate reading and install it in the percpu counter.  To
improve the former and enable the latter, allow the fscounters online
fsck code to perform an exclusive mini-freeze on the filesystem.  The
exclusivity prevents userspace from thawing while we're running, and the
mini-freeze means that we don't wait for the log to quiesce, which will
make both speedier.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-fscounters
---
 fs/super.c                       |   74 ++++++++++++++-
 fs/xfs/Makefile                  |    1 
 fs/xfs/scrub/fscounters.c        |  185 ++++++++++++++++++++++++++++----------
 fs/xfs/scrub/fscounters.h        |   20 ++++
 fs/xfs/scrub/fscounters_repair.c |   72 +++++++++++++++
 fs/xfs/scrub/repair.h            |    2 
 fs/xfs/scrub/scrub.c             |    8 +-
 fs/xfs/scrub/scrub.h             |    1 
 fs/xfs/scrub/trace.c             |    1 
 fs/xfs/scrub/trace.h             |   22 ++++-
 fs/xfs/xfs_super.c               |   20 ++++
 fs/xfs/xfs_super.h               |    2 
 include/linux/fs.h               |    3 +
 13 files changed, 348 insertions(+), 63 deletions(-)
 create mode 100644 fs/xfs/scrub/fscounters.h
 create mode 100644 fs/xfs/scrub/fscounters_repair.c

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs
  2023-05-03  3:02 [PATCHSET RFC v24.6 0/4] xfs: online repair for fs summary counters with exclusive fsfreeze Darrick J. Wong
@ 2023-05-03  3:02 ` Darrick J. Wong
  2023-05-07  5:23   ` Luis Chamberlain
  2023-05-03  3:02 ` [PATCH 2/4] vfs: allow exclusive freezing of filesystems Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2023-05-03  3:02 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, mcgrof, ruansy.fnst, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add a tag field to sb_writers so that the freeze code tracks who froze a
filesystem.  For now the only tag is for userspace-initiated freezing,
but in the next few patches we'll introduce the ability for in-kernel
callers to freeze an fs exclusively.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/super.c         |   41 ++++++++++++++++++++++++++++++++++-------
 include/linux/fs.h |    1 +
 2 files changed, 35 insertions(+), 7 deletions(-)


diff --git a/fs/super.c b/fs/super.c
index 04bc62ab7dfe..01891f9e6d5e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -39,7 +39,10 @@
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
-static int thaw_super_locked(struct super_block *sb);
+static int thaw_super_locked(struct super_block *sb, unsigned long cookie);
+
+/* Freeze cookie denoting that userspace froze the filesystem. */
+#define USERSPACE_FREEZE_COOKIE	((unsigned long)freeze_super)
 
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
@@ -1027,7 +1030,7 @@ static void do_thaw_all_callback(struct super_block *sb)
 	down_write(&sb->s_umount);
 	if (sb->s_root && sb->s_flags & SB_BORN) {
 		emergency_thaw_bdev(sb);
-		thaw_super_locked(sb);
+		thaw_super_locked(sb, USERSPACE_FREEZE_COOKIE);
 	} else {
 		up_write(&sb->s_umount);
 	}
@@ -1636,13 +1639,18 @@ static void sb_freeze_unlock(struct super_block *sb, int level)
 }
 
 /**
- * freeze_super - lock the filesystem and force it into a consistent state
+ * __freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
+ * @cookie: magic value telling us who tried to freeze the fs
  *
  * Syncs the super to make sure the filesystem is consistent and calls the fs's
  * freeze_fs.  Subsequent calls to this without first thawing the fs will return
  * -EBUSY.
  *
+ * If a filesystem freeze is initiated, the sb->s_writers.freeze_cookie value
+ * is set to the @cookie.  The filesystem can only be thawed with the same
+ * cookie value.
+ *
  * During this function, sb->s_writers.frozen goes through these values:
  *
  * SB_UNFROZEN: File system is normal, all writes progress as usual.
@@ -1668,7 +1676,7 @@ static void sb_freeze_unlock(struct super_block *sb, int level)
  *
  * sb->s_writers.frozen is protected by sb->s_umount.
  */
-int freeze_super(struct super_block *sb)
+static int __freeze_super(struct super_block *sb, unsigned long cookie)
 {
 	int ret;
 
@@ -1684,6 +1692,7 @@ int freeze_super(struct super_block *sb)
 		return 0;	/* sic - it's "nothing to do" */
 	}
 
+	sb->s_writers.freeze_cookie = cookie;
 	if (sb_rdonly(sb)) {
 		/* Nothing to do really... */
 		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
@@ -1704,6 +1713,7 @@ int freeze_super(struct super_block *sb)
 	/* All writers are done so after syncing there won't be dirty data */
 	ret = sync_filesystem(sb);
 	if (ret) {
+		sb->s_writers.freeze_cookie = 0;
 		sb->s_writers.frozen = SB_UNFROZEN;
 		sb_freeze_unlock(sb, SB_FREEZE_PAGEFAULT);
 		wake_up(&sb->s_writers.wait_unfrozen);
@@ -1720,6 +1730,7 @@ int freeze_super(struct super_block *sb)
 		if (ret) {
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
+			sb->s_writers.freeze_cookie = 0;
 			sb->s_writers.frozen = SB_UNFROZEN;
 			sb_freeze_unlock(sb, SB_FREEZE_FS);
 			wake_up(&sb->s_writers.wait_unfrozen);
@@ -1736,18 +1747,33 @@ int freeze_super(struct super_block *sb)
 	up_write(&sb->s_umount);
 	return 0;
 }
+
+/*
+ * freeze_super - lock the filesystem and force it into a consistent state
+ * @sb: the super to lock
+ *
+ * Syncs the super to make sure the filesystem is consistent and calls the fs's
+ * freeze_fs.  Subsequent calls to this without first thawing the fs will return
+ * -EBUSY.  See the comment for __freeze_super for more information.
+ */
+int freeze_super(struct super_block *sb)
+{
+	return __freeze_super(sb, USERSPACE_FREEZE_COOKIE);
+}
 EXPORT_SYMBOL(freeze_super);
 
-static int thaw_super_locked(struct super_block *sb)
+static int thaw_super_locked(struct super_block *sb, unsigned long cookie)
 {
 	int error;
 
-	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
+	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE ||
+	    sb->s_writers.freeze_cookie != cookie) {
 		up_write(&sb->s_umount);
 		return -EINVAL;
 	}
 
 	if (sb_rdonly(sb)) {
+		sb->s_writers.freeze_cookie = 0;
 		sb->s_writers.frozen = SB_UNFROZEN;
 		goto out;
 	}
@@ -1765,6 +1791,7 @@ static int thaw_super_locked(struct super_block *sb)
 		}
 	}
 
+	sb->s_writers.freeze_cookie = 0;
 	sb->s_writers.frozen = SB_UNFROZEN;
 	sb_freeze_unlock(sb, SB_FREEZE_FS);
 out:
@@ -1782,7 +1809,7 @@ static int thaw_super_locked(struct super_block *sb)
 int thaw_super(struct super_block *sb)
 {
 	down_write(&sb->s_umount);
-	return thaw_super_locked(sb);
+	return thaw_super_locked(sb, USERSPACE_FREEZE_COOKIE);
 }
 EXPORT_SYMBOL(thaw_super);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..800772361b1e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1129,6 +1129,7 @@ enum {
 
 struct sb_writers {
 	int				frozen;		/* Is sb frozen? */
+	unsigned long			freeze_cookie;	/* who froze us? */
 	wait_queue_head_t		wait_unfrozen;	/* wait for thaw */
 	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
 };


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/4] vfs: allow exclusive freezing of filesystems
  2023-05-03  3:02 [PATCHSET RFC v24.6 0/4] xfs: online repair for fs summary counters with exclusive fsfreeze Darrick J. Wong
  2023-05-03  3:02 ` [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs Darrick J. Wong
@ 2023-05-03  3:02 ` Darrick J. Wong
  2023-05-03  3:02 ` [PATCH 3/4] xfs: stabilize fs summary counters for online fsck Darrick J. Wong
  2023-05-03  3:02 ` [PATCH 4/4] xfs: repair summary counters Darrick J. Wong
  3 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2023-05-03  3:02 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, mcgrof, ruansy.fnst, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/super.c         |   33 +++++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 2 files changed, 35 insertions(+)


diff --git a/fs/super.c b/fs/super.c
index 01891f9e6d5e..46abd21f94ac 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1762,6 +1762,24 @@ int freeze_super(struct super_block *sb)
 }
 EXPORT_SYMBOL(freeze_super);
 
+/**
+ * freeze_super_excl - lock the filesystem exclusively and force it into a
+ * consistent state.
+ * @sb: the super to lock
+ * @cookie: magic cookie to associate with this freeze so that only the caller
+ * can thaw the filesystem
+ *
+ * Syncs the super to make sure the filesystem is consistent and calls the fs's
+ * freeze_fs.  Subsequent calls to this without first thawing the fs will
+ * return -EBUSY.  The filesystem must not already be frozen, and can only be
+ * thawed by passing the same cookie to thaw_super_excl.
+ */
+int freeze_super_excl(struct super_block *sb, unsigned long cookie)
+{
+	return __freeze_super(sb, cookie);
+}
+EXPORT_SYMBOL(freeze_super_excl);
+
 static int thaw_super_locked(struct super_block *sb, unsigned long cookie)
 {
 	int error;
@@ -1813,6 +1831,21 @@ int thaw_super(struct super_block *sb)
 }
 EXPORT_SYMBOL(thaw_super);
 
+/**
+ * thaw_super_excl -- unfreeze filesystem
+ * @sb: the super to thaw
+ * @cookie: magic cookie passed to freeze_super_excl
+ *
+ * Releases an exclusive freeze on a filesystem and marks it writeable again
+ * after freeze_super().
+ */
+int thaw_super_excl(struct super_block *sb, unsigned long cookie)
+{
+	down_write(&sb->s_umount);
+	return thaw_super_locked(sb, cookie);
+}
+EXPORT_SYMBOL(thaw_super_excl);
+
 /*
  * Create workqueue for deferred direct IO completions. We allocate the
  * workqueue when it's first needed. This avoids creating workqueue for
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 800772361b1e..3a65bcc45f67 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2272,6 +2272,8 @@ extern int user_statfs(const char __user *, struct kstatfs *);
 extern int fd_statfs(int, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
+extern int freeze_super_excl(struct super_block *super, unsigned long cookie);
+extern int thaw_super_excl(struct super_block *super, unsigned long cookie);
 extern __printf(2, 3)
 int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
 extern int super_setup_bdi(struct super_block *sb);


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/4] xfs: stabilize fs summary counters for online fsck
  2023-05-03  3:02 [PATCHSET RFC v24.6 0/4] xfs: online repair for fs summary counters with exclusive fsfreeze Darrick J. Wong
  2023-05-03  3:02 ` [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs Darrick J. Wong
  2023-05-03  3:02 ` [PATCH 2/4] vfs: allow exclusive freezing of filesystems Darrick J. Wong
@ 2023-05-03  3:02 ` Darrick J. Wong
  2023-05-03  3:02 ` [PATCH 4/4] xfs: repair summary counters Darrick J. Wong
  3 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2023-05-03  3:02 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, mcgrof, ruansy.fnst, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

If the fscounters scrubber notices incorrect summary counters, it's
entirely possible that scrub is simply racing with other threads that
are updating the incore counters.  There isn't a good way to stabilize
percpu counters or set ourselves up to observe live updates with hooks
like we do for the quotacheck or nlinks scanners, so we instead choose
to freeze the filesystem long enough to walk the incore per-AG
structures.

Past me thought that it was going to be commonplace to have to freeze
the filesystem to perform some kind of repair and set up a whole
separate infrastructure to freeze the filesystem in such a way that
userspace could not unfreeze while we were running.  This involved
adding a mutex and freeze_super/thaw_super functions and dealing with
the fact that the VFS freeze/thaw functions can free the VFS superblock
references on return.

This was all very overwrought, since fscounters turned out to be the
only user of scrub freezes, and it doesn't require the log to quiesce,
only the incore superblock counters.  We prevent other threads from
changing the freeze level by calling freeze_super_excl with a custom
freeze cookie to keep everyone else out of the filesystem.

The end result is that fscounters should be much more efficient.  When
we're checking a busy system and we can't stabilize the counters, the
custom freeze will do less work, which should result in less downtime.
Repair should be similarly speedy, but that's in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/fscounters.c |  172 +++++++++++++++++++++++++++++++++------------
 fs/xfs/scrub/fscounters.h |   20 +++++
 fs/xfs/scrub/scrub.c      |    6 +-
 fs/xfs/scrub/scrub.h      |    1 
 fs/xfs/scrub/trace.h      |    1 
 fs/xfs/xfs_super.c        |   20 +++++
 fs/xfs/xfs_super.h        |    2 +
 7 files changed, 172 insertions(+), 50 deletions(-)
 create mode 100644 fs/xfs/scrub/fscounters.h


diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index e382a35e98d8..93498eb38216 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -1,4 +1,4 @@
-// SPDX-License-Identifier: GPL-2.0+
+// SPDX-License-Identifier: GPL-2.0-or-later
 /*
  * Copyright (C) 2019-2023 Oracle.  All Rights Reserved.
  * Author: Darrick J. Wong <djwong@kernel.org>
@@ -16,9 +16,11 @@
 #include "xfs_ag.h"
 #include "xfs_rtalloc.h"
 #include "xfs_inode.h"
+#include "xfs_icache.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
+#include "scrub/fscounters.h"
 
 /*
  * FS Summary Counters
@@ -45,16 +47,6 @@
  * our tolerance for mismatch between expected and actual counter values.
  */
 
-struct xchk_fscounters {
-	struct xfs_scrub	*sc;
-	uint64_t		icount;
-	uint64_t		ifree;
-	uint64_t		fdblocks;
-	uint64_t		frextents;
-	unsigned long long	icount_min;
-	unsigned long long	icount_max;
-};
-
 /*
  * Since the expected value computation is lockless but only browses incore
  * values, the percpu counters should be fairly close to each other.  However,
@@ -123,6 +115,58 @@ xchk_fscount_warmup(
 	return error;
 }
 
+/*
+ * We couldn't stabilize the filesystem long enough to sample all the variables
+ * that comprise the summary counters and compare them to the percpu counters.
+ * We need to disable all writer threads, which means taking the first two
+ * freeze levels to put userspace to sleep, and the third freeze level to
+ * prevent background threads from starting new transactions.  Take one level
+ * more to prevent other callers from unfreezing the filesystem while we run.
+ */
+STATIC int
+xchk_fscounters_freeze(
+	struct xfs_scrub	*sc)
+{
+	struct xchk_fscounters	*fsc = sc->buf;
+	struct xfs_mount	*mp = sc->mp;
+	int			error = 0;
+
+	if (sc->flags & XCHK_HAVE_FREEZE_PROT) {
+		sc->flags &= ~XCHK_HAVE_FREEZE_PROT;
+		mnt_drop_write_file(sc->file);
+	}
+
+	/* Wait until we're ready to freeze or give up. */
+	while (freeze_super_excl(mp->m_super, XFS_FREEZE_SCRUB_COOKIE(mp))) {
+		if (xchk_should_terminate(sc, &error))
+			return error;
+
+		delay(HZ / 10);
+	}
+
+	fsc->frozen = true;
+	return 0;
+}
+
+/* Thaw the filesystem after checking or repairing fscounters. */
+STATIC void
+xchk_fscounters_cleanup(
+	void			*buf)
+{
+	struct xchk_fscounters	*fsc = buf;
+	struct xfs_scrub	*sc = fsc->sc;
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
+
+	if (!fsc->frozen)
+		return;
+
+	/* This should always succeed, we froze the fs exclusively. */
+	error = thaw_super_excl(mp->m_super, XFS_FREEZE_SCRUB_COOKIE(mp));
+	if (error)
+		xfs_emerg(mp, "still frozen after scrub, err=%d", error);
+}
+
 int
 xchk_setup_fscounters(
 	struct xfs_scrub	*sc)
@@ -140,6 +184,7 @@ xchk_setup_fscounters(
 	sc->buf = kzalloc(sizeof(struct xchk_fscounters), XCHK_GFP_FLAGS);
 	if (!sc->buf)
 		return -ENOMEM;
+	sc->buf_cleanup = xchk_fscounters_cleanup;
 	fsc = sc->buf;
 	fsc->sc = sc;
 
@@ -150,7 +195,18 @@ xchk_setup_fscounters(
 	if (error)
 		return error;
 
-	return xchk_trans_alloc(sc, 0);
+	/*
+	 * Pause all writer activity in the filesystem while we're scrubbing to
+	 * reduce the likelihood of background perturbations to the counters
+	 * throwing off our calculations.
+	 */
+	if (sc->flags & XCHK_TRY_HARDER) {
+		error = xchk_fscounters_freeze(sc);
+		if (error)
+			return error;
+	}
+
+	return xchk_trans_alloc_empty(sc);
 }
 
 /*
@@ -290,8 +346,7 @@ xchk_fscount_aggregate_agcounts(
 	if (fsc->ifree > fsc->icount) {
 		if (tries--)
 			goto retry;
-		xchk_set_incomplete(sc);
-		return 0;
+		return -EDEADLOCK;
 	}
 
 	return 0;
@@ -367,6 +422,8 @@ xchk_fscount_count_frextents(
  * Otherwise, we /might/ have a problem.  If the change in the summations is
  * more than we want to tolerate, the filesystem is probably busy and we should
  * just send back INCOMPLETE and see if userspace will try again.
+ *
+ * If we're repairing then we require an exact match.
  */
 static inline bool
 xchk_fscount_within_range(
@@ -396,21 +453,7 @@ xchk_fscount_within_range(
 	if (expected >= min_value && expected <= max_value)
 		return true;
 
-	/*
-	 * If the difference between the two summations is too large, the fs
-	 * might just be busy and so we'll mark the scrub incomplete.  Return
-	 * true here so that we don't mark the counter corrupt.
-	 *
-	 * XXX: In the future when userspace can grant scrub permission to
-	 * quiesce the filesystem to solve the outsized variance problem, this
-	 * check should be moved up and the return code changed to signal to
-	 * userspace that we need quiesce permission.
-	 */
-	if (max_value - min_value >= XCHK_FSCOUNT_MIN_VARIANCE) {
-		xchk_set_incomplete(sc);
-		return true;
-	}
-
+	/* Everything else is bad. */
 	return false;
 }
 
@@ -422,6 +465,7 @@ xchk_fscounters(
 	struct xfs_mount	*mp = sc->mp;
 	struct xchk_fscounters	*fsc = sc->buf;
 	int64_t			icount, ifree, fdblocks, frextents;
+	bool			try_again = false;
 	int			error;
 
 	/* Snapshot the percpu counters. */
@@ -431,9 +475,26 @@ xchk_fscounters(
 	frextents = percpu_counter_sum(&mp->m_frextents);
 
 	/* No negative values, please! */
-	if (icount < 0 || ifree < 0 || fdblocks < 0 || frextents < 0)
+	if (icount < 0 || ifree < 0)
 		xchk_set_corrupt(sc);
 
+	/*
+	 * If the filesystem is not frozen, the counter summation calls above
+	 * can race with xfs_mod_freecounter, which subtracts a requested space
+	 * reservation from the counter and undoes the subtraction if that made
+	 * the counter go negative.  Therefore, it's possible to see negative
+	 * values here, and we should only flag that as a corruption if we
+	 * froze the fs.  This is much more likely to happen with frextents
+	 * since there are no reserved pools.
+	 */
+	if (fdblocks < 0 || frextents < 0) {
+		if (!fsc->frozen)
+			return -EDEADLOCK;
+
+		xchk_set_corrupt(sc);
+		return 0;
+	}
+
 	/* See if icount is obviously wrong. */
 	if (icount < fsc->icount_min || icount > fsc->icount_max)
 		xchk_set_corrupt(sc);
@@ -446,12 +507,6 @@ xchk_fscounters(
 	if (frextents > mp->m_sb.sb_rextents)
 		xchk_set_corrupt(sc);
 
-	/*
-	 * XXX: We can't quiesce percpu counter updates, so exit early.
-	 * This can be re-enabled when we gain exclusive freeze functionality.
-	 */
-	return 0;
-
 	/*
 	 * If ifree exceeds icount by more than the minimum variance then
 	 * something's probably wrong with the counters.
@@ -463,8 +518,6 @@ xchk_fscounters(
 	error = xchk_fscount_aggregate_agcounts(sc, fsc);
 	if (!xchk_process_error(sc, 0, XFS_SB_BLOCK(mp), &error))
 		return error;
-	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)
-		return 0;
 
 	/* Count the free extents counter for rt volumes. */
 	error = xchk_fscount_count_frextents(sc, fsc);
@@ -473,20 +526,45 @@ xchk_fscounters(
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)
 		return 0;
 
-	/* Compare the in-core counters with whatever we counted. */
-	if (!xchk_fscount_within_range(sc, icount, &mp->m_icount, fsc->icount))
-		xchk_set_corrupt(sc);
+	/*
+	 * Compare the in-core counters with whatever we counted.  If the fs is
+	 * frozen, we treat the discrepancy as a corruption because the freeze
+	 * should have stabilized the counter values.  Otherwise, we need
+	 * userspace to call us back having granted us freeze permission.
+	 */
+	if (!xchk_fscount_within_range(sc, icount, &mp->m_icount,
+				fsc->icount)) {
+		if (fsc->frozen)
+			xchk_set_corrupt(sc);
+		else
+			try_again = true;
+	}
 
-	if (!xchk_fscount_within_range(sc, ifree, &mp->m_ifree, fsc->ifree))
-		xchk_set_corrupt(sc);
+	if (!xchk_fscount_within_range(sc, ifree, &mp->m_ifree, fsc->ifree)) {
+		if (fsc->frozen)
+			xchk_set_corrupt(sc);
+		else
+			try_again = true;
+	}
 
 	if (!xchk_fscount_within_range(sc, fdblocks, &mp->m_fdblocks,
-			fsc->fdblocks))
-		xchk_set_corrupt(sc);
+			fsc->fdblocks)) {
+		if (fsc->frozen)
+			xchk_set_corrupt(sc);
+		else
+			try_again = true;
+	}
 
 	if (!xchk_fscount_within_range(sc, frextents, &mp->m_frextents,
-			fsc->frextents))
-		xchk_set_corrupt(sc);
+			fsc->frextents)) {
+		if (fsc->frozen)
+			xchk_set_corrupt(sc);
+		else
+			try_again = true;
+	}
+
+	if (try_again)
+		return -EDEADLOCK;
 
 	return 0;
 }
diff --git a/fs/xfs/scrub/fscounters.h b/fs/xfs/scrub/fscounters.h
new file mode 100644
index 000000000000..b36f7157986f
--- /dev/null
+++ b/fs/xfs/scrub/fscounters.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2021-2023 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_FSCOUNTERS_H__
+#define __XFS_SCRUB_FSCOUNTERS_H__
+
+struct xchk_fscounters {
+	struct xfs_scrub	*sc;
+	uint64_t		icount;
+	uint64_t		ifree;
+	uint64_t		fdblocks;
+	uint64_t		frextents;
+	unsigned long long	icount_min;
+	unsigned long long	icount_max;
+	bool			frozen;
+};
+
+#endif /* __XFS_SCRUB_FSCOUNTERS_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index d4396c351373..e487b424b12b 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -187,8 +187,10 @@ xchk_teardown(
 		xchk_irele(sc, sc->ip);
 		sc->ip = NULL;
 	}
-	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+	if (sc->flags & XCHK_HAVE_FREEZE_PROT) {
+		sc->flags &= ~XCHK_HAVE_FREEZE_PROT;
 		mnt_drop_write_file(sc->file);
+	}
 	if (sc->xfile) {
 		xfile_destroy(sc->xfile);
 		sc->xfile = NULL;
@@ -539,6 +541,8 @@ xfs_scrub_metadata(
 		error = mnt_want_write_file(sc->file);
 		if (error)
 			goto out_sc;
+
+		sc->flags |= XCHK_HAVE_FREEZE_PROT;
 	}
 
 	/* Set up for the operation. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 99b48e996d51..a41ba8d319b6 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -118,6 +118,7 @@ struct xfs_scrub {
 
 /* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */
 #define XCHK_TRY_HARDER		(1 << 0)  /* can't get resources, try again */
+#define XCHK_HAVE_FREEZE_PROT	(1 << 1)  /* do we have freeze protection? */
 #define XCHK_FSGATES_DRAIN	(1 << 2)  /* defer ops draining enabled */
 #define XCHK_NEED_DRAIN		(1 << 3)  /* scrub needs to drain defer ops */
 #define XCHK_FSGATES_QUOTA	(1 << 4)  /* quota live update enabled */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index d4f142929518..6a50e6c89195 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -113,6 +113,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY);
 
 #define XFS_SCRUB_STATE_STRINGS \
 	{ XCHK_TRY_HARDER,			"try_harder" }, \
+	{ XCHK_HAVE_FREEZE_PROT,		"nofreeze" }, \
 	{ XCHK_FSGATES_DRAIN,			"fsgates_drain" }, \
 	{ XCHK_NEED_DRAIN,			"need_drain" }, \
 	{ XCHK_FSGATES_QUOTA,			"fsgates_quota" }, \
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 67ebb9d5ed21..cb13b087dcde 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -915,6 +915,15 @@ xfs_fs_freeze(
 	unsigned int		flags;
 	int			ret;
 
+	/*
+	 * Online fsck freezes the filesystem to pause writer threads and
+	 * background garbage collection so that the free space counters do not
+	 * change.  The gc threads are already paused, so return without
+	 * changing the space reservations or flushing the log.
+	 */
+	if (sb->s_writers.freeze_cookie == XFS_FREEZE_SCRUB_COOKIE(mp))
+		return 0;
+
 	/*
 	 * The filesystem is now frozen far enough that memory reclaim
 	 * cannot safely operate on the filesystem. Hence we need to
@@ -946,8 +955,15 @@ xfs_fs_unfreeze(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
-	xfs_restore_resvblks(mp);
-	xfs_log_work_queue(mp);
+	/*
+	 * Online fsck froze the filesystem to pause writer threads to check
+	 * the free space counters.  We didn't pause the log or touch the
+	 * reserve pool, so we only need to reactivate the gc threads.
+	 */
+	if (sb->s_writers.freeze_cookie != XFS_FREEZE_SCRUB_COOKIE(mp)) {
+		xfs_restore_resvblks(mp);
+		xfs_log_work_queue(mp);
+	}
 
 	/*
 	 * Don't reactivate the inodegc worker on a readonly filesystem because
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index 364e2c2648a8..328d8701eae3 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -100,4 +100,6 @@ extern struct workqueue_struct *xfs_discard_wq;
 
 #define XFS_M(sb)		((struct xfs_mount *)((sb)->s_fs_info))
 
+#define XFS_FREEZE_SCRUB_COOKIE(mp)	((unsigned long)(mp) | 1)
+
 #endif	/* __XFS_SUPER_H__ */


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 4/4] xfs: repair summary counters
  2023-05-03  3:02 [PATCHSET RFC v24.6 0/4] xfs: online repair for fs summary counters with exclusive fsfreeze Darrick J. Wong
                   ` (2 preceding siblings ...)
  2023-05-03  3:02 ` [PATCH 3/4] xfs: stabilize fs summary counters for online fsck Darrick J. Wong
@ 2023-05-03  3:02 ` Darrick J. Wong
  3 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2023-05-03  3:02 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, mcgrof, ruansy.fnst, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Use the same summary counter calculation infrastructure to generate new
values for the in-core summary counters.   The difference between the
scrubber and the repairer is that the repairer will freeze the fs during
setup, which means that the values should match exactly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                  |    1 +
 fs/xfs/scrub/fscounters.c        |   15 +++++++-
 fs/xfs/scrub/fscounters_repair.c |   72 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h            |    2 +
 fs/xfs/scrub/scrub.c             |    2 +
 fs/xfs/scrub/trace.c             |    1 +
 fs/xfs/scrub/trace.h             |   21 +++++++++--
 7 files changed, 107 insertions(+), 7 deletions(-)
 create mode 100644 fs/xfs/scrub/fscounters_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index b97b7ea74109..ea90abdd9941 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -188,6 +188,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   bmap_repair.o \
 				   cow_repair.o \
+				   fscounters_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   newbt.o \
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index 93498eb38216..913ef40ad7ee 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -199,8 +199,13 @@ xchk_setup_fscounters(
 	 * Pause all writer activity in the filesystem while we're scrubbing to
 	 * reduce the likelihood of background perturbations to the counters
 	 * throwing off our calculations.
+	 *
+	 * If we're repairing, we need to prevent any other thread from
+	 * changing the global fs summary counters while we're repairing them.
+	 * This requires the fs to be frozen, which will disable background
+	 * reclaim and purge all inactive inodes.
 	 */
-	if (sc->flags & XCHK_TRY_HARDER) {
+	if ((sc->flags & XCHK_TRY_HARDER) || xchk_could_repair(sc)) {
 		error = xchk_fscounters_freeze(sc);
 		if (error)
 			return error;
@@ -218,7 +223,9 @@ xchk_setup_fscounters(
  * set the INCOMPLETE flag even when a negative errno is returned.  This care
  * must be taken with certain errno values (i.e. EFSBADCRC, EFSCORRUPTED,
  * ECANCELED) that are absorbed into a scrub state flag update by
- * xchk_*_process_error.
+ * xchk_*_process_error.  Scrub and repair share the same incore data
+ * structures, so the INCOMPLETE flag is critical to prevent a repair based on
+ * insufficient information.
  */
 
 /* Count free space btree blocks manually for pre-lazysbcount filesystems. */
@@ -446,6 +453,10 @@ xchk_fscount_within_range(
 	if (curr_value == expected)
 		return true;
 
+	/* We require exact matches when repair is running. */
+	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+		return false;
+
 	min_value = min(old_value, curr_value);
 	max_value = max(old_value, curr_value);
 
diff --git a/fs/xfs/scrub/fscounters_repair.c b/fs/xfs/scrub/fscounters_repair.c
new file mode 100644
index 000000000000..abd281dcb344
--- /dev/null
+++ b/fs/xfs/scrub/fscounters_repair.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2018-2023 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "xfs_health.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/fscounters.h"
+
+/*
+ * FS Summary Counters
+ * ===================
+ *
+ * We correct errors in the filesystem summary counters by setting them to the
+ * values computed during the obligatory scrub phase.  However, we must be
+ * careful not to allow any other thread to change the counters while we're
+ * computing and setting new values.  To achieve this, we freeze the
+ * filesystem for the whole operation if the REPAIR flag is set.  The checking
+ * function is stricter when we've frozen the fs.
+ */
+
+/*
+ * Reset the superblock counters.  Caller is responsible for freezing the
+ * filesystem during the calculation and reset phases.
+ */
+int
+xrep_fscounters(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xchk_fscounters	*fsc = sc->buf;
+
+	/*
+	 * Reinitialize the in-core counters from what we computed.  We froze
+	 * the filesystem, so there shouldn't be anyone else trying to modify
+	 * these counters.
+	 */
+	if (!fsc->frozen) {
+		ASSERT(fsc->frozen);
+		return -EFSCORRUPTED;
+	}
+
+	trace_xrep_reset_counters(mp, fsc);
+
+	percpu_counter_set(&mp->m_icount, fsc->icount);
+	percpu_counter_set(&mp->m_ifree, fsc->ifree);
+	percpu_counter_set(&mp->m_fdblocks, fsc->fdblocks);
+	percpu_counter_set(&mp->m_frextents, fsc->frextents);
+	mp->m_sb.sb_frextents = fsc->frextents;
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index e70b3afde39d..2e72b2557f65 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -109,6 +109,7 @@ int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_bmap_cow(struct xfs_scrub *sc);
 int xrep_nlinks(struct xfs_scrub *sc);
+int xrep_fscounters(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_RT
 int xrep_rtbitmap(struct xfs_scrub *sc);
@@ -194,6 +195,7 @@ static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
 #define xrep_quota			xrep_notsupported
 #define xrep_quotacheck			xrep_notsupported
 #define xrep_nlinks			xrep_notsupported
+#define xrep_fscounters			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index e487b424b12b..cf8e78c16670 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -367,7 +367,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_FS,
 		.setup	= xchk_setup_fscounters,
 		.scrub	= xchk_fscounters,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_fscounters,
 	},
 	[XFS_SCRUB_TYPE_QUOTACHECK] = {	/* quota counters */
 		.type	= ST_FS,
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index a2511bdc5dba..1fe5c5a9a1ba 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -20,6 +20,7 @@
 #include "scrub/xfarray.h"
 #include "scrub/iscan.h"
 #include "scrub/nlinks.h"
+#include "scrub/fscounters.h"
 
 /* Figure out which block the btree cursor was pointing to. */
 static inline xfs_fsblock_t
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 6a50e6c89195..4aefa0533a12 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -23,6 +23,7 @@ struct xfarray;
 struct xfarray_sortinfo;
 struct xchk_iscan;
 struct xchk_nlink;
+struct xchk_fscounters;
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -1671,16 +1672,28 @@ TRACE_EVENT(xrep_calc_ag_resblks_btsize,
 		  __entry->refcbt_sz)
 )
 TRACE_EVENT(xrep_reset_counters,
-	TP_PROTO(struct xfs_mount *mp),
-	TP_ARGS(mp),
+	TP_PROTO(struct xfs_mount *mp, struct xchk_fscounters *fsc),
+	TP_ARGS(mp, fsc),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(uint64_t, icount)
+		__field(uint64_t, ifree)
+		__field(uint64_t, fdblocks)
+		__field(uint64_t, frextents)
 	),
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
+		__entry->icount = fsc->icount;
+		__entry->ifree = fsc->ifree;
+		__entry->fdblocks = fsc->fdblocks;
+		__entry->frextents = fsc->frextents;
 	),
-	TP_printk("dev %d:%d",
-		  MAJOR(__entry->dev), MINOR(__entry->dev))
+	TP_printk("dev %d:%d icount %llu ifree %llu fdblocks %llu frextents %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->icount,
+		  __entry->ifree,
+		  __entry->fdblocks,
+		  __entry->frextents)
 )
 
 DECLARE_EVENT_CLASS(xrep_newbt_extent_class,


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs
  2023-05-03  3:02 ` [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs Darrick J. Wong
@ 2023-05-07  5:23   ` Luis Chamberlain
  2023-05-17 17:13     ` Shiyang Ruan
  2023-05-18  6:07     ` Darrick J. Wong
  0 siblings, 2 replies; 8+ messages in thread
From: Luis Chamberlain @ 2023-05-07  5:23 UTC (permalink / raw)
  To: Darrick J. Wong, Jan Kara, Andreas Gruenbacher
  Cc: linux-xfs, ruansy.fnst, linux-fsdevel

On Tue, May 02, 2023 at 08:02:18PM -0700, Darrick J. Wong wrote:
> diff --git a/fs/super.c b/fs/super.c
> index 04bc62ab7dfe..01891f9e6d5e 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1736,18 +1747,33 @@ int freeze_super(struct super_block *sb)
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> +
> +/*
> + * freeze_super - lock the filesystem and force it into a consistent state
> + * @sb: the super to lock
> + *
> + * Syncs the super to make sure the filesystem is consistent and calls the fs's
> + * freeze_fs.  Subsequent calls to this without first thawing the fs will return
> + * -EBUSY.  See the comment for __freeze_super for more information.
> + */
> +int freeze_super(struct super_block *sb)
> +{
> +	return __freeze_super(sb, USERSPACE_FREEZE_COOKIE);
> +}
>  EXPORT_SYMBOL(freeze_super);
>  
> -static int thaw_super_locked(struct super_block *sb)
> +static int thaw_super_locked(struct super_block *sb, unsigned long cookie)
>  {
>  	int error;
>  
> -	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
> +	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE ||
> +	    sb->s_writers.freeze_cookie != cookie) {
>  		up_write(&sb->s_umount);
>  		return -EINVAL;

We get the same by just having drivers use freeze_super(sb, true) in the
patches I have, ie, we treat it a user-initiated.

On freeze() we have:

int freeze_super(struct super_block *sb, bool usercall)                                              
{                                                                                                    
	int ret;                                                                                     

	if(!usercall && sb_is_frozen(sb))                                                           
		return 0;                                                                            

	if (!sb_is_unfrozen(sb))
	return -EBUSY;
	...
}

On thaw we end up with:

int thaw_super(struct super_block *sb, bool usercall)
{
	int error;

	if (!usercall) {
		/*
		 * If userspace initiated the freeze don't let the kernel
		 *  thaw it on return from a kernel initiated freeze.
		 */
		 if (sb_is_unfrozen(sb) || sb_is_frozen_by_user(sb))
		 	return 0;
	}

	if (!sb_is_frozen(sb))
		return -EINVAL;
	...
}

As I had it, I had made the drivers and the bdev freeze use the usercall as
true and so there is no change.

In case there is a filesystem already frozen then which was initiated by
the filesystem, for whatever reason, the filesystem the kernel auto-freeze
will chug on happy with the system freeze, it bails out withour error
and moves on to the next filesystem to freeze.

Upon thaw, the kernel auto-thaw will detect that the filesystem was
frozen by user on sb_is_frozen_by_user() and so will just bail and not
thaw it.

If the mechanism you want to introduce is to allow a filesystem to even
prevent kernel auto-freeze with -EBUSY it begs the question if that
shouldn't also prevent suspend. Because it would anyway as you have it
right now with your patch but it would return -EINVAL. I also ask because of
the possible issues with the filesystem not going to suspend but the backing
or other possible related devices going to suspend.

Since I think the goal is to prevent the kernel auto-freeze due to
online fsck to complete, then I think you *do* want to prevent full
system suspend from moving forward. In that case, why not just have
the filesystem check for that and return -EBUSY on its respective
filesystem sb->s_op->freeze_fs(sb) callback?

  Luis

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs
  2023-05-07  5:23   ` Luis Chamberlain
@ 2023-05-17 17:13     ` Shiyang Ruan
  2023-05-18  6:07     ` Darrick J. Wong
  1 sibling, 0 replies; 8+ messages in thread
From: Shiyang Ruan @ 2023-05-17 17:13 UTC (permalink / raw)
  To: Luis Chamberlain, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, Jan Kara, Andreas Gruenbacher



在 2023/5/7 13:23, Luis Chamberlain 写道:
> On Tue, May 02, 2023 at 08:02:18PM -0700, Darrick J. Wong wrote:
>> diff --git a/fs/super.c b/fs/super.c
>> index 04bc62ab7dfe..01891f9e6d5e 100644
>> --- a/fs/super.c
>> +++ b/fs/super.c
>> @@ -1736,18 +1747,33 @@ int freeze_super(struct super_block *sb)
>>   	up_write(&sb->s_umount);
>>   	return 0;
>>   }
>> +
>> +/*
>> + * freeze_super - lock the filesystem and force it into a consistent state
>> + * @sb: the super to lock
>> + *
>> + * Syncs the super to make sure the filesystem is consistent and calls the fs's
>> + * freeze_fs.  Subsequent calls to this without first thawing the fs will return
>> + * -EBUSY.  See the comment for __freeze_super for more information.
>> + */
>> +int freeze_super(struct super_block *sb)
>> +{
>> +	return __freeze_super(sb, USERSPACE_FREEZE_COOKIE);
>> +}
>>   EXPORT_SYMBOL(freeze_super);
>>   
>> -static int thaw_super_locked(struct super_block *sb)
>> +static int thaw_super_locked(struct super_block *sb, unsigned long cookie)
>>   {
>>   	int error;
>>   
>> -	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
>> +	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE ||
>> +	    sb->s_writers.freeze_cookie != cookie) {
>>   		up_write(&sb->s_umount);
>>   		return -EINVAL;
> 
> We get the same by just having drivers use freeze_super(sb, true) in the
> patches I have, ie, we treat it a user-initiated.
> 
> On freeze() we have:
> 
> int freeze_super(struct super_block *sb, bool usercall)
> {
> 	int ret;
> 	
> 	if(!usercall && sb_is_frozen(sb))
> 		return 0;
> 
> 	if (!sb_is_unfrozen(sb))
> 	return -EBUSY;
> 	...
> }
> 
> On thaw we end up with:
> 
> int thaw_super(struct super_block *sb, bool usercall)
> {
> 	int error;
> 
> 	if (!usercall) {
> 		/*
> 		 * If userspace initiated the freeze don't let the kernel
> 		 *  thaw it on return from a kernel initiated freeze.
> 		 */
> 		 if (sb_is_unfrozen(sb) || sb_is_frozen_by_user(sb))
> 		 	return 0;
> 	}
> 
> 	if (!sb_is_frozen(sb))
> 		return -EINVAL;
> 	...
> }
> 
> As I had it, I had made the drivers and the bdev freeze use the usercall as
> true and so there is no change.
> 
> In case there is a filesystem already frozen then which was initiated by
> the filesystem, for whatever reason, the filesystem the kernel auto-freeze
> will chug on happy with the system freeze, it bails out withour error
> and moves on to the next filesystem to freeze.
> 
> Upon thaw, the kernel auto-thaw will detect that the filesystem was
> frozen by user on sb_is_frozen_by_user() and so will just bail and not
> thaw it.

Hi, Luis

Thanks for the great idea.  I also need this upgraded API for a unbind 
mechanism on pmem device, which is finally called in 
xfs_notify_failure.c where we want to freeze the fs to prevent any other 
new file mappings from being created.  In my case, I think we should 
think it as a kernel-initiated freeze, and hope it won't be thaw by 
others, especially userspace-initiated thaw.

In my understanding of your implementation, if there is a 
userspace-initiated thaw, with @usercall is set true, thaw_super(sb, 
true) will ignore any others' freeze and thaw the fs anyway.  But, 
except in my case, I think the order of userspace-initiated freeze/thaw 
may be messed up due to bugs in the user app, then the kernel-initiated 
freeze state could be accidentally broken...  In my opinion, the kernel 
code is more reliable.  Therefore, kernel-initiated freeze should be 
exclusive at least.


--
Thanks,
Ruan.

> 
> If the mechanism you want to introduce is to allow a filesystem to even
> prevent kernel auto-freeze with -EBUSY it begs the question if that
> shouldn't also prevent suspend. Because it would anyway as you have it
> right now with your patch but it would return -EINVAL. I also ask because of
> the possible issues with the filesystem not going to suspend but the backing
> or other possible related devices going to suspend.
> 
> Since I think the goal is to prevent the kernel auto-freeze due to
> online fsck to complete, then I think you *do* want to prevent full
> system suspend from moving forward. In that case, why not just have
> the filesystem check for that and return -EBUSY on its respective
> filesystem sb->s_op->freeze_fs(sb) callback?
> 
>    Luis

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs
  2023-05-07  5:23   ` Luis Chamberlain
  2023-05-17 17:13     ` Shiyang Ruan
@ 2023-05-18  6:07     ` Darrick J. Wong
  1 sibling, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2023-05-18  6:07 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Jan Kara, Andreas Gruenbacher, linux-xfs, ruansy.fnst, linux-fsdevel

On Sat, May 06, 2023 at 10:23:13PM -0700, Luis Chamberlain wrote:
> On Tue, May 02, 2023 at 08:02:18PM -0700, Darrick J. Wong wrote:
> > diff --git a/fs/super.c b/fs/super.c
> > index 04bc62ab7dfe..01891f9e6d5e 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1736,18 +1747,33 @@ int freeze_super(struct super_block *sb)
> >  	up_write(&sb->s_umount);
> >  	return 0;
> >  }
> > +
> > +/*
> > + * freeze_super - lock the filesystem and force it into a consistent state
> > + * @sb: the super to lock
> > + *
> > + * Syncs the super to make sure the filesystem is consistent and calls the fs's
> > + * freeze_fs.  Subsequent calls to this without first thawing the fs will return
> > + * -EBUSY.  See the comment for __freeze_super for more information.
> > + */
> > +int freeze_super(struct super_block *sb)
> > +{
> > +	return __freeze_super(sb, USERSPACE_FREEZE_COOKIE);
> > +}
> >  EXPORT_SYMBOL(freeze_super);
> >  
> > -static int thaw_super_locked(struct super_block *sb)
> > +static int thaw_super_locked(struct super_block *sb, unsigned long cookie)
> >  {
> >  	int error;
> >  
> > -	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
> > +	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE ||
> > +	    sb->s_writers.freeze_cookie != cookie) {
> >  		up_write(&sb->s_umount);
> >  		return -EINVAL;
> 
> We get the same by just having drivers use freeze_super(sb, true) in the
> patches I have, ie, we treat it a user-initiated.
> 
> On freeze() we have:
> 
> int freeze_super(struct super_block *sb, bool usercall)                                              
> {                                                                                                    
> 	int ret;                                                                                     
> 	
> 	if(!usercall && sb_is_frozen(sb))                                                           
> 		return 0;                                                                            
> 
> 	if (!sb_is_unfrozen(sb))
> 	return -EBUSY;
> 	...
> }
> 
> On thaw we end up with:
> 
> int thaw_super(struct super_block *sb, bool usercall)
> {
> 	int error;
> 
> 	if (!usercall) {
> 		/*
> 		 * If userspace initiated the freeze don't let the kernel
> 		 *  thaw it on return from a kernel initiated freeze.
> 		 */
> 		 if (sb_is_unfrozen(sb) || sb_is_frozen_by_user(sb))
> 		 	return 0;
> 	}
> 
> 	if (!sb_is_frozen(sb))
> 		return -EINVAL;
> 	...
> }
> 
> As I had it, I had made the drivers and the bdev freeze use the usercall as
> true and so there is no change.
> 
> In case there is a filesystem already frozen then which was initiated by
> the filesystem, for whatever reason, the filesystem the kernel auto-freeze
> will chug on happy with the system freeze, it bails out withour error
> and moves on to the next filesystem to freeze.

Yes.  Your patchset has the nicer behavior that a kernel freeze isn't
preempted by userspace already having frozen the fs, but at a cost that
userspace can thaw the fs while the kernel still thinks it's frozen.

> Upon thaw, the kernel auto-thaw will detect that the filesystem was
> frozen by user on sb_is_frozen_by_user() and so will just bail and not
> thaw it.
> 
> If the mechanism you want to introduce is to allow a filesystem to even
> prevent kernel auto-freeze with -EBUSY it begs the question if that

For the scrub case, yes, we do want to block suspend because (a) we're
running metadata transactions and (b) we haven't quiesced the xfs log,
because scrub doesn't need (or want) to wait for that.  It only needs to
prevent writes, write faults, writeback, and background garbage
collection.  And it can't allow userspace to turn any of that back on.

> shouldn't also prevent suspend. Because it would anyway as you have it
> right now with your patch but it would return -EINVAL.

Huh?  With this patchset, freeze_super returns EBUSY if the fs is
frozen, no matter who froze it, or who wants to freeze it.

> I also ask because of
> the possible issues with the filesystem not going to suspend but the backing
> or other possible related devices going to suspend.

Hm.  Perhaps the freeze state needs to track one bit for userspace
freeze and another for kernel freeze.  Kernel freezes are mutually
exclusive, but they can combine with userspace freezes.  Userspace
freezes remain shared.

Kernel thaw cannot break a userspace thaw, nor can userspace thaws break
a kernel thaw.

Eh, it's late, I'll think about this more tomorrow.

> Since I think the goal is to prevent the kernel auto-freeze due to
> online fsck to complete, then I think you *do* want to prevent full
> system suspend from moving forward. In that case, why not just have
> the filesystem check for that and return -EBUSY on its respective
> filesystem sb->s_op->freeze_fs(sb) callback?

Well... either we forego the fscounters optimization, or I guess we
figure out how to take s_umount and then set a flag.

--D

>   Luis

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-05-18  6:07 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-03  3:02 [PATCHSET RFC v24.6 0/4] xfs: online repair for fs summary counters with exclusive fsfreeze Darrick J. Wong
2023-05-03  3:02 ` [PATCH 1/4] vfs: allow filesystem freeze callers to denote who froze the fs Darrick J. Wong
2023-05-07  5:23   ` Luis Chamberlain
2023-05-17 17:13     ` Shiyang Ruan
2023-05-18  6:07     ` Darrick J. Wong
2023-05-03  3:02 ` [PATCH 2/4] vfs: allow exclusive freezing of filesystems Darrick J. Wong
2023-05-03  3:02 ` [PATCH 3/4] xfs: stabilize fs summary counters for online fsck Darrick J. Wong
2023-05-03  3:02 ` [PATCH 4/4] xfs: repair summary counters Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).