* [PATCHSET v9 00/14] xfs: deferred inode inactivation
@ 2021-08-05  2:06 Darrick J. Wong
  2021-08-05  2:06 ` [PATCH 01/14] xfs: introduce CPU hotplug infrastructure Darrick J. Wong
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:06 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, david, hch

Hi all,

This patch series implements deferred inode inactivation.  Inactivation
is what happens when an open file loses its last incore reference: if
the file has speculative preallocations, they must be freed, and if the
file is unlinked, all forks must be truncated, and the inode marked
freed in the inode chunk and the inode btrees.

Currently, all of this activity is performed in frontend threads when
the last in-memory reference is lost and/or the vfs decides to drop the
inode.  Three complaints stem from this behavior: first, that the time
to unlink (in the worst case) depends on both the complexity of the
directory and the number of extents in that file; second,
that deleting a directory tree is inefficient and seeky because we free
the inodes in readdir order, not disk order; and third, the upcoming
online repair feature needs to be able to xfs_irele while scanning a
filesystem in transaction context.  It cannot perform inode inactivation
in this context because xfs does not support nested transactions.

The implementation will be familiar to those who have studied how XFS
scans for reclaimable in-core inodes -- we create a couple more inode
state flags to mark an inode as needing inactivation and being in the
middle of inactivation.  When an inode needs inactivation, we set
NEED_INACTIVE in its iflags and add it to a percpu work list.  Eventually, a
bound percpu workqueue item will be scheduled to perform all the
on-disk metadata updates.  Once the inode has been inactivated, it is
left in the reclaim state and the background reclaim worker (or direct
reclaim) will get to it eventually.
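
A stripped-down sketch of that flow, for illustration only: the real code
in patch 5 below also handles throttling, enable/disable state, freeze,
and CPU hotplug, and the helper names here are made up (only i_gclist and
XFS_NEED_INACTIVE come from the patches).

/* One queue per CPU; the real structure hangs off struct xfs_mount. */
struct inodegc_queue {
	struct llist_head	list;	/* lockless list of queued inodes */
	struct work_struct	work;	/* bound workqueue item */
	unsigned int		items;	/* approximate queue depth */
};

/* Frontend: called when the last reference to an inode goes away. */
static void inodegc_defer(struct xfs_inode *ip, struct inodegc_queue *gc,
		struct workqueue_struct *wq)
{
	xfs_iflags_set(ip, XFS_NEED_INACTIVE);
	llist_add(&ip->i_gclist, &gc->list);
	if (++gc->items > 32)			/* one inode cluster's worth */
		queue_work(wq, &gc->work);
}

/* Background: the work item runs on the CPU that queued the inodes. */
static void inodegc_worker(struct work_struct *work)
{
	struct inodegc_queue	*gc = container_of(work,
						struct inodegc_queue, work);
	struct llist_node	*node = llist_del_all(&gc->list);
	struct xfs_inode	*ip, *n;

	gc->items = 0;
	llist_for_each_entry_safe(ip, n, node, i_gclist) {
		xfs_inactive(ip);	/* perform the on-disk metadata updates */
		/* ...then mark the inode IRECLAIMABLE, as described above */
	}
}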

Doing the inactivations from kernel threads solves the first problem by
constraining the amount of work done by the unlink() call to removing
the directory entry.  It solves the third problem by moving inactivation
to a separate process.  Performing the inactivations in batches
decreases the amount of time it takes to let go of an inode cluster if
we're deleting entire directory trees.

There are three big warts I can think of in this series: first, because
the actual freeing of nlink==0 inodes is now done in the background,
the system will be busy making metadata updates for some
time after the unlink() call returns.  This temporarily reduces
available iops.  Second, in order to retain the behavior that deleting
100TB of unshared data should result in a free space gain of 100TB, the
statvfs and quota reporting ioctls wait for inactivation to finish,
which increases the long tail latency of those calls.  This behavior is,
unfortunately, key to not introducing regressions in fstests.  The third
problem is that the deferrals keep memory usage higher for longer.  The
final patch in the series (clumsily) addresses this by forcing the
inodegc workers to run when memory shrinkers get called and by
throttling the frontend xfs_inodegc_queue callers to wait for the
worker.
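
The basic shape of that throttle is already visible in patch 5 below; a
simplified fragment (the helper name is made up, the backlog constant is
the real one, i.e. four inode chunks or 256 inodes):

#define XFS_INODEGC_MAX_BACKLOG	(4 * XFS_INODES_PER_CHUNK)

/* If this CPU's queue is too deep, make the caller wait for the worker. */
static void inodegc_maybe_throttle(struct xfs_inodegc *gc, unsigned int items)
{
	if (items > XFS_INODEGC_MAX_BACKLOG)
		flush_work(&gc->work);	/* block until the worker drains it */
}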

v1-v2: NYE patchbombs
v3: rebase against 5.12-rc2 for submission.
v4: combine the can/has eofblocks predicates, clean up incore inode tree
    walks, fix inobt deadlock
v5: actually freeze the inode gc threads when we freeze the filesystem,
    consolidate the code that deals with inode tagging, and use
    foreground inactivation during quotaoff to avoid cycling dquots
v6: rebase to 5.13-rc4, fix quotaoff not to require foreground inactivation,
    refactor to use inode walk goals, use atomic bitflags to control the
    scheduling of gc workers
v7: simplify the inodegc worker, which simplifies how flushes work, break
    up the patch into smaller pieces, flush inactive inodes on syncfs to
    simplify freeze/ro-remount handling, separate inode selection filtering
    in iget, refactor inode recycling further, change gc delay to 100ms,
    decrease the gc delay when space or quota are low, move most of the
    destroy_inode logic to mark_reclaimable, get rid of the fallocate flush
    scan thing, get rid of polled flush mode
v8: rebase against 5.14-rc2, hook the memory shrinkers so that we requeue
    inactivation immediately when memory starts to get tight and force
    callers queueing inodes for inactivation to wait for the inactivation
    workers to run (i.e. throttling the frontend) to reduce memory storms,
    add hch's quotaoff removal series as a dependency to shut down arguments
    about quota walks
v9: replace the entire mechanism with percpu lists and workers, clean out
    a ton of ratty code that nobody liked anyway :P

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15
---
 fs/xfs/scrub/common.c      |   10 +
 fs/xfs/xfs_dquot.h         |   10 +
 fs/xfs/xfs_icache.c        |  592 ++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_icache.h        |    8 +
 fs/xfs/xfs_inode.c         |   53 ++++
 fs/xfs/xfs_inode.h         |   22 ++
 fs/xfs/xfs_itable.c        |   42 +++
 fs/xfs/xfs_iwalk.c         |   33 ++
 fs/xfs/xfs_log_recover.c   |    7 +
 fs/xfs/xfs_mount.c         |   57 +++-
 fs/xfs/xfs_mount.h         |   62 ++++-
 fs/xfs/xfs_qm.c            |   34 +++
 fs/xfs/xfs_qm_syscalls.c   |    8 +
 fs/xfs/xfs_quota.h         |    2 
 fs/xfs/xfs_super.c         |  253 ++++++++++++++-----
 fs/xfs/xfs_trace.h         |   93 +++++++
 fs/xfs/xfs_trans.c         |    5 
 include/linux/cpuhotplug.h |    1 
 18 files changed, 1164 insertions(+), 128 deletions(-)



* [PATCH 01/14] xfs: introduce CPU hotplug infrastructure
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-08-05  2:06 ` Darrick J. Wong
  2021-08-05  2:06 ` [PATCH 02/14] xfs: introduce all-mounts list for cpu hotplug notifications Darrick J. Wong
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:06 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs, david, hch

From: Dave Chinner <dchinner@redhat.com>

We need to move to per-cpu state for both deferred inode
inactivation and CIL tracking, but to do that we
need to handle CPUs being removed from the system by the hot-plug
code. Introduce generic XFS infrastructure to handle CPU hotplug
events that is set up at module init time and torn down at module
exit time.

Initially, we only need CPU dead notifications, so we only set
up a callback for these notifications. The infrastructure can easily be
updated in the future for other CPU hotplug state machine notifications
if ever needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: rearrange some macros, fix function prototypes]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_super.c         |   42 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/cpuhotplug.h |    1 +
 2 files changed, 42 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 36fc81e52dc2..d47fac7c8afd 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2111,6 +2111,39 @@ xfs_destroy_workqueues(void)
 	destroy_workqueue(xfs_alloc_wq);
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int
+xfs_cpu_dead(
+	unsigned int		cpu)
+{
+	return 0;
+}
+
+static int __init
+xfs_cpu_hotplug_init(void)
+{
+	int	error;
+
+	error = cpuhp_setup_state_nocalls(CPUHP_XFS_DEAD, "xfs:dead", NULL,
+			xfs_cpu_dead);
+	if (error < 0)
+		xfs_alert(NULL,
+"Failed to initialise CPU hotplug, error %d. XFS is non-functional.",
+			error);
+	return error;
+}
+
+static void
+xfs_cpu_hotplug_destroy(void)
+{
+	cpuhp_remove_state_nocalls(CPUHP_XFS_DEAD);
+}
+
+#else /* !CONFIG_HOTPLUG_CPU */
+static inline int xfs_cpu_hotplug_init(void) { return 0; }
+static inline void xfs_cpu_hotplug_destroy(void) {}
+#endif
+
 STATIC int __init
 init_xfs_fs(void)
 {
@@ -2123,9 +2156,13 @@ init_xfs_fs(void)
 
 	xfs_dir_startup();
 
+	error = xfs_cpu_hotplug_init();
+	if (error)
+		goto out;
+
 	error = xfs_init_zones();
 	if (error)
-		goto out;
+		goto out_destroy_hp;
 
 	error = xfs_init_workqueues();
 	if (error)
@@ -2206,6 +2243,8 @@ init_xfs_fs(void)
 	xfs_destroy_workqueues();
  out_destroy_zones:
 	xfs_destroy_zones();
+ out_destroy_hp:
+	xfs_cpu_hotplug_destroy();
  out:
 	return error;
 }
@@ -2228,6 +2267,7 @@ exit_xfs_fs(void)
 	xfs_destroy_workqueues();
 	xfs_destroy_zones();
 	xfs_uuid_table_free();
+	xfs_cpu_hotplug_destroy();
 }
 
 module_init(init_xfs_fs);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index f39b34b13871..439adc05be4e 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -52,6 +52,7 @@ enum cpuhp_state {
 	CPUHP_FS_BUFF_DEAD,
 	CPUHP_PRINTK_DEAD,
 	CPUHP_MM_MEMCQ_DEAD,
+	CPUHP_XFS_DEAD,
 	CPUHP_PERCPU_CNT_DEAD,
 	CPUHP_RADIX_DEAD,
 	CPUHP_PAGE_ALLOC,



* [PATCH 02/14] xfs: introduce all-mounts list for cpu hotplug notifications
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
  2021-08-05  2:06 ` [PATCH 01/14] xfs: introduce CPU hotplug infrastructure Darrick J. Wong
@ 2021-08-05  2:06 ` Darrick J. Wong
  2021-08-05  2:06 ` [PATCH 03/14] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:06 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs, david, hch

From: Dave Chinner <dchinner@redhat.com>

The inode inactivation and CIL tracking percpu structures are
per-xfs_mount structures. That means when we get a CPU dead
notification, we need to then iterate all the per-cpu structure
instances to process them. Rather than keeping linked lists of
per-cpu structures in each subsystem, add a list of all xfs_mounts
that the generic xfs_cpu_dead() function will iterate and call into
each subsystem appropriately.

This allows us to handle both per-mount and global XFS percpu state
from xfs_cpu_dead(), and avoids the need to link subsystem
structures that can be easily found from the xfs_mount into their
own global lists.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: expand some comments about mount list setup ordering rules]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c78b63fe779a..ed7064596f94 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -82,6 +82,7 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+	struct list_head	m_mount_list;	/* global mount list */
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d47fac7c8afd..c2c9c02b9d62 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -49,6 +49,28 @@ static struct kset *xfs_kset;		/* top-level xfs sysfs dir */
 static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
 #endif
 
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(xfs_mount_list);
+static DEFINE_SPINLOCK(xfs_mount_list_lock);
+
+static inline void xfs_mount_list_add(struct xfs_mount *mp)
+{
+	spin_lock(&xfs_mount_list_lock);
+	list_add(&mp->m_mount_list, &xfs_mount_list);
+	spin_unlock(&xfs_mount_list_lock);
+}
+
+static inline void xfs_mount_list_del(struct xfs_mount *mp)
+{
+	spin_lock(&xfs_mount_list_lock);
+	list_del(&mp->m_mount_list);
+	spin_unlock(&xfs_mount_list_lock);
+}
+#else /* !CONFIG_HOTPLUG_CPU */
+static inline void xfs_mount_list_add(struct xfs_mount *mp) {}
+static inline void xfs_mount_list_del(struct xfs_mount *mp) {}
+#endif
+
 enum xfs_dax_mode {
 	XFS_DAX_INODE = 0,
 	XFS_DAX_ALWAYS = 1,
@@ -1038,6 +1060,7 @@ xfs_fs_put_super(
 
 	xfs_freesb(mp);
 	free_percpu(mp->m_stats.xs_stats);
+	xfs_mount_list_del(mp);
 	xfs_destroy_percpu_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
@@ -1409,6 +1432,13 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_destroy_workqueues;
 
+	/*
+	 * All percpu data structures requiring cleanup when a cpu goes offline
+	 * must be allocated before adding this @mp to the cpu-dead handler's
+	 * mount list.
+	 */
+	xfs_mount_list_add(mp);
+
 	/* Allocate stats memory before we do operations that might use it */
 	mp->m_stats.xs_stats = alloc_percpu(struct xfsstats);
 	if (!mp->m_stats.xs_stats) {
@@ -1617,6 +1647,7 @@ xfs_fs_fill_super(
  out_free_stats:
 	free_percpu(mp->m_stats.xs_stats);
  out_destroy_counters:
+	xfs_mount_list_del(mp);
 	xfs_destroy_percpu_counters(mp);
  out_destroy_workqueues:
 	xfs_destroy_mount_workqueues(mp);
@@ -2116,6 +2147,15 @@ static int
 xfs_cpu_dead(
 	unsigned int		cpu)
 {
+	struct xfs_mount	*mp, *n;
+
+	spin_lock(&xfs_mount_list_lock);
+	list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
+		spin_unlock(&xfs_mount_list_lock);
+		/* xfs_subsys_dead(mp, cpu); */
+		spin_lock(&xfs_mount_list_lock);
+	}
+	spin_unlock(&xfs_mount_list_lock);
 	return 0;
 }
 



* [PATCH 03/14] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
  2021-08-05  2:06 ` [PATCH 01/14] xfs: introduce CPU hotplug infrastructure Darrick J. Wong
  2021-08-05  2:06 ` [PATCH 02/14] xfs: introduce all-mounts list for cpu hotplug notifications Darrick J. Wong
@ 2021-08-05  2:06 ` Darrick J. Wong
  2021-08-05  5:29   ` Dave Chinner
  2021-08-05  2:06 ` [PATCH 04/14] xfs: detach dquots from inode if we don't need to inactivate it Darrick J. Wong
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Move the xfs_inactive call and all the other debugging checks and stats
updates into xfs_inode_mark_reclaimable because most of that is
implementation detail of the inode cache.  We also move the function
within xfs_icache.c.  Both changes are preparation for the deferred
inactivation work that follows.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   99 ++++++++++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_super.c  |   50 --------------------------
 2 files changed, 74 insertions(+), 75 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 086a88b8dfdb..f0e77ed0b8bb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -292,31 +292,6 @@ xfs_perag_clear_inode_tag(
 	trace_xfs_perag_clear_inode_tag(mp, pag->pag_agno, tag, _RET_IP_);
 }
 
-/*
- * We set the inode flag atomically with the radix tree tag.
- * Once we get tag lookups on the radix tree, this inode flag
- * can go away.
- */
-void
-xfs_inode_mark_reclaimable(
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	spin_lock(&pag->pag_ici_lock);
-	spin_lock(&ip->i_flags_lock);
-
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
-
-	spin_unlock(&ip->i_flags_lock);
-	spin_unlock(&pag->pag_ici_lock);
-	xfs_perag_put(pag);
-}
-
 static inline void
 xfs_inew_wait(
 	struct xfs_inode	*ip)
@@ -1739,3 +1714,77 @@ xfs_icwalk(
 	return last_error;
 	BUILD_BUG_ON(XFS_ICWALK_PRIVATE_FLAGS & XFS_ICWALK_FLAGS_VALID);
 }
+
+#ifdef DEBUG
+static void
+xfs_check_delalloc(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	struct xfs_bmbt_irec	got;
+	struct xfs_iext_cursor	icur;
+
+	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
+		return;
+	do {
+		if (isnullstartblock(got.br_startblock)) {
+			xfs_warn(ip->i_mount,
+	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
+				ip->i_ino,
+				whichfork == XFS_DATA_FORK ? "data" : "cow",
+				got.br_startoff, got.br_blockcount);
+		}
+	} while (xfs_iext_next_extent(ifp, &icur, &got));
+}
+#else
+#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
+#endif
+
+/*
+ * We set the inode flag atomically with the radix tree tag.
+ * Once we get tag lookups on the radix tree, this inode flag
+ * can go away.
+ */
+void
+xfs_inode_mark_reclaimable(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+
+	xfs_inactive(ip);
+
+	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
+		xfs_check_delalloc(ip, XFS_DATA_FORK);
+		xfs_check_delalloc(ip, XFS_COW_FORK);
+		ASSERT(0);
+	}
+
+	XFS_STATS_INC(mp, vn_reclaim);
+
+	/*
+	 * We should never get here with one of the reclaim flags already set.
+	 */
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+
+	/*
+	 * We always use background reclaim here because even if the inode is
+	 * clean, it still may be under IO and hence we have wait for IO
+	 * completion to occur before we can reclaim the inode. The background
+	 * reclaim path handles this more efficiently than we can here, so
+	 * simply let background reclaim tear down all inodes.
+	 */
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
+			XFS_ICI_RECLAIM_TAG);
+	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
+}
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c2c9c02b9d62..e0b97e4c8e16 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -613,32 +613,6 @@ xfs_fs_alloc_inode(
 	return NULL;
 }
 
-#ifdef DEBUG
-static void
-xfs_check_delalloc(
-	struct xfs_inode	*ip,
-	int			whichfork)
-{
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
-	struct xfs_bmbt_irec	got;
-	struct xfs_iext_cursor	icur;
-
-	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
-		return;
-	do {
-		if (isnullstartblock(got.br_startblock)) {
-			xfs_warn(ip->i_mount,
-	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
-				ip->i_ino,
-				whichfork == XFS_DATA_FORK ? "data" : "cow",
-				got.br_startoff, got.br_blockcount);
-		}
-	} while (xfs_iext_next_extent(ifp, &icur, &got));
-}
-#else
-#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
-#endif
-
 /*
  * Now that the generic code is guaranteed not to be accessing
  * the linux inode, we can inactivate and reclaim the inode.
@@ -654,30 +628,6 @@ xfs_fs_destroy_inode(
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	XFS_STATS_INC(ip->i_mount, vn_rele);
 	XFS_STATS_INC(ip->i_mount, vn_remove);
-
-	xfs_inactive(ip);
-
-	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
-		xfs_check_delalloc(ip, XFS_DATA_FORK);
-		xfs_check_delalloc(ip, XFS_COW_FORK);
-		ASSERT(0);
-	}
-
-	XFS_STATS_INC(ip->i_mount, vn_reclaim);
-
-	/*
-	 * We should never get here with one of the reclaim flags already set.
-	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
-
-	/*
-	 * We always use background reclaim here because even if the inode is
-	 * clean, it still may be under IO and hence we have wait for IO
-	 * completion to occur before we can reclaim the inode. The background
-	 * reclaim path handles this more efficiently than we can here, so
-	 * simply let background reclaim tear down all inodes.
-	 */
 	xfs_inode_mark_reclaimable(ip);
 }
 



* [PATCH 04/14] xfs: detach dquots from inode if we don't need to inactivate it
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-08-05  2:06 ` [PATCH 03/14] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
@ 2021-08-05  2:06 ` Darrick J. Wong
  2021-08-05  5:30   ` Dave Chinner
  2021-08-05  2:06 ` [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues Darrick J. Wong
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

If we don't need to inactivate an inode, we can detach the dquots and
move on to reclamation.  This isn't strictly required here; it's a
preparation patch for deferred inactivation per reviewer request[1] to
move the creation of xfs_inode_needs_inactive into a separate
change.  Eventually this !need_inactive chunk will turn into the code
path for inodes that skip xfs_inactive and go straight to memory
reclaim.

[1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    8 +++++++-
 fs/xfs/xfs_inode.c  |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h  |    2 ++
 3 files changed, 62 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f0e77ed0b8bb..b9214733d0c3 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1752,8 +1752,14 @@ xfs_inode_mark_reclaimable(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
+	bool			need_inactive = xfs_inode_needs_inactive(ip);
 
-	xfs_inactive(ip);
+	if (!need_inactive) {
+		/* Going straight to reclaim, so drop the dquots. */
+		xfs_qm_dqdetach(ip);
+	} else {
+		xfs_inactive(ip);
+	}
 
 	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 990b72ae3635..3c6ce1f6f643 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1654,6 +1654,59 @@ xfs_inactive_ifree(
 	return 0;
 }
 
+/*
+ * Returns true if we need to update the on-disk metadata before we can free
+ * the memory used by this inode.  Updates include freeing post-eof
+ * preallocations; freeing COW staging extents; and marking the inode free in
+ * the inobt if it is on the unlinked list.
+ */
+bool
+xfs_inode_needs_inactive(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*cow_ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+
+	/*
+	 * If the inode is already free, then there can be nothing
+	 * to clean up here.
+	 */
+	if (VFS_I(ip)->i_mode == 0)
+		return false;
+
+	/* If this is a read-only mount, don't do this (would generate I/O) */
+	if (mp->m_flags & XFS_MOUNT_RDONLY)
+		return false;
+
+	/* If the log isn't running, push inodes straight to reclaim. */
+	if (XFS_FORCED_SHUTDOWN(mp) || (mp->m_flags & XFS_MOUNT_NORECOVERY))
+		return false;
+
+	/* Metadata inodes require explicit resource cleanup. */
+	if (xfs_is_metadata_inode(ip))
+		return false;
+
+	/* Want to clean out the cow blocks if there are any. */
+	if (cow_ifp && cow_ifp->if_bytes > 0)
+		return true;
+
+	/* Unlinked files must be freed. */
+	if (VFS_I(ip)->i_nlink == 0)
+		return true;
+
+	/*
+	 * This file isn't being freed, so check if there are post-eof blocks
+	 * to free.  @force is true because we are evicting an inode from the
+	 * cache.  Post-eof blocks must be freed, lest we end up with broken
+	 * free space accounting.
+	 *
+	 * Note: don't bother with iolock here since lockdep complains about
+	 * acquiring it in reclaim context. We have the only reference to the
+	 * inode at this point anyways.
+	 */
+	return xfs_can_free_eofblocks(ip, true);
+}
+
 /*
  * xfs_inactive
  *
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 4b6703dbffb8..e3137bbc7b14 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -493,6 +493,8 @@ extern struct kmem_zone	*xfs_inode_zone;
 /* The default CoW extent size hint. */
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
+bool xfs_inode_needs_inactive(struct xfs_inode *ip);
+
 int xfs_iunlink_init(struct xfs_perag *pag);
 void xfs_iunlink_destroy(struct xfs_perag *pag);
 



* [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-08-05  2:06 ` [PATCH 04/14] xfs: detach dquots from inode if we don't need to inactivate it Darrick J. Wong
@ 2021-08-05  2:06 ` Darrick J. Wong
  2021-08-05  6:43   ` Dave Chinner
  2021-08-07  0:21   ` Darrick J. Wong
  2021-08-05  2:06 ` [PATCH 06/14] xfs: queue inactivation immediately when free space is tight Darrick J. Wong
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:06 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs, david, hch

From: Dave Chinner <dchinner@redhat.com>

Move inode inactivation to background work contexts so that it no
longer runs in the context that releases the final reference to an
inode. This will allow process work that ends up blocking on
inactivation to continue doing work while the filesystem processes
the inactivation in the background.

A typical demonstration of this is unlinking an inode with lots of
extents. The extents are removed during inactivation, so this blocks
the process that unlinked the inode from the directory structure. By
moving the inactivation to the background process, the userspace
application can keep working (e.g. unlinking the next inode in the
directory) while the inactivation work on the previous inode is
done by a different CPU.

The implementation of the queue is relatively simple. We use a
per-cpu lockless linked list (llist) to queue inodes for
inactivation without requiring serialisation mechanisms, and a work
item to allow the queue to be processed by a CPU bound worker
thread. We also keep a count of the queue depth so that we can
trigger work after a number of deferred inactivations have been
queued.

The use of a bound workqueue with a single work depth allows the
workqueue to run one work item per CPU. We queue the work item on
the CPU we are currently running on, and so this essentially gives
us affine per-cpu worker threads for the per-cpu queues. This
maintains the effective CPU affinity that occurs within XFS at the
AG level due to all objects in a directory being local to an AG.
Hence inactivation work tends to run on the same CPU that last
accessed all the objects that inactivation accesses and this
maintains hot CPU caches for unlink workloads.
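
In workqueue terms, "bound with a single work depth" means the queue is
allocated without WQ_UNBOUND and with max_active of 1, as the patch does
(fragment shown without the XFS_WQFLAGS wrapper used in the real code):

	/*
	 * Bound (per-CPU) workqueue with at most one work item running
	 * per CPU; work queued from a CPU is run on that same CPU.
	 */
	mp->m_inodegc_wq = alloc_workqueue("xfs-inodegc/%s",
			WQ_FREEZABLE | WQ_MEM_RECLAIM,	/* no WQ_UNBOUND */
			1,				/* max_active */
			mp->m_super->s_id);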

A depth of 32 inodes was chosen to match the number of inodes in an
inode cluster buffer. This hopefully allows sequential
allocation/unlink behaviours to defer inactivation of all the
inodes in a single cluster buffer at a time, further helping
maintain hot CPU and buffer cache accesses while running
inactivations.

A hard per-cpu queue throttle of 256 inodes has been set to avoid
runaway queuing when inodes that take a long time to inactivate are
being processed. For example, when unlinking inodes with large
numbers of extents that can take a lot of processing to free.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: tweak comments and tracepoints, convert opflags to state bits]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c    |    7 +
 fs/xfs/xfs_icache.c      |  354 +++++++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_icache.h      |    6 +
 fs/xfs/xfs_inode.h       |   20 ++-
 fs/xfs/xfs_log_recover.c |    7 +
 fs/xfs/xfs_mount.c       |   26 +++
 fs/xfs/xfs_mount.h       |   38 +++++
 fs/xfs/xfs_super.c       |  117 ++++++++++++++-
 fs/xfs/xfs_trace.h       |   53 +++++++
 9 files changed, 576 insertions(+), 52 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 8558ca05e11d..06b697f72f23 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -884,6 +884,7 @@ xchk_stop_reaping(
 {
 	sc->flags |= XCHK_REAPING_DISABLED;
 	xfs_blockgc_stop(sc->mp);
+	xfs_inodegc_stop(sc->mp);
 }
 
 /* Restart background reaping of resources. */
@@ -891,6 +892,12 @@ void
 xchk_start_reaping(
 	struct xfs_scrub	*sc)
 {
+	/*
+	 * Readonly filesystems do not perform inactivation, so there's no
+	 * need to restart the worker.
+	 */
+	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY))
+		xfs_inodegc_start(sc->mp);
 	xfs_blockgc_start(sc->mp);
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index b9214733d0c3..fedfa40e3cd6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -213,7 +213,7 @@ xfs_blockgc_queue(
 {
 	rcu_read_lock();
 	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG))
-		queue_delayed_work(pag->pag_mount->m_gc_workqueue,
+		queue_delayed_work(pag->pag_mount->m_blockgc_wq,
 				   &pag->pag_blockgc_work,
 				   msecs_to_jiffies(xfs_blockgc_secs * 1000));
 	rcu_read_unlock();
@@ -450,6 +450,21 @@ xfs_iget_check_free_state(
 	return 0;
 }
 
+/* Make all pending inactivation work start immediately. */
+static void
+xfs_inodegc_queue_all(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		if (!llist_empty(&gc->list))
+			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+	}
+}
+
 /*
  * Check the validity of the inode we just found it the cache
  */
@@ -482,13 +497,30 @@ xfs_iget_cache_hit(
 	 * reclaimable state, wait for the initialisation to complete
 	 * before continuing.
 	 *
+	 * If we're racing with the inactivation worker we also want to wait.
+	 * If we're creating a new file, it's possible that the worker
+	 * previously marked the inode as free on disk but hasn't finished
+	 * updating the incore state yet.  The AGI buffer will be dirty and
+	 * locked to the icreate transaction, so a synchronous push of the
+	 * inodegc workers would result in deadlock.  For a regular iget, the
+	 * worker is running already, so we might as well wait.
+	 *
 	 * XXX(hch): eventually we should do something equivalent to
 	 *	     wait_on_inode to wait for these flags to be cleared
 	 *	     instead of polling for it.
 	 */
-	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM))
+	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING))
 		goto out_skip;
 
+	if (ip->i_flags & XFS_NEED_INACTIVE) {
+		/* Unlinked inodes cannot be re-grabbed. */
+		if (VFS_I(ip)->i_nlink == 0) {
+			error = -ENOENT;
+			goto out_error;
+		}
+		goto out_inodegc_flush;
+	}
+
 	/*
 	 * Check the inode free state is valid. This also detects lookup
 	 * racing with unlinks.
@@ -536,6 +568,17 @@ xfs_iget_cache_hit(
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 	return error;
+
+out_inodegc_flush:
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
+	/*
+	 * Do not wait for the workers, because the caller could hold an AGI
+	 * buffer lock.  We're just going to sleep in a loop anyway.
+	 */
+	if (xfs_is_inodegc_enabled(mp))
+		xfs_inodegc_queue_all(mp);
+	return -EAGAIN;
 }
 
 static int
@@ -863,6 +906,7 @@ xfs_reclaim_inode(
 
 	xfs_iflags_clear(ip, XFS_IFLUSHING);
 reclaim:
+	trace_xfs_inode_reclaiming(ip);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1340,6 +1384,8 @@ xfs_blockgc_start(
 
 /* Don't try to run block gc on an inode that's in any of these states. */
 #define XFS_BLOCKGC_NOGRAB_IFLAGS	(XFS_INEW | \
+					 XFS_NEED_INACTIVE | \
+					 XFS_INACTIVATING | \
 					 XFS_IRECLAIMABLE | \
 					 XFS_IRECLAIM)
 /*
@@ -1741,25 +1787,13 @@ xfs_check_delalloc(
 #define xfs_check_delalloc(ip, whichfork)	do { } while (0)
 #endif
 
-/*
- * We set the inode flag atomically with the radix tree tag.
- * Once we get tag lookups on the radix tree, this inode flag
- * can go away.
- */
-void
-xfs_inode_mark_reclaimable(
+/* Schedule the inode for reclaim. */
+static void
+xfs_inodegc_set_reclaimable(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
-	bool			need_inactive = xfs_inode_needs_inactive(ip);
-
-	if (!need_inactive) {
-		/* Going straight to reclaim, so drop the dquots. */
-		xfs_qm_dqdetach(ip);
-	} else {
-		xfs_inactive(ip);
-	}
 
 	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
@@ -1767,30 +1801,276 @@ xfs_inode_mark_reclaimable(
 		ASSERT(0);
 	}
 
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	trace_xfs_inode_set_reclaimable(ip);
+	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
+	ip->i_flags |= XFS_IRECLAIMABLE;
+	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
+			XFS_ICI_RECLAIM_TAG);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
+}
+
+/*
+ * Free all speculative preallocations and possibly even the inode itself.
+ * This is the last chance to make changes to an otherwise unreferenced file
+ * before incore reclamation happens.
+ */
+static void
+xfs_inodegc_inactivate(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount        *mp = ip->i_mount;
+
+	/*
+	* Inactivation isn't supposed to run when the fs is frozen because
+	* we don't want kernel threads to block on transaction allocation.
+	*/
+	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
+
+	trace_xfs_inode_inactivating(ip);
+	xfs_inactive(ip);
+	xfs_inodegc_set_reclaimable(ip);
+}
+
+void
+xfs_inodegc_worker(
+	struct work_struct	*work)
+{
+	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
+							work);
+	struct llist_node	*node = llist_del_all(&gc->list);
+	struct xfs_inode	*ip, *n;
+
+	WRITE_ONCE(gc->items, 0);
+
+	if (!node)
+		return;
+
+	ip = llist_entry(node, struct xfs_inode, i_gclist);
+	trace_xfs_inodegc_worker(ip->i_mount, __return_address);
+
+	llist_for_each_entry_safe(ip, n, node, i_gclist) {
+		xfs_iflags_set(ip, XFS_INACTIVATING);
+		xfs_inodegc_inactivate(ip);
+	}
+}
+
+/*
+ * Force all currently queued inode inactivation work to run immediately, and
+ * wait for the work to finish. Two pass - queue all the work first pass, wait
+ * for it in a second pass.
+ */
+void
+xfs_inodegc_flush(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	if (!xfs_is_inodegc_enabled(mp))
+		return;
+
+	trace_xfs_inodegc_flush(mp, __return_address);
+
+	xfs_inodegc_queue_all(mp);
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		flush_work(&gc->work);
+	}
+}
+
+/*
+ * Flush all the pending work and then disable the inode inactivation background
+ * workers and wait for them to stop.
+ */
+void
+xfs_inodegc_stop(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	if (!xfs_clear_inodegc_enabled(mp))
+		return;
+
+	xfs_inodegc_queue_all(mp);
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		cancel_work_sync(&gc->work);
+	}
+	trace_xfs_inodegc_stop(mp, __return_address);
+}
+
+/*
+ * Enable the inode inactivation background workers and schedule deferred inode
+ * inactivation work if there is any.
+ */
+void
+xfs_inodegc_start(
+	struct xfs_mount	*mp)
+{
+	if (xfs_set_inodegc_enabled(mp))
+		return;
+
+	trace_xfs_inodegc_start(mp, __return_address);
+	xfs_inodegc_queue_all(mp);
+}
+
+/*
+ * Schedule the inactivation worker when:
+ *
+ *  - We've accumulated more than one inode cluster buffer's worth of inodes.
+ */
+static inline bool
+xfs_inodegc_want_queue_work(
+	struct xfs_inode	*ip,
+	unsigned int		items)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (items > mp->m_ino_geo.inodes_per_cluster)
+		return true;
+
+	return false;
+}
+
+/*
+ * Upper bound on the number of inodes in each AG that can be queued for
+ * inactivation at any given time, to avoid monopolizing the workqueue.
+ */
+#define XFS_INODEGC_MAX_BACKLOG		(4 * XFS_INODES_PER_CHUNK)
+
+/*
+ * Make the frontend wait for inactivations when:
+ *
+ *  - The queue depth exceeds the maximum allowable percpu backlog.
+ */
+static inline bool
+xfs_inodegc_want_flush_work(
+	struct xfs_inode	*ip,
+	unsigned int		items)
+{
+	if (items > XFS_INODEGC_MAX_BACKLOG)
+		return true;
+
+	return false;
+}
+
+/*
+ * Queue a background inactivation worker if there are inodes that need to be
+ * inactivated and higher level xfs code hasn't disabled the background
+ * workers.
+ */
+static void
+xfs_inodegc_queue(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_inodegc	*gc;
+	int			items;
+
+	trace_xfs_inode_set_need_inactive(ip);
+	spin_lock(&ip->i_flags_lock);
+	ip->i_flags |= XFS_NEED_INACTIVE;
+	spin_unlock(&ip->i_flags_lock);
+
+	gc = get_cpu_ptr(mp->m_inodegc);
+	llist_add(&ip->i_gclist, &gc->list);
+	items = READ_ONCE(gc->items);
+	WRITE_ONCE(gc->items, items + 1);
+	put_cpu_ptr(gc);
+
+	if (!xfs_is_inodegc_enabled(mp))
+		return;
+
+	if (xfs_inodegc_want_queue_work(ip, items)) {
+		trace_xfs_inodegc_queue(mp, __return_address);
+		queue_work(mp->m_inodegc_wq, &gc->work);
+	}
+
+	if (xfs_inodegc_want_flush_work(ip, items)) {
+		trace_xfs_inodegc_throttle(mp, __return_address);
+		flush_work(&gc->work);
+	}
+}
+
+/*
+ * Fold the dead CPU inodegc queue into the current CPUs queue.
+ */
+void
+xfs_inodegc_cpu_dead(
+	struct xfs_mount	*mp,
+	unsigned int		dead_cpu)
+{
+	struct xfs_inodegc	*dead_gc, *gc;
+	struct llist_node	*first, *last;
+	unsigned int		count = 0;
+
+	dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
+	cancel_work_sync(&dead_gc->work);
+
+	if (llist_empty(&dead_gc->list))
+		return;
+
+	first = dead_gc->list.first;
+	last = first;
+	while (last->next) {
+		last = last->next;
+		count++;
+	}
+	dead_gc->list.first = NULL;
+	dead_gc->items = 0;
+
+	/* Add pending work to current CPU */
+	gc = get_cpu_ptr(mp->m_inodegc);
+	llist_add_batch(first, last, &gc->list);
+	count += READ_ONCE(gc->items);
+	WRITE_ONCE(gc->items, count);
+	put_cpu_ptr(gc);
+
+	trace_xfs_inodegc_queue(mp, __return_address);
+	queue_work(mp->m_inodegc_wq, &gc->work);
+}
+
+/*
+ * We set the inode flag atomically with the radix tree tag.  Once we get tag
+ * lookups on the radix tree, this inode flag can go away.
+ *
+ * We always use background reclaim here because even if the inode is clean, it
+ * still may be under IO and hence we have wait for IO completion to occur
+ * before we can reclaim the inode. The background reclaim path handles this
+ * more efficiently than we can here, so simply let background reclaim tear down
+ * all inodes.
+ */
+void
+xfs_inode_mark_reclaimable(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	bool			need_inactive;
+
 	XFS_STATS_INC(mp, vn_reclaim);
 
 	/*
-	 * We should never get here with one of the reclaim flags already set.
+	 * We should never get here with any of the reclaim flags already set.
 	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
 
-	/*
-	 * We always use background reclaim here because even if the inode is
-	 * clean, it still may be under IO and hence we have wait for IO
-	 * completion to occur before we can reclaim the inode. The background
-	 * reclaim path handles this more efficiently than we can here, so
-	 * simply let background reclaim tear down all inodes.
-	 */
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	spin_lock(&pag->pag_ici_lock);
-	spin_lock(&ip->i_flags_lock);
-
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+	need_inactive = xfs_inode_needs_inactive(ip);
+	if (need_inactive) {
+		xfs_inodegc_queue(ip);
+		return;
+	}
 
-	spin_unlock(&ip->i_flags_lock);
-	spin_unlock(&pag->pag_ici_lock);
-	xfs_perag_put(pag);
+	/* Going straight to reclaim, so drop the dquots. */
+	xfs_qm_dqdetach(ip);
+	xfs_inodegc_set_reclaimable(ip);
 }
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index d0062ebb3f7a..8175148afd50 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -74,4 +74,10 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
+void xfs_inodegc_worker(struct work_struct *work);
+void xfs_inodegc_flush(struct xfs_mount *mp);
+void xfs_inodegc_stop(struct xfs_mount *mp);
+void xfs_inodegc_start(struct xfs_mount *mp);
+void xfs_inodegc_cpu_dead(struct xfs_mount *mp, unsigned int cpu);
+
 #endif
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e3137bbc7b14..1f62b481d8c5 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -42,6 +42,7 @@ typedef struct xfs_inode {
 	mrlock_t		i_lock;		/* inode lock */
 	mrlock_t		i_mmaplock;	/* inode mmap IO lock */
 	atomic_t		i_pincount;	/* inode pin count */
+	struct llist_node	i_gclist;	/* deferred inactivation list */
 
 	/*
 	 * Bitsets of inode metadata that have been checked and/or are sick.
@@ -240,6 +241,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
+#define XFS_NEED_INACTIVE	(1 << 10) /* see XFS_INACTIVATING below */
 /*
  * If this unlinked inode is in the middle of recovery, don't let drop_inode
  * truncate and free the inode.  This can happen if we iget the inode during
@@ -248,6 +250,21 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define XFS_IRECOVERY		(1 << 11)
 #define XFS_ICOWBLOCKS		(1 << 12)/* has the cowblocks tag set */
 
+/*
+ * If we need to update on-disk metadata before this IRECLAIMABLE inode can be
+ * freed, then NEED_INACTIVE will be set.  Once we start the updates, the
+ * INACTIVATING bit will be set to keep iget away from this inode.  After the
+ * inactivation completes, both flags will be cleared and the inode is a
+ * plain old IRECLAIMABLE inode.
+ */
+#define XFS_INACTIVATING	(1 << 13)
+
+/* All inode state flags related to inode reclaim. */
+#define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
+				 XFS_IRECLAIM | \
+				 XFS_NEED_INACTIVE | \
+				 XFS_INACTIVATING)
+
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
  * inode lookup. This prevents unintended behaviour on the new inode from
@@ -255,7 +272,8 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
-	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
+	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
+	 XFS_INACTIVATING)
 
 /*
  * Flags for inode locking.
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1721fce2ec94..a98d2429d795 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2786,6 +2786,13 @@ xlog_recover_process_iunlinks(
 		}
 		xfs_buf_rele(agibp);
 	}
+
+	/*
+	 * Flush the pending unlinked inodes to ensure that the inactivations
+	 * are fully completed on disk and the incore inodes can be reclaimed
+	 * before we signal that recovery is complete.
+	 */
+	xfs_inodegc_flush(mp);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index baf7b323cb15..1f7e9a608f38 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -514,7 +514,8 @@ xfs_check_summary_counts(
  * Flush and reclaim dirty inodes in preparation for unmount. Inodes and
  * internal inode structures can be sitting in the CIL and AIL at this point,
  * so we need to unpin them, write them back and/or reclaim them before unmount
- * can proceed.
+ * can proceed.  In other words, callers are required to have inactivated all
+ * inodes.
  *
  * An inode cluster that has been freed can have its buffer still pinned in
  * memory because the transaction is still sitting in a iclog. The stale inodes
@@ -546,6 +547,7 @@ xfs_unmount_flush_inodes(
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
 
 	xfs_ail_push_all_sync(mp->m_ail);
+	xfs_inodegc_stop(mp);
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
@@ -782,6 +784,9 @@ xfs_mountfs(
 	if (error)
 		goto out_log_dealloc;
 
+	/* Enable background inode inactivation workers. */
+	xfs_inodegc_start(mp);
+
 	/*
 	 * Get and sanity-check the root inode.
 	 * Save the pointer to it in the mount structure.
@@ -942,6 +947,15 @@ xfs_mountfs(
 	xfs_irele(rip);
 	/* Clean out dquots that might be in memory after quotacheck. */
 	xfs_qm_unmount(mp);
+
+	/*
+	 * Inactivate all inodes that might still be in memory after a log
+	 * intent recovery failure so that reclaim can free them.  Metadata
+	 * inodes and the root directory shouldn't need inactivation, but the
+	 * mount failed for some reason, so pull down all the state and flee.
+	 */
+	xfs_inodegc_flush(mp);
+
 	/*
 	 * Flush all inode reclamation work and flush the log.
 	 * We have to do this /after/ rtunmount and qm_unmount because those
@@ -989,6 +1003,16 @@ xfs_unmountfs(
 	uint64_t		resblks;
 	int			error;
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes
+	 * that the VFS evicted earlier in the unmount process.  Freeing inodes
+	 * and discarding CoW fork preallocations can cause shape changes to
+	 * the free inode and refcount btrees, respectively, so we must finish
+	 * this before we discard the metadata space reservations.  Metadata
+	 * inodes and the root directory do not require inactivation.
+	 */
+	xfs_inodegc_flush(mp);
+
 	xfs_blockgc_stop(mp);
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index ed7064596f94..03d59a023bbb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -56,6 +56,15 @@ struct xfs_error_cfg {
 	long		retry_timeout;	/* in jiffies, -1 = infinite */
 };
 
+/*
+ * Per-cpu deferred inode inactivation GC lists.
+ */
+struct xfs_inodegc {
+	struct llist_head	list;
+	struct work_struct	work;
+	unsigned int		items;
+};
+
 /*
  * The struct xfsmount layout is optimised to separate read-mostly variables
  * from variables that are frequently modified. We put the read-mostly variables
@@ -83,6 +92,8 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
 	struct list_head	m_mount_list;	/* global mount list */
+	void __percpu		*m_inodegc;	/* percpu inodegc structures */
+
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
@@ -95,8 +106,9 @@ typedef struct xfs_mount {
 	struct workqueue_struct	*m_unwritten_workqueue;
 	struct workqueue_struct	*m_cil_workqueue;
 	struct workqueue_struct	*m_reclaim_workqueue;
-	struct workqueue_struct *m_gc_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
+	struct workqueue_struct *m_blockgc_wq;
+	struct workqueue_struct *m_inodegc_wq;
 
 	int			m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
@@ -137,6 +149,7 @@ typedef struct xfs_mount {
 	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
 						/* low free space thresholds */
+	unsigned long		m_opstate;	/* dynamic state flags */
 	bool			m_always_cow;
 	bool			m_fail_unmount;
 	bool			m_finobt_nores; /* no per-AG finobt resv. */
@@ -259,6 +272,29 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_DAX_ALWAYS	(1ULL << 26)
 #define XFS_MOUNT_DAX_NEVER	(1ULL << 27)
 
+/*
+ * If set, inactivation worker threads will be scheduled to process queued
+ * inodegc work.  If not, queued inodes remain in memory waiting to be
+ * processed.
+ */
+#define XFS_STATE_INODEGC_ENABLED	0
+
+#define __XFS_IS_STATE(name, NAME) \
+static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
+{ \
+	return test_bit(XFS_STATE_ ## NAME, &mp->m_opstate); \
+} \
+static inline bool xfs_clear_ ## name (struct xfs_mount *mp) \
+{ \
+	return test_and_clear_bit(XFS_STATE_ ## NAME, &mp->m_opstate); \
+} \
+static inline bool xfs_set_ ## name (struct xfs_mount *mp) \
+{ \
+	return test_and_set_bit(XFS_STATE_ ## NAME, &mp->m_opstate); \
+}
+
+__XFS_IS_STATE(inodegc_enabled, INODEGC_ENABLED)
+
 /*
  * Max and min values for mount-option defined I/O
  * preallocation sizes.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e0b97e4c8e16..ade192ea4230 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -530,21 +530,29 @@ xfs_init_mount_workqueues(
 	if (!mp->m_reclaim_workqueue)
 		goto out_destroy_cil;
 
-	mp->m_gc_workqueue = alloc_workqueue("xfs-gc/%s",
-			WQ_SYSFS | WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM,
+	mp->m_blockgc_wq = alloc_workqueue("xfs-blockgc/%s",
+			XFS_WQFLAGS(WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM),
 			0, mp->m_super->s_id);
-	if (!mp->m_gc_workqueue)
+	if (!mp->m_blockgc_wq)
 		goto out_destroy_reclaim;
 
+	mp->m_inodegc_wq = alloc_workqueue("xfs-inodegc/%s",
+			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			1, mp->m_super->s_id);
+	if (!mp->m_inodegc_wq)
+		goto out_destroy_blockgc;
+
 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE), 0, mp->m_super->s_id);
 	if (!mp->m_sync_workqueue)
-		goto out_destroy_eofb;
+		goto out_destroy_inodegc;
 
 	return 0;
 
-out_destroy_eofb:
-	destroy_workqueue(mp->m_gc_workqueue);
+out_destroy_inodegc:
+	destroy_workqueue(mp->m_inodegc_wq);
+out_destroy_blockgc:
+	destroy_workqueue(mp->m_blockgc_wq);
 out_destroy_reclaim:
 	destroy_workqueue(mp->m_reclaim_workqueue);
 out_destroy_cil:
@@ -562,7 +570,8 @@ xfs_destroy_mount_workqueues(
 	struct xfs_mount	*mp)
 {
 	destroy_workqueue(mp->m_sync_workqueue);
-	destroy_workqueue(mp->m_gc_workqueue);
+	destroy_workqueue(mp->m_blockgc_wq);
+	destroy_workqueue(mp->m_inodegc_wq);
 	destroy_workqueue(mp->m_reclaim_workqueue);
 	destroy_workqueue(mp->m_cil_workqueue);
 	destroy_workqueue(mp->m_unwritten_workqueue);
@@ -724,6 +733,8 @@ xfs_fs_sync_fs(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	trace_xfs_fs_sync_fs(mp, __return_address);
+
 	/*
 	 * Doing anything during the async pass would be counterproductive.
 	 */
@@ -740,6 +751,25 @@ xfs_fs_sync_fs(
 		flush_delayed_work(&mp->m_log->l_work);
 	}
 
+	/*
+	 * Flush all deferred inode inactivation work so that the free space
+	 * counters will reflect recent deletions.  Do not force the log again
+	 * because log recovery can restart the inactivation from the info that
+	 * we just wrote into the ondisk log.
+	 *
+	 * For regular operation this isn't strictly necessary since we aren't
+	 * required to guarantee that unlinking frees space immediately, but
+	 * that is how XFS historically behaved.
+	 *
+	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
+	 * last chance to complete the inactivation work before the filesystem
+	 * freezes and the log is quiesced.  The background worker will not
+	 * activate again until the fs is thawed because the VFS won't evict
+	 * any more inodes until freeze_super drops s_umount and we disable the
+	 * worker in xfs_fs_freeze.
+	 */
+	xfs_inodegc_flush(mp);
+
 	return 0;
 }
 
@@ -854,6 +884,17 @@ xfs_fs_freeze(
 	 */
 	flags = memalloc_nofs_save();
 	xfs_blockgc_stop(mp);
+
+	/*
+	 * Stop the inodegc background worker.  freeze_super already flushed
+	 * all pending inodegc work when it sync'd the filesystem after setting
+	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
+	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
+	 * If the filesystem is read-write, inactivated inodes will queue but
+	 * the worker will not run until the filesystem thaws or unmounts.
+	 */
+	xfs_inodegc_stop(mp);
+
 	xfs_save_resvblks(mp);
 	ret = xfs_log_quiesce(mp);
 	memalloc_nofs_restore(flags);
@@ -869,6 +910,14 @@ xfs_fs_unfreeze(
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
 	xfs_blockgc_start(mp);
+
+	/*
+	 * Don't reactivate the inodegc worker on a readonly filesystem because
+	 * inodes are sent directly to reclaim.
+	 */
+	if (!(mp->m_flags & XFS_MOUNT_RDONLY))
+		xfs_inodegc_start(mp);
+
 	return 0;
 }
 
@@ -994,6 +1043,35 @@ xfs_destroy_percpu_counters(
 	percpu_counter_destroy(&mp->m_delalloc_blks);
 }
 
+static int
+xfs_inodegc_init_percpu(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	mp->m_inodegc = alloc_percpu(struct xfs_inodegc);
+	if (!mp->m_inodegc)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		init_llist_head(&gc->list);
+		gc->items = 0;
+		INIT_WORK(&gc->work, xfs_inodegc_worker);
+	}
+	return 0;
+}
+
+static void
+xfs_inodegc_free_percpu(
+	struct xfs_mount	*mp)
+{
+	if (!mp->m_inodegc)
+		return;
+	free_percpu(mp->m_inodegc);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
@@ -1011,6 +1089,7 @@ xfs_fs_put_super(
 	xfs_freesb(mp);
 	free_percpu(mp->m_stats.xs_stats);
 	xfs_mount_list_del(mp);
+	xfs_inodegc_free_percpu(mp);
 	xfs_destroy_percpu_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
@@ -1382,6 +1461,10 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_destroy_workqueues;
 
+	error = xfs_inodegc_init_percpu(mp);
+	if (error)
+		goto out_destroy_counters;
+
 	/*
 	 * All percpu data structures requiring cleanup when a cpu goes offline
 	 * must be allocated before adding this @mp to the cpu-dead handler's
@@ -1393,7 +1476,7 @@ xfs_fs_fill_super(
 	mp->m_stats.xs_stats = alloc_percpu(struct xfsstats);
 	if (!mp->m_stats.xs_stats) {
 		error = -ENOMEM;
-		goto out_destroy_counters;
+		goto out_destroy_inodegc;
 	}
 
 	error = xfs_readsb(mp, flags);
@@ -1596,8 +1679,10 @@ xfs_fs_fill_super(
 	xfs_freesb(mp);
  out_free_stats:
 	free_percpu(mp->m_stats.xs_stats);
+ out_destroy_inodegc:
+	xfs_mount_list_del(mp);
+	xfs_inodegc_free_percpu(mp);
  out_destroy_counters:
-	xfs_mount_list_del(mp);
 	xfs_destroy_percpu_counters(mp);
  out_destroy_workqueues:
 	xfs_destroy_mount_workqueues(mp);
@@ -1680,6 +1765,9 @@ xfs_remount_rw(
 	if (error && error != -ENOSPC)
 		return error;
 
+	/* Re-enable the background inode inactivation worker. */
+	xfs_inodegc_start(mp);
+
 	return 0;
 }
 
@@ -1702,6 +1790,15 @@ xfs_remount_ro(
 		return error;
 	}
 
+	/*
+	 * Stop the inodegc background worker.  xfs_fs_reconfigure already
+	 * flushed all pending inodegc work when it sync'd the filesystem.
+	 * The VFS holds s_umount, so we know that inodes cannot enter
+	 * xfs_fs_destroy_inode during a remount operation.  In readonly mode
+	 * we send inodes straight to reclaim, so no inodes will be queued.
+	 */
+	xfs_inodegc_stop(mp);
+
 	/* Free the per-AG metadata reservation pool. */
 	error = xfs_fs_unreserve_ag_blocks(mp);
 	if (error) {
@@ -2102,7 +2199,7 @@ xfs_cpu_dead(
 	spin_lock(&xfs_mount_list_lock);
 	list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
 		spin_unlock(&xfs_mount_list_lock);
-		/* xfs_subsys_dead(mp, cpu); */
+		xfs_inodegc_cpu_dead(mp, cpu);
 		spin_lock(&xfs_mount_list_lock);
 	}
 	spin_unlock(&xfs_mount_list_lock);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 19260291ff8b..bd8abb50b33a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -157,6 +157,48 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
 
+#define XFS_STATE_FLAGS \
+	{ (1UL << XFS_STATE_INODEGC_ENABLED),		"inodegc" }
+
+DECLARE_EVENT_CLASS(xfs_fs_class,
+	TP_PROTO(struct xfs_mount *mp, void *caller_ip),
+	TP_ARGS(mp, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, mflags)
+		__field(unsigned long, opstate)
+		__field(unsigned long, sbflags)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		if (mp) {
+			__entry->dev = mp->m_super->s_dev;
+			__entry->mflags = mp->m_flags;
+			__entry->opstate = mp->m_opstate;
+			__entry->sbflags = mp->m_super->s_flags;
+		}
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d m_flags 0x%llx opstate (%s) s_flags 0x%lx caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->mflags,
+		  __print_flags(__entry->opstate, "|", XFS_STATE_FLAGS),
+		  __entry->sbflags,
+		  __entry->caller_ip)
+);
+
+#define DEFINE_FS_EVENT(name)	\
+DEFINE_EVENT(xfs_fs_class, name,					\
+	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
+	TP_ARGS(mp, caller_ip))
+DEFINE_FS_EVENT(xfs_inodegc_flush);
+DEFINE_FS_EVENT(xfs_inodegc_start);
+DEFINE_FS_EVENT(xfs_inodegc_stop);
+DEFINE_FS_EVENT(xfs_inodegc_worker);
+DEFINE_FS_EVENT(xfs_inodegc_queue);
+DEFINE_FS_EVENT(xfs_inodegc_throttle);
+DEFINE_FS_EVENT(xfs_fs_sync_fs);
+
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
 	TP_ARGS(mp, agno),
@@ -616,14 +658,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
+		__field(unsigned long, iflags)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
+		__entry->iflags = ip->i_flags;
 	),
-	TP_printk("dev %d:%d ino 0x%llx",
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino)
+		  __entry->ino,
+		  __entry->iflags)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -667,6 +712,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
+DEFINE_INODE_EVENT(xfs_inode_reclaiming);
+DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
+DEFINE_INODE_EVENT(xfs_inode_inactivating);
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 06/14] xfs: queue inactivation immediately when free space is tight
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-08-05  2:06 ` [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues Darrick J. Wong
@ 2021-08-05  2:06 ` Darrick J. Wong
  2021-08-05  5:31   ` Dave Chinner
  2021-08-05  2:07 ` [PATCH 07/14] xfs: queue inactivation immediately when quota is nearing enforcement Darrick J. Wong
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have made the inactivation of unlinked inodes a background
task to increase the throughput of file deletions, we need to be a
little more careful about how long of a delay we can tolerate.

On a mostly empty filesystem, the risk of the allocator making poor
decisions due to fragmentation of the free space on account of a lengthy
delay in background updates is minimal because there's plenty of space.
However, if free space is tight, we want to deallocate unlinked inodes
as quickly as possible to avoid fallocate ENOSPC and to give the
allocator the best shot at optimal allocations for new writes.

Therefore, queue the percpu worker immediately if the filesystem is more
than 95% full.  This follows the same principle that XFS becomes less
aggressive about speculative allocations and lazy cleanup (and more
precise about accounting) when nearing full.
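
(A rough worked example, assuming m_low_space[XFS_LOWSP_5_PCNT] is still
computed as (sb_dblocks / 100) * 5: on a filesystem with 1,000,000 data
blocks the threshold is 50,000 blocks, so once the free data block
counter drops below that, the queueing path stops batching and kicks
the percpu worker right away.)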

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    6 ++++++
 fs/xfs/xfs_mount.c  |    8 --------
 fs/xfs/xfs_mount.h  |    9 +++++++++
 3 files changed, 15 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index fedfa40e3cd6..0332acaad6f3 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1928,6 +1928,7 @@ xfs_inodegc_start(
  * Schedule the inactivation worker when:
  *
  *  - We've accumulated more than one inode cluster buffer's worth of inodes.
+ *  - There is less than 5% free space left.
  */
 static inline bool
 xfs_inodegc_want_queue_work(
@@ -1939,6 +1940,11 @@ xfs_inodegc_want_queue_work(
 	if (items > mp->m_ino_geo.inodes_per_cluster)
 		return true;
 
+	if (__percpu_counter_compare(&mp->m_fdblocks,
+				mp->m_low_space[XFS_LOWSP_5_PCNT],
+				XFS_FDBLOCKS_BATCH) < 0)
+		return true;
+
 	return false;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 1f7e9a608f38..5fe6f1db4fe9 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1077,14 +1077,6 @@ xfs_fs_writable(
 	return true;
 }
 
-/*
- * Deltas for the block count can vary from 1 to very large, but lock contention
- * only occurs on frequent small block count updates such as in the delayed
- * allocation path for buffered writes (page a time updates). Hence we set
- * a large batch count (1024) to minimise global counter updates except when
- * we get near to ENOSPC and we have to be very accurate with our updates.
- */
-#define XFS_FDBLOCKS_BATCH	1024
 int
 xfs_mod_fdblocks(
 	struct xfs_mount	*mp,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 03d59a023bbb..750297498a09 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -343,6 +343,15 @@ extern uint64_t xfs_default_resblks(xfs_mount_t *mp);
 extern int	xfs_mountfs(xfs_mount_t *mp);
 extern void	xfs_unmountfs(xfs_mount_t *);
 
+/*
+ * Deltas for the block count can vary from 1 to very large, but lock contention
+ * only occurs on frequent small block count updates such as in the delayed
+ * allocation path for buffered writes (page a time updates). Hence we set
+ * a large batch count (1024) to minimise global counter updates except when
+ * we get near to ENOSPC and we have to be very accurate with our updates.
+ */
+#define XFS_FDBLOCKS_BATCH	1024
+
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
 				 bool reserved);
 extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 07/14] xfs: queue inactivation immediately when quota is nearing enforcement
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-08-05  2:06 ` [PATCH 06/14] xfs: queue inactivation immediately when free space is tight Darrick J. Wong
@ 2021-08-05  2:07 ` Darrick J. Wong
  2021-08-05  5:35   ` Dave Chinner
  2021-08-05  2:07 ` [PATCH 08/14] xfs: queue inactivation immediately when free realtime extents are tight Darrick J. Wong
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:07 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have made the inactivation of unlinked inodes a background
task to increase the throughput of file deletions, we need to be a
little more careful about how long of a delay we can tolerate.

Specifically, if the dquots attached to the inode being inactivated are
nearing any kind of enforcement boundary, we want to queue that
inactivation work immediately so that users don't get EDQUOT/ENOSPC
errors even after they deleted a bunch of files to stay within quota.
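
(A rough worked example, assuming q_prealloc_hi_wmark still tracks the
dquot's block hard limit and q_low_space[XFS_QLOWSP_5_PCNT] is 5% of
that watermark: a user with a 100,000-block hard limit who already has
more than 95,000 blocks reserved is considered near enforcement, so an
unlink by that user kicks the percpu worker immediately instead of
letting the backlog build up.)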

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_dquot.h  |   10 ++++++++++
 fs/xfs/xfs_icache.c |   10 ++++++++++
 fs/xfs/xfs_qm.c     |   34 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_quota.h  |    2 ++
 4 files changed, 56 insertions(+)


diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index f642884a6834..6b5e3cf40c8b 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -54,6 +54,16 @@ struct xfs_dquot_res {
 	xfs_qwarncnt_t		warnings;
 };
 
+static inline bool
+xfs_dquot_res_over_limits(
+	const struct xfs_dquot_res	*qres)
+{
+	if ((qres->softlimit && qres->softlimit < qres->reserved) ||
+	    (qres->hardlimit && qres->hardlimit < qres->reserved))
+		return true;
+	return false;
+}
+
 /*
  * The incore dquot structure
  */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0332acaad6f3..e5e90f09bcc6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1929,6 +1929,7 @@ xfs_inodegc_start(
  *
  *  - We've accumulated more than one inode cluster buffer's worth of inodes.
  *  - There is less than 5% free space left.
+ *  - Any of the quotas for this inode are near an enforcement limit.
  */
 static inline bool
 xfs_inodegc_want_queue_work(
@@ -1945,6 +1946,15 @@ xfs_inodegc_want_queue_work(
 				XFS_FDBLOCKS_BATCH) < 0)
 		return true;
 
+	if (xfs_inode_near_dquot_enforcement(ip, XFS_DQTYPE_USER))
+		return true;
+
+	if (xfs_inode_near_dquot_enforcement(ip, XFS_DQTYPE_GROUP))
+		return true;
+
+	if (xfs_inode_near_dquot_enforcement(ip, XFS_DQTYPE_PROJ))
+		return true;
+
 	return false;
 }
 
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 351d99bc52e5..2bef4735d030 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1882,3 +1882,37 @@ xfs_qm_vop_create_dqattach(
 	}
 }
 
+/* Decide if this inode's dquot is near an enforcement boundary. */
+bool
+xfs_inode_near_dquot_enforcement(
+	struct xfs_inode	*ip,
+	xfs_dqtype_t		type)
+{
+	struct xfs_dquot	*dqp;
+	int64_t			freesp;
+
+	/* We only care for quotas that are enabled and enforced. */
+	dqp = xfs_inode_dquot(ip, type);
+	if (!dqp || !xfs_dquot_is_enforced(dqp))
+		return false;
+
+	if (xfs_dquot_res_over_limits(&dqp->q_ino) ||
+	    xfs_dquot_res_over_limits(&dqp->q_rtb))
+		return true;
+
+	/* For space on the data device, check the various thresholds. */
+	if (!dqp->q_prealloc_hi_wmark)
+		return false;
+
+	if (dqp->q_blk.reserved < dqp->q_prealloc_lo_wmark)
+		return false;
+
+	if (dqp->q_blk.reserved >= dqp->q_prealloc_hi_wmark)
+		return true;
+
+	freesp = dqp->q_prealloc_hi_wmark - dqp->q_blk.reserved;
+	if (freesp < dqp->q_low_space[XFS_QLOWSP_5_PCNT])
+		return true;
+
+	return false;
+}
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index d00d01302545..dcc785fdd345 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -113,6 +113,7 @@ xfs_quota_reserve_blkres(struct xfs_inode *ip, int64_t blocks)
 {
 	return xfs_trans_reserve_quota_nblks(NULL, ip, blocks, 0, false);
 }
+bool xfs_inode_near_dquot_enforcement(struct xfs_inode *ip, xfs_dqtype_t type);
 #else
 static inline int
 xfs_qm_vop_dqalloc(struct xfs_inode *ip, kuid_t kuid, kgid_t kgid,
@@ -168,6 +169,7 @@ xfs_trans_reserve_quota_icreate(struct xfs_trans *tp, struct xfs_dquot *udqp,
 #define xfs_qm_mount_quotas(mp)
 #define xfs_qm_unmount(mp)
 #define xfs_qm_unmount_quotas(mp)
+#define xfs_inode_near_dquot_enforcement(ip, type)			(false)
 #endif /* CONFIG_XFS_QUOTA */
 
 static inline int


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 08/14] xfs: queue inactivation immediately when free realtime extents are tight
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (6 preceding siblings ...)
  2021-08-05  2:07 ` [PATCH 07/14] xfs: queue inactivation immediately when quota is nearing enforcement Darrick J. Wong
@ 2021-08-05  2:07 ` Darrick J. Wong
  2021-08-05  5:36   ` Dave Chinner
  2021-08-05  2:07 ` [PATCH 09/14] xfs: inactivate inodes any time we try to free speculative preallocations Darrick J. Wong
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:07 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have made the inactivation of unlinked inodes a background
task to increase the throughput of file deletions, we need to be a
little more careful about how long of a delay we can tolerate.

Similar to the patch doing this for free space on the data device, if
the file being inactivated is a realtime file and the realtime volume is
running low on free extents, we want to run the worker ASAP so that the
realtime allocator can make better decisions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   24 ++++++++++++++++++++++++
 fs/xfs/xfs_mount.c  |   13 ++++++++-----
 fs/xfs/xfs_mount.h  |    3 ++-
 3 files changed, 34 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e5e90f09bcc6..4a062cf689c3 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1924,6 +1924,27 @@ xfs_inodegc_start(
 	xfs_inodegc_queue_all(mp);
 }
 
+#ifdef CONFIG_XFS_RT
+static inline bool
+xfs_inodegc_want_queue_rt_file(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	uint64_t		freertx;
+
+	if (!XFS_IS_REALTIME_INODE(ip))
+		return false;
+
+	spin_lock(&mp->m_sb_lock);
+	freertx = mp->m_sb.sb_frextents;
+	spin_unlock(&mp->m_sb_lock);
+
+	return freertx < mp->m_low_rtexts[XFS_LOWSP_5_PCNT];
+}
+#else
+# define xfs_inodegc_want_queue_rt_file(ip)	(false)
+#endif /* CONFIG_XFS_RT */
+
 /*
  * Schedule the inactivation worker when:
  *
@@ -1946,6 +1967,9 @@ xfs_inodegc_want_queue_work(
 				XFS_FDBLOCKS_BATCH) < 0)
 		return true;
 
+	if (xfs_inodegc_want_queue_rt_file(ip))
+		return true;
+
 	if (xfs_inode_near_dquot_enforcement(ip, XFS_DQTYPE_USER))
 		return true;
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 5fe6f1db4fe9..ed1e7e3dce7e 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -365,13 +365,16 @@ void
 xfs_set_low_space_thresholds(
 	struct xfs_mount	*mp)
 {
-	int i;
+	uint64_t		dblocks = mp->m_sb.sb_dblocks;
+	uint64_t		rtexts = mp->m_sb.sb_rextents;
+	int			i;
+
+	do_div(dblocks, 100);
+	do_div(rtexts, 100);
 
 	for (i = 0; i < XFS_LOWSP_MAX; i++) {
-		uint64_t space = mp->m_sb.sb_dblocks;
-
-		do_div(space, 100);
-		mp->m_low_space[i] = space * (i + 1);
+		mp->m_low_space[i] = dblocks * (i + 1);
+		mp->m_low_rtexts[i] = rtexts * (i + 1);
 	}
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 750297498a09..1061ac985c18 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -145,7 +145,8 @@ typedef struct xfs_mount {
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
 	uint			m_qflags;	/* quota status flags */
 	uint64_t		m_flags;	/* global mount flags */
-	int64_t			m_low_space[XFS_LOWSP_MAX];
+	uint64_t		m_low_space[XFS_LOWSP_MAX];
+	uint64_t		m_low_rtexts[XFS_LOWSP_MAX];
 	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
 						/* low free space thresholds */


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 09/14] xfs: inactivate inodes any time we try to free speculative preallocations
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (7 preceding siblings ...)
  2021-08-05  2:07 ` [PATCH 08/14] xfs: queue inactivation immediately when free realtime extents are tight Darrick J. Wong
@ 2021-08-05  2:07 ` Darrick J. Wong
  2021-08-05  5:36   ` Dave Chinner
  2021-08-05  2:07 ` [PATCH 10/14] xfs: flush inode inactivation work when compiling usage statistics Darrick J. Wong
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:07 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Other parts of XFS have learned to call xfs_blockgc_free_{space,quota}
to try to free speculative preallocations when space is tight.  This
means that file writes, transaction reservation failures, quota limit
enforcement, and the EOFBLOCKS ioctl all call this function to free
space when things are tight.

Since inode inactivation is now a background task, this means that the
filesystem can be hanging on to unlinked but not yet freed space.  Add
this to the list of things that xfs_blockgc_free_* makes writer threads
scan for when they cannot reserve space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4a062cf689c3..f2d12405dd87 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1468,16 +1468,24 @@ xfs_blockgc_worker(
 }
 
 /*
- * Try to free space in the filesystem by purging eofblocks and cowblocks.
+ * Try to free space in the filesystem by purging inactive inodes, eofblocks
+ * and cowblocks.
  */
 int
 xfs_blockgc_free_space(
 	struct xfs_mount	*mp,
 	struct xfs_icwalk	*icw)
 {
+	int			error;
+
 	trace_xfs_blockgc_free_space(mp, icw, _RET_IP_);
 
-	return xfs_icwalk(mp, XFS_ICWALK_BLOCKGC, icw);
+	error = xfs_icwalk(mp, XFS_ICWALK_BLOCKGC, icw);
+	if (error)
+		return error;
+
+	xfs_inodegc_flush(mp);
+	return 0;
 }
 
 /*


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 10/14] xfs: flush inode inactivation work when compiling usage statistics
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (8 preceding siblings ...)
  2021-08-05  2:07 ` [PATCH 09/14] xfs: inactivate inodes any time we try to free speculative preallocations Darrick J. Wong
@ 2021-08-05  2:07 ` Darrick J. Wong
  2021-08-05  5:38   ` Dave Chinner
  2021-08-05  2:07 ` [PATCH 11/14] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:07 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Users have come to expect that the space accounting information in
statfs and getquota reports is fairly accurate.  Now that we inactivate
inodes from a background queue, these numbers can be thrown off by
whatever resources are singly-owned by the inodes in the queue.  Flush
the pending inactivations when userspace asks for a space usage report.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_qm_syscalls.c |    8 ++++++++
 fs/xfs/xfs_super.c       |    3 +++
 2 files changed, 11 insertions(+)


diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 982cd6613a4c..c6902f9d064c 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -481,6 +481,10 @@ xfs_qm_scall_getquota(
 	struct xfs_dquot	*dqp;
 	int			error;
 
+	/* Flush inodegc work at the start of a quota reporting scan. */
+	if (id == 0)
+		xfs_inodegc_flush(mp);
+
 	/*
 	 * Try to get the dquot. We don't want it allocated on disk, so don't
 	 * set doalloc. If it doesn't exist, we'll get ENOENT back.
@@ -519,6 +523,10 @@ xfs_qm_scall_getquota_next(
 	struct xfs_dquot	*dqp;
 	int			error;
 
+	/* Flush inodegc work at the start of a quota reporting scan. */
+	if (*id == 0)
+		xfs_inodegc_flush(mp);
+
 	error = xfs_qm_dqget_next(mp, *id, type, &dqp);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ade192ea4230..2ac040c95703 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -788,6 +788,9 @@ xfs_fs_statfs(
 	xfs_extlen_t		lsize;
 	int64_t			ffree;
 
+	/* Wait for whatever inactivations are in progress. */
+	xfs_inodegc_flush(mp);
+
 	statp->f_type = XFS_SUPER_MAGIC;
 	statp->f_namelen = MAXNAMELEN - 1;
 


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 11/14] xfs: don't run speculative preallocation gc when fs is frozen
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (9 preceding siblings ...)
  2021-08-05  2:07 ` [PATCH 10/14] xfs: flush inode inactivation work when compiling usage statistics Darrick J. Wong
@ 2021-08-05  2:07 ` Darrick J. Wong
  2021-08-05  5:40   ` Dave Chinner
  2021-08-05  2:07 ` [PATCH 12/14] xfs: use background worker pool when transactions can't get free space Darrick J. Wong
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:07 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have the infrastructure to switch background workers on and
off at will, fix the block gc worker code so that we don't actually run
the worker when the filesystem is frozen, same as we do for deferred
inactivation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |    9 +++++----
 fs/xfs/xfs_icache.c   |   27 +++++++++++++++++++++++----
 fs/xfs/xfs_mount.c    |    1 +
 fs/xfs/xfs_mount.h    |    8 ++++++++
 fs/xfs/xfs_super.c    |    9 ++++++---
 fs/xfs/xfs_trace.h    |    6 +++++-
 6 files changed, 48 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 06b697f72f23..e86854171b0c 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -893,11 +893,12 @@ xchk_start_reaping(
 	struct xfs_scrub	*sc)
 {
 	/*
-	 * Readonly filesystems do not perform inactivation, so there's no
-	 * need to restart the worker.
+	 * Readonly filesystems do not perform inactivation or speculative
+	 * preallocation, so there's no need to restart the workers.
 	 */
-	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY))
+	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY)) {
 		xfs_inodegc_start(sc->mp);
-	xfs_blockgc_start(sc->mp);
+		xfs_blockgc_start(sc->mp);
+	}
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f2d12405dd87..5ee1a084d36d 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -211,6 +211,11 @@ static inline void
 xfs_blockgc_queue(
 	struct xfs_perag	*pag)
 {
+	struct xfs_mount	*mp = pag->pag_mount;
+
+	if (!xfs_is_blockgc_enabled(mp))
+		return;
+
 	rcu_read_lock();
 	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG))
 		queue_delayed_work(pag->pag_mount->m_blockgc_wq,
@@ -1366,8 +1371,12 @@ xfs_blockgc_stop(
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		agno;
 
-	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
+	if (!xfs_clear_blockgc_enabled(mp))
+		return;
+
+	for_each_perag(mp, agno, pag)
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
+	trace_xfs_blockgc_stop(mp, __return_address);
 }
 
 /* Enable post-EOF and CoW block auto-reclamation. */
@@ -1378,6 +1387,10 @@ xfs_blockgc_start(
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		agno;
 
+	if (xfs_set_blockgc_enabled(mp))
+		return;
+
+	trace_xfs_blockgc_start(mp, __return_address);
 	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
 		xfs_blockgc_queue(pag);
 }
@@ -1435,6 +1448,13 @@ xfs_blockgc_scan_inode(
 	unsigned int		lockflags = 0;
 	int			error;
 
+	/*
+	 * Speculative preallocation gc isn't supposed to run when the fs is
+	 * frozen because we don't want kernel threads to block on transaction
+	 * allocation.
+	 */
+	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);
+
 	error = xfs_inode_free_eofblocks(ip, icw, &lockflags);
 	if (error)
 		goto unlock;
@@ -1457,13 +1477,12 @@ xfs_blockgc_worker(
 	struct xfs_mount	*mp = pag->pag_mount;
 	int			error;
 
-	if (!sb_start_write_trylock(mp->m_super))
-		return;
+	trace_xfs_blockgc_worker(mp, __return_address);
+
 	error = xfs_icwalk_ag(pag, XFS_ICWALK_BLOCKGC, NULL);
 	if (error)
 		xfs_info(mp, "AG %u preallocation gc worker failed, err=%d",
 				pag->pag_agno, error);
-	sb_end_write(mp->m_super);
 	xfs_blockgc_queue(pag);
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index ed1e7e3dce7e..b81f2fc734bd 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -789,6 +789,7 @@ xfs_mountfs(
 
 	/* Enable background inode inactivation workers. */
 	xfs_inodegc_start(mp);
+	xfs_blockgc_start(mp);
 
 	/*
 	 * Get and sanity-check the root inode.
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 1061ac985c18..797b8068dfe6 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -280,6 +280,13 @@ typedef struct xfs_mount {
  */
 #define XFS_STATE_INODEGC_ENABLED	0
 
+/*
+ * If set, background speculative prealloc gc worker threads will be scheduled
+ * to process queued blockgc work.  If not, inodes retain their preallocations
+ * until explicitly deleted.
+ */
+#define XFS_STATE_BLOCKGC_ENABLED	1
+
 #define __XFS_IS_STATE(name, NAME) \
 static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
 { \
@@ -295,6 +302,7 @@ static inline bool xfs_set_ ## name (struct xfs_mount *mp) \
 }
 
 __XFS_IS_STATE(inodegc_enabled, INODEGC_ENABLED)
+__XFS_IS_STATE(blockgc_enabled, BLOCKGC_ENABLED)
 
 /*
  * Max and min values for mount-option defined I/O
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2ac040c95703..2bab18ed73b9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -912,14 +912,17 @@ xfs_fs_unfreeze(
 
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
-	xfs_blockgc_start(mp);
 
 	/*
 	 * Don't reactivate the inodegc worker on a readonly filesystem because
-	 * inodes are sent directly to reclaim.
+	 * inodes are sent directly to reclaim.  Don't reactivate the blockgc
+	 * worker because there are no speculative preallocations on a readonly
+	 * filesystem.
 	 */
-	if (!(mp->m_flags & XFS_MOUNT_RDONLY))
+	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
+		xfs_blockgc_start(mp);
 		xfs_inodegc_start(mp);
+	}
 
 	return 0;
 }
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index bd8abb50b33a..1b1c288610af 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -158,7 +158,8 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
 
 #define XFS_STATE_FLAGS \
-	{ (1UL << XFS_STATE_INODEGC_ENABLED),		"inodegc" }
+	{ (1UL << XFS_STATE_INODEGC_ENABLED),		"inodegc" }, \
+	{ (1UL << XFS_STATE_BLOCKGC_ENABLED),		"blockgc" }
 
 DECLARE_EVENT_CLASS(xfs_fs_class,
 	TP_PROTO(struct xfs_mount *mp, void *caller_ip),
@@ -198,6 +199,9 @@ DEFINE_FS_EVENT(xfs_inodegc_worker);
 DEFINE_FS_EVENT(xfs_inodegc_queue);
 DEFINE_FS_EVENT(xfs_inodegc_throttle);
 DEFINE_FS_EVENT(xfs_fs_sync_fs);
+DEFINE_FS_EVENT(xfs_blockgc_start);
+DEFINE_FS_EVENT(xfs_blockgc_stop);
+DEFINE_FS_EVENT(xfs_blockgc_worker);
 
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 12/14] xfs: use background worker pool when transactions can't get free space
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (10 preceding siblings ...)
  2021-08-05  2:07 ` [PATCH 11/14] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
@ 2021-08-05  2:07 ` Darrick J. Wong
  2021-08-05  5:42   ` Dave Chinner
  2021-08-05  2:07 ` [PATCH 13/14] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
  2021-08-05  2:07 ` [PATCH 14/14] xfs: throttle inode inactivation queuing on memory reclaim Darrick J. Wong
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:07 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
call xfs_blockgc_free_space with a NULL icwalk structure to try to free
space.  Each frontend thread that encounters this situation starts its
own walk of the inode cache to see if it can find anything, which is
wasteful since we don't have any additional selection criteria.  For
this one common case, create a function that reschedules all pending
background work immediately and flushes the workqueue so that the scan
can run in parallel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   28 ++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.h |    1 +
 fs/xfs/xfs_trace.h  |    1 +
 fs/xfs/xfs_trans.c  |    5 +----
 4 files changed, 31 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 5ee1a084d36d..f5a4f4d64c50 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1507,6 +1507,34 @@ xfs_blockgc_free_space(
 	return 0;
 }
 
+/*
+ * Reclaim all the free space that we can by scheduling the background blockgc
+ * and inodegc workers immediately and waiting for them all to clear.
+ */
+void
+xfs_blockgc_flush_all(
+	struct xfs_mount	*mp)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	trace_xfs_blockgc_flush_all(mp, __return_address);
+
+	/*
+	 * For each blockgc worker, move its queue time up to now.  If it
+	 * wasn't queued, it will not be requeued.  Then flush whatever's
+	 * left.
+	 */
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
+		mod_delayed_work(pag->pag_mount->m_blockgc_wq,
+				&pag->pag_blockgc_work, 0);
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
+		flush_delayed_work(&pag->pag_blockgc_work);
+
+	xfs_inodegc_flush(mp);
+}
+
 /*
  * Run cow/eofblocks scans on the supplied dquots.  We don't know exactly which
  * quota caused an allocation failure, so we make a best effort by including
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 8175148afd50..18c2d224aa78 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -59,6 +59,7 @@ int xfs_blockgc_free_dquots(struct xfs_mount *mp, struct xfs_dquot *udqp,
 		unsigned int iwalk_flags);
 int xfs_blockgc_free_quota(struct xfs_inode *ip, unsigned int iwalk_flags);
 int xfs_blockgc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icm);
+void xfs_blockgc_flush_all(struct xfs_mount *mp);
 
 void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip);
 void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 1b1c288610af..0a179cfc35c0 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -202,6 +202,7 @@ DEFINE_FS_EVENT(xfs_fs_sync_fs);
 DEFINE_FS_EVENT(xfs_blockgc_start);
 DEFINE_FS_EVENT(xfs_blockgc_stop);
 DEFINE_FS_EVENT(xfs_blockgc_worker);
+DEFINE_FS_EVENT(xfs_blockgc_flush_all);
 
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 87bffd12c20c..83abaa219616 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -295,10 +295,7 @@ xfs_trans_alloc(
 		 * Do not perform a synchronous scan because callers can hold
 		 * other locks.
 		 */
-		error = xfs_blockgc_free_space(mp, NULL);
-		if (error)
-			return error;
-
+		xfs_blockgc_flush_all(mp);
 		want_retry = false;
 		goto retry;
 	}


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 13/14] xfs: avoid buffer deadlocks when walking fs inodes
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (11 preceding siblings ...)
  2021-08-05  2:07 ` [PATCH 12/14] xfs: use background worker pool when transactions can't get free space Darrick J. Wong
@ 2021-08-05  2:07 ` Darrick J. Wong
  2021-08-05  2:07 ` [PATCH 14/14] xfs: throttle inode inactivation queuing on memory reclaim Darrick J. Wong
  13 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:07 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

When we're servicing an INUMBERS or BULKSTAT request or running
quotacheck, grab an empty transaction so that we can use its inherent
recursive buffer locking abilities to detect inode btree cycles without
hitting ABBA buffer deadlocks.  This patch requires the deferred inode
inactivation patchset because xfs_irele cannot directly call
xfs_inactive when the iwalk itself has an (empty) transaction.

Found by fuzzing an inode btree pointer to introduce a cycle into the
tree (xfs/365).

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_itable.c |   42 +++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_iwalk.c  |   33 ++++++++++++++++++++++++++++-----
 2 files changed, 65 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index f331975a16de..84c17a9f9869 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -19,6 +19,7 @@
 #include "xfs_error.h"
 #include "xfs_icache.h"
 #include "xfs_health.h"
+#include "xfs_trans.h"
 
 /*
  * Bulk Stat
@@ -163,6 +164,7 @@ xfs_bulkstat_one(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error;
 
 	if (breq->mnt_userns != &init_user_ns) {
@@ -178,9 +180,18 @@ xfs_bulkstat_one(
 	if (!bc.buf)
 		return -ENOMEM;
 
-	error = xfs_bulkstat_one_int(breq->mp, breq->mnt_userns, NULL,
-				     breq->startino, &bc);
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
 
+	error = xfs_bulkstat_one_int(breq->mp, breq->mnt_userns, tp,
+			breq->startino, &bc);
+	xfs_trans_cancel(tp);
+out:
 	kmem_free(bc.buf);
 
 	/*
@@ -244,6 +255,7 @@ xfs_bulkstat(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error;
 
 	if (breq->mnt_userns != &init_user_ns) {
@@ -259,9 +271,18 @@ xfs_bulkstat(
 	if (!bc.buf)
 		return -ENOMEM;
 
-	error = xfs_iwalk(breq->mp, NULL, breq->startino, breq->flags,
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
+
+	error = xfs_iwalk(breq->mp, tp, breq->startino, breq->flags,
 			xfs_bulkstat_iwalk, breq->icount, &bc);
-
+	xfs_trans_cancel(tp);
+out:
 	kmem_free(bc.buf);
 
 	/*
@@ -374,13 +395,24 @@ xfs_inumbers(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error = 0;
 
 	if (xfs_bulkstat_already_done(breq->mp, breq->startino))
 		return 0;
 
-	error = xfs_inobt_walk(breq->mp, NULL, breq->startino, breq->flags,
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
+
+	error = xfs_inobt_walk(breq->mp, tp, breq->startino, breq->flags,
 			xfs_inumbers_walk, breq->icount, &ic);
+	xfs_trans_cancel(tp);
+out:
 
 	/*
 	 * We found some inode groups, so clear the error status and return
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 917d51eefee3..7558486f4937 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -83,6 +83,9 @@ struct xfs_iwalk_ag {
 
 	/* Skip empty inobt records? */
 	unsigned int			skip_empty:1;
+
+	/* Drop the (hopefully empty) transaction when calling iwalk_fn. */
+	unsigned int			drop_trans:1;
 };
 
 /*
@@ -352,7 +355,6 @@ xfs_iwalk_run_callbacks(
 	int				*has_more)
 {
 	struct xfs_mount		*mp = iwag->mp;
-	struct xfs_trans		*tp = iwag->tp;
 	struct xfs_inobt_rec_incore	*irec;
 	xfs_agino_t			next_agino;
 	int				error;
@@ -362,10 +364,15 @@ xfs_iwalk_run_callbacks(
 	ASSERT(iwag->nr_recs > 0);
 
 	/* Delete cursor but remember the last record we cached... */
-	xfs_iwalk_del_inobt(tp, curpp, agi_bpp, 0);
+	xfs_iwalk_del_inobt(iwag->tp, curpp, agi_bpp, 0);
 	irec = &iwag->recs[iwag->nr_recs - 1];
 	ASSERT(next_agino >= irec->ir_startino + XFS_INODES_PER_CHUNK);
 
+	if (iwag->drop_trans) {
+		xfs_trans_cancel(iwag->tp);
+		iwag->tp = NULL;
+	}
+
 	error = xfs_iwalk_ag_recs(iwag);
 	if (error)
 		return error;
@@ -376,8 +383,15 @@ xfs_iwalk_run_callbacks(
 	if (!has_more)
 		return 0;
 
+	if (iwag->drop_trans) {
+		error = xfs_trans_alloc_empty(mp, &iwag->tp);
+		if (error)
+			return error;
+	}
+
 	/* ...and recreate the cursor just past where we left off. */
-	error = xfs_inobt_cur(mp, tp, iwag->pag, XFS_BTNUM_INO, curpp, agi_bpp);
+	error = xfs_inobt_cur(mp, iwag->tp, iwag->pag, XFS_BTNUM_INO, curpp,
+			agi_bpp);
 	if (error)
 		return error;
 
@@ -390,7 +404,6 @@ xfs_iwalk_ag(
 	struct xfs_iwalk_ag		*iwag)
 {
 	struct xfs_mount		*mp = iwag->mp;
-	struct xfs_trans		*tp = iwag->tp;
 	struct xfs_perag		*pag = iwag->pag;
 	struct xfs_buf			*agi_bp = NULL;
 	struct xfs_btree_cur		*cur = NULL;
@@ -469,7 +482,7 @@ xfs_iwalk_ag(
 	error = xfs_iwalk_run_callbacks(iwag, &cur, &agi_bp, &has_more);
 
 out:
-	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
+	xfs_iwalk_del_inobt(iwag->tp, &cur, &agi_bp, error);
 	return error;
 }
 
@@ -599,8 +612,18 @@ xfs_iwalk_ag_work(
 	error = xfs_iwalk_alloc(iwag);
 	if (error)
 		goto out;
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(mp, &iwag->tp);
+	if (error)
+		goto out;
+	iwag->drop_trans = 1;
 
 	error = xfs_iwalk_ag(iwag);
+	if (iwag->tp)
+		xfs_trans_cancel(iwag->tp);
 	xfs_iwalk_free(iwag);
 out:
 	xfs_perag_put(iwag->pag);


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 14/14] xfs: throttle inode inactivation queuing on memory reclaim
  2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
                   ` (12 preceding siblings ...)
  2021-08-05  2:07 ` [PATCH 13/14] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
@ 2021-08-05  2:07 ` Darrick J. Wong
  2021-08-05  5:44   ` Dave Chinner
  13 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  2:07 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we defer inode inactivation, we've decoupled the process of
unlinking or closing an inode from the process of inactivating it.  In
theory this should lead to better throughput since we now inactivate the
queued inodes in batches instead of one at a time.

Unfortunately, one of the primary risks with this decoupling is the loss
of rate control feedback between the frontend and background threads.
In other words, a rm -rf /* thread can run the system out of memory if
it can queue inodes for inactivation and jump to a new CPU faster than
the background threads can actually clear the deferred work.  The
workers can get scheduled off the CPU if they have to do IO, etc.

To solve this problem, we configure a shrinker so that it will activate
the /second/ time the shrinkers are called.  The custom shrinker will
queue all percpu deferred inactivation workers immediately and set a
flag to force frontend callers who are releasing a vfs inode to wait for
the inactivation workers.
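
(Back-of-the-envelope sketch of why this fires on the second call,
assuming DEF_PRIORITY is 12 and that the VM's do_shrink_slab() credits
freeable/2 objects per call for a ->seeks == 0 shrinker: count_objects
returns 1 << 12 = 4096, so each invocation contributes 2048 toward the
batch threshold of (4096 / 2) + 1 = 2049.  The first call falls one
object short; only the second call accumulates enough to reach the
batch size and invoke ->scan_objects.)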

On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
most of the OOMing problem when deleting 10 million inodes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h |    1 +
 fs/xfs/xfs_mount.c  |    9 ++++-
 fs/xfs/xfs_mount.h  |    3 ++
 fs/xfs/xfs_trace.h  |   37 ++++++++++++++++++-
 5 files changed, 147 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f5a4f4d64c50..6741e27603ad 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1908,8 +1908,9 @@ xfs_inodegc_worker(
 		return;
 
 	ip = llist_entry(node, struct xfs_inode, i_gclist);
-	trace_xfs_inodegc_worker(ip->i_mount, __return_address);
+	trace_xfs_inodegc_worker(ip->i_mount, READ_ONCE(gc->shrinker_hits));
 
+	WRITE_ONCE(gc->shrinker_hits, 0);
 	llist_for_each_entry_safe(ip, n, node, i_gclist) {
 		xfs_iflags_set(ip, XFS_INACTIVATING);
 		xfs_inodegc_inactivate(ip);
@@ -2046,13 +2047,18 @@ xfs_inodegc_want_queue_work(
 /*
  * Make the frontend wait for inactivations when:
  *
+ *  - Memory shrinkers queued the inactivation worker and it hasn't finished.
  *  - The queue depth exceeds the maximum allowable percpu backlog.
  */
 static inline bool
 xfs_inodegc_want_flush_work(
 	struct xfs_inode	*ip,
-	unsigned int		items)
+	unsigned int		items,
+	unsigned int		shrinker_hits)
 {
+	if (shrinker_hits > 0)
+		return true;
+
 	if (items > XFS_INODEGC_MAX_BACKLOG)
 		return true;
 
@@ -2071,6 +2077,7 @@ xfs_inodegc_queue(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_inodegc	*gc;
 	int			items;
+	unsigned int		shrinker_hits;
 
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
@@ -2081,6 +2088,7 @@ xfs_inodegc_queue(
 	llist_add(&ip->i_gclist, &gc->list);
 	items = READ_ONCE(gc->items);
 	WRITE_ONCE(gc->items, items + 1);
+	shrinker_hits = READ_ONCE(gc->shrinker_hits);
 	put_cpu_ptr(gc);
 
 	if (!xfs_is_inodegc_enabled(mp))
@@ -2091,7 +2099,7 @@ xfs_inodegc_queue(
 		queue_work(mp->m_inodegc_wq, &gc->work);
 	}
 
-	if (xfs_inodegc_want_flush_work(ip, items)) {
+	if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) {
 		trace_xfs_inodegc_throttle(mp, __return_address);
 		flush_work(&gc->work);
 	}
@@ -2169,3 +2177,91 @@ xfs_inode_mark_reclaimable(
 	xfs_qm_dqdetach(ip);
 	xfs_inodegc_set_reclaimable(ip);
 }
+
+/*
+ * Register a phony shrinker so that we can run background inodegc sooner when
+ * there's memory pressure.  Inactivation does not itself free any memory but
+ * it does make inodes reclaimable, which eventually frees memory.
+ *
+ * The count function, seek value, and batch value are crafted to trigger the
+ * scan function during the second round of scanning.  Hopefully this means
+ * that we reclaimed enough memory that initiating metadata transactions won't
+ * make things worse.
+ */
+#define XFS_INODEGC_SHRINKER_COUNT	(1UL << DEF_PRIORITY)
+#define XFS_INODEGC_SHRINKER_BATCH	((XFS_INODEGC_SHRINKER_COUNT / 2) + 1)
+
+static unsigned long
+xfs_inodegc_shrinker_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_mount	*mp = container_of(shrink, struct xfs_mount,
+						   m_inodegc_shrinker);
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	if (!xfs_is_inodegc_enabled(mp))
+		return 0;
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		if (!llist_empty(&gc->list))
+			return XFS_INODEGC_SHRINKER_COUNT;
+	}
+
+	return 0;
+}
+
+static unsigned long
+xfs_inodegc_shrinker_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_mount	*mp = container_of(shrink, struct xfs_mount,
+						   m_inodegc_shrinker);
+	struct xfs_inodegc	*gc;
+	int			cpu;
+	bool			no_items = true;
+
+	if (!xfs_is_inodegc_enabled(mp))
+		return SHRINK_STOP;
+
+	trace_xfs_inodegc_shrinker_scan(mp, sc, __return_address);
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		if (!llist_empty(&gc->list)) {
+			unsigned int	h = READ_ONCE(gc->shrinker_hits);
+
+			WRITE_ONCE(gc->shrinker_hits, h + 1);
+			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+			no_items = false;
+		}
+	}
+
+	/*
+	 * If there are no inodes to inactivate, we don't want the shrinker
+	 * to think there's deferred work to call us back about.
+	 */
+	if (no_items)
+		return LONG_MAX;
+
+	return SHRINK_STOP;
+}
+
+/* Register a shrinker so we can accelerate inodegc and throttle queuing. */
+int
+xfs_inodegc_register_shrinker(
+	struct xfs_mount	*mp)
+{
+	struct shrinker		*shrink = &mp->m_inodegc_shrinker;
+
+	shrink->count_objects = xfs_inodegc_shrinker_count;
+	shrink->scan_objects = xfs_inodegc_shrinker_scan;
+	shrink->seeks = 0;
+	shrink->flags = SHRINKER_NONSLAB;
+	shrink->batch = XFS_INODEGC_SHRINKER_BATCH;
+
+	return register_shrinker(shrink);
+}
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 18c2d224aa78..2e4cfddf8b8e 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -80,5 +80,6 @@ void xfs_inodegc_flush(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
 void xfs_inodegc_cpu_dead(struct xfs_mount *mp, unsigned int cpu);
+int xfs_inodegc_register_shrinker(struct xfs_mount *mp);
 
 #endif
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b81f2fc734bd..ff08192d8d2a 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -769,6 +769,10 @@ xfs_mountfs(
 		goto out_free_perag;
 	}
 
+	error = xfs_inodegc_register_shrinker(mp);
+	if (error)
+		goto out_fail_wait;
+
 	/*
 	 * Log's mount-time initialization. The first part of recovery can place
 	 * some items on the AIL, to be handled when recovery is finished or
@@ -779,7 +783,7 @@ xfs_mountfs(
 			      XFS_FSB_TO_BB(mp, sbp->sb_logblocks));
 	if (error) {
 		xfs_warn(mp, "log mount failed");
-		goto out_fail_wait;
+		goto out_inodegc_shrinker;
 	}
 
 	/* Make sure the summary counts are ok. */
@@ -974,6 +978,8 @@ xfs_mountfs(
 	xfs_unmount_flush_inodes(mp);
  out_log_dealloc:
 	xfs_log_mount_cancel(mp);
+ out_inodegc_shrinker:
+	unregister_shrinker(&mp->m_inodegc_shrinker);
  out_fail_wait:
 	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
 		xfs_buftarg_drain(mp->m_logdev_targp);
@@ -1054,6 +1060,7 @@ xfs_unmountfs(
 #if defined(DEBUG)
 	xfs_errortag_clearall(mp);
 #endif
+	unregister_shrinker(&mp->m_inodegc_shrinker);
 	xfs_free_perag(mp);
 
 	xfs_errortag_del(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 797b8068dfe6..3eb7b06f2eff 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -63,6 +63,7 @@ struct xfs_inodegc {
 	struct llist_head	list;
 	struct work_struct	work;
 	unsigned int		items;
+	unsigned int		shrinker_hits;
 };
 
 /*
@@ -208,6 +209,8 @@ typedef struct xfs_mount {
 	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
 	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
 
+	/* Memory shrinker to throttle and reprioritize inodegc */
+	struct shrinker		m_inodegc_shrinker;
 	/*
 	 * Workqueue item so that we can coalesce multiple inode flush attempts
 	 * into a single flush.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0a179cfc35c0..90ae884bfee6 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -157,6 +157,22 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
 
+TRACE_EVENT(xfs_inodegc_worker,
+	TP_PROTO(struct xfs_mount *mp, unsigned int shrinker_hits),
+	TP_ARGS(mp, shrinker_hits),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, shrinker_hits)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->shrinker_hits = shrinker_hits;
+	),
+	TP_printk("dev %d:%d shrinker_hits %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->shrinker_hits)
+);
+
 #define XFS_STATE_FLAGS \
 	{ (1UL << XFS_STATE_INODEGC_ENABLED),		"inodegc" }, \
 	{ (1UL << XFS_STATE_BLOCKGC_ENABLED),		"blockgc" }
@@ -195,7 +211,6 @@ DEFINE_EVENT(xfs_fs_class, name,					\
 DEFINE_FS_EVENT(xfs_inodegc_flush);
 DEFINE_FS_EVENT(xfs_inodegc_start);
 DEFINE_FS_EVENT(xfs_inodegc_stop);
-DEFINE_FS_EVENT(xfs_inodegc_worker);
 DEFINE_FS_EVENT(xfs_inodegc_queue);
 DEFINE_FS_EVENT(xfs_inodegc_throttle);
 DEFINE_FS_EVENT(xfs_fs_sync_fs);
@@ -204,6 +219,26 @@ DEFINE_FS_EVENT(xfs_blockgc_stop);
 DEFINE_FS_EVENT(xfs_blockgc_worker);
 DEFINE_FS_EVENT(xfs_blockgc_flush_all);
 
+TRACE_EVENT(xfs_inodegc_shrinker_scan,
+	TP_PROTO(struct xfs_mount *mp, struct shrink_control *sc,
+		 void *caller_ip),
+	TP_ARGS(mp, sc, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, nr_to_scan)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->nr_to_scan = sc->nr_to_scan;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d nr_to_scan %lu caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nr_to_scan,
+		  __entry->caller_ip)
+);
+
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
 	TP_ARGS(mp, agno),


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/14] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable
  2021-08-05  2:06 ` [PATCH 03/14] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
@ 2021-08-05  5:29   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:06:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Move the xfs_inactive call and all the other debugging checks and stats
> updates into xfs_inode_mark_reclaimable because most of that are
> implementation details about the inode cache.  This is preparation for
> deferred inactivation that is coming up.  We also move it around
> xfs_icache.c in preparation for deferred inactivation.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 04/14] xfs: detach dquots from inode if we don't need to inactivate it
  2021-08-05  2:06 ` [PATCH 04/14] xfs: detach dquots from inode if we don't need to inactivate it Darrick J. Wong
@ 2021-08-05  5:30   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:06:44PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> If we don't need to inactivate an inode, we can detach the dquots and
> move on to reclamation.  This isn't strictly required here; it's a
> preparation patch for deferred inactivation per reviewer request[1] to
> move the creation of xfs_inode_needs_inactivation into a separate
> change.  Eventually this !need_inactive chunk will turn into the code
> path for inodes that skip xfs_inactive and go straight to memory
> reclaim.
> 
> [1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

LGTM.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/14] xfs: queue inactivation immediately when free space is tight
  2021-08-05  2:06 ` [PATCH 06/14] xfs: queue inactivation immediately when free space is tight Darrick J. Wong
@ 2021-08-05  5:31   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:06:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we have made the inactivation of unlinked inodes a background
> task to increase the throughput of file deletions, we need to be a
> little more careful about how long of a delay we can tolerate.
> 
> On a mostly empty filesystem, the risk of the allocator making poor
> decisions due to fragmentation of the free space on account of a lengthy
> delay in background updates is minimal because there's plenty of space.
> However, if free space is tight, we want to deallocate unlinked inodes
> as quickly as possible to avoid fallocate ENOSPC and to give the
> allocator the best shot at optimal allocations for new writes.
> 
> Therefore, queue the percpu worker immediately if the filesystem is more
> than 95% full.  This follows the same principle that XFS becomes less
> aggressive about speculative allocations and lazy cleanup (and more
> precise about accounting) when nearing full.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Makes sense.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/14] xfs: queue inactivation immediately when quota is nearing enforcement
  2021-08-05  2:07 ` [PATCH 07/14] xfs: queue inactivation immediately when quota is nearing enforcement Darrick J. Wong
@ 2021-08-05  5:35   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:07:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we have made the inactivation of unlinked inodes a background
> task to increase the throughput of file deletions, we need to be a
> little more careful about how long of a delay we can tolerate.
> 
> Specifically, if the dquots attached to the inode being inactivated are
> nearing any kind of enforcement boundary, we want to queue that
> inactivation work immediately so that users don't get EDQUOT/ENOSPC
> errors even after they deleted a bunch of files to stay within quota.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_dquot.h  |   10 ++++++++++
>  fs/xfs/xfs_icache.c |   10 ++++++++++
>  fs/xfs/xfs_qm.c     |   34 ++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_quota.h  |    2 ++
>  4 files changed, 56 insertions(+)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>


-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 08/14] xfs: queue inactivation immediately when free realtime extents are tight
  2021-08-05  2:07 ` [PATCH 08/14] xfs: queue inactivation immediately when free realtime extents are tight Darrick J. Wong
@ 2021-08-05  5:36   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:07:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we have made the inactivation of unlinked inodes a background
> task to increase the throughput of file deletions, we need to be a
> little more careful about how long of a delay we can tolerate.
> 
> Similar to the patch doing this for free space on the data device, if
> the file being inactivated is a realtime file and the realtime volume is
> running low on free extents, we want to run the worker ASAP so that the
> realtime allocator can make better decisions.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |   24 ++++++++++++++++++++++++
>  fs/xfs/xfs_mount.c  |   13 ++++++++-----
>  fs/xfs/xfs_mount.h  |    3 ++-
>  3 files changed, 34 insertions(+), 6 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index e5e90f09bcc6..4a062cf689c3 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1924,6 +1924,27 @@ xfs_inodegc_start(
>  	xfs_inodegc_queue_all(mp);
>  }
>  
> +#ifdef CONFIG_XFS_RT
> +static inline bool
> +xfs_inodegc_want_queue_rt_file(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	uint64_t		freertx;
> +
> +	if (!XFS_IS_REALTIME_INODE(ip))
> +		return false;
> +
> +	spin_lock(&mp->m_sb_lock);
> +	freertx = mp->m_sb.sb_frextents;
> +	spin_unlock(&mp->m_sb_lock);

READ_ONCE() is probably sufficient here. We're not actually
serialising this against any specific operation, so I don't think
the lock is necessary to sample the value.
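
i.e. something like this (untested sketch):

	freertx = READ_ONCE(mp->m_sb.sb_frextents);

in place of the m_sb_lock'd read above.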

Other than that, all good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/14] xfs: inactivate inodes any time we try to free speculative preallocations
  2021-08-05  2:07 ` [PATCH 09/14] xfs: inactivate inodes any time we try to free speculative preallocations Darrick J. Wong
@ 2021-08-05  5:36   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:07:12PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Other parts of XFS have learned to call xfs_blockgc_free_{space,quota}
> to try to free speculative preallocations when space is tight.  This
> means that file writes, transaction reservation failures, quota limit
> enforcement, and the EOFBLOCKS ioctl all call this function to free
> space when things are tight.
> 
> Since inode inactivation is now a background task, this means that the
> filesystem can be hanging on to unlinked but not yet freed space.  Add
> this to the list of things that xfs_blockgc_free_* makes writer threads
> scan for when they cannot reserve space.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)

Yup.
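
For anyone reading along without the patch handy, the gist is roughly
this (a paraphrased sketch, not the actual hunk; the xfs_icwalk call is
assumed from the existing blockgc code):

	int
	xfs_blockgc_free_space(
		struct xfs_mount	*mp,
		struct xfs_icwalk	*icw)
	{
		int			error;

		/* Trim speculative preallocations as before... */
		error = xfs_icwalk(mp, XFS_ICWALK_BLOCKGC, icw);
		if (error)
			return error;

		/* ...and also push any queued inode inactivations. */
		xfs_inodegc_flush(mp);
		return 0;
	}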

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/14] xfs: flush inode inactivation work when compiling usage statistics
  2021-08-05  2:07 ` [PATCH 10/14] xfs: flush inode inactivation work when compiling usage statistics Darrick J. Wong
@ 2021-08-05  5:38   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:07:17PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Users have come to expect that the space accounting information in
> statfs and getquota reports is fairly accurate.  Now that we inactivate
> inodes from a background queue, these numbers can be thrown off by
> whatever resources are singly-owned by the inodes in the queue.  Flush
> the pending inactivations when userspace asks for a space usage report.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_qm_syscalls.c |    8 ++++++++
>  fs/xfs/xfs_super.c       |    3 +++
>  2 files changed, 11 insertions(+)

Makes sense. These aren't perf/latency sensitive operations, so
waiting on completion of a flush shouldn't be a big deal.
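
For anyone following along, the change boils down to a flush before the
counters are sampled, something like this (sketch, not the actual
hunks):

	/* In xfs_fs_statfs(), before reading the free space counters: */
	xfs_inodegc_flush(mp);

plus the equivalent call ahead of the quota reporting paths in
fs/xfs/xfs_qm_syscalls.c.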

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 11/14] xfs: don't run speculative preallocation gc when fs is frozen
  2021-08-05  2:07 ` [PATCH 11/14] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
@ 2021-08-05  5:40   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:07:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we have the infrastructure to switch background workers on and
> off at will, fix the block gc worker code so that we don't actually run
> the worker when the filesystem is frozen, same as we do for deferred
> inactivation.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/common.c |    9 +++++----
>  fs/xfs/xfs_icache.c   |   27 +++++++++++++++++++++++----
>  fs/xfs/xfs_mount.c    |    1 +
>  fs/xfs/xfs_mount.h    |    8 ++++++++
>  fs/xfs/xfs_super.c    |    9 ++++++---
>  fs/xfs/xfs_trace.h    |    6 +++++-
>  6 files changed, 48 insertions(+), 12 deletions(-)

Looks good.
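
In other words, the worker now starts with an "are we allowed to run"
check, roughly like this (sketch; the helper name is a guess based on
the new opstate bits, not copied from the patch):

	/* Don't run the blockgc scan while the filesystem is frozen. */
	if (!xfs_is_blockgc_enabled(mp))
		return;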

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 12/14] xfs: use background worker pool when transactions can't get free space
  2021-08-05  2:07 ` [PATCH 12/14] xfs: use background worker pool when transactions can't get free space Darrick J. Wong
@ 2021-08-05  5:42   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:07:28PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
> call xfs_blockgc_free_space with a NULL icwalk structure to try to free
> space.  Each frontend thread that encounters this situation starts its
> own walk of the inode cache to see if it can find anything, which is
> wasteful since we don't have any additional selection criteria.  For
> this one common case, create a function that reschedules all pending
> background work immediately and flushes the workqueue so that the scan
> can run in parallel.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |   28 ++++++++++++++++++++++++++++
>  fs/xfs/xfs_icache.h |    1 +
>  fs/xfs/xfs_trace.h  |    1 +
>  fs/xfs/xfs_trans.c  |    5 +----
>  4 files changed, 31 insertions(+), 4 deletions(-)

Yup, looks a bit more efficient to do it this way.
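
The helper described above has roughly this shape (sketch only; the
per-AG delayed work and workqueue names are assumed from the existing
blockgc code, not copied from the patch):

	/*
	 * Kick every queued blockgc worker to run right now and wait for
	 * them all, so ENOSPC-hitting transactions share one parallel scan
	 * instead of each thread walking the inode cache by itself.
	 */
	void
	xfs_blockgc_flush_all(
		struct xfs_mount	*mp)
	{
		struct xfs_perag	*pag;
		xfs_agnumber_t		agno;

		/* Move each queued worker's run time up to "now"... */
		for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
			mod_delayed_work(mp->m_blockgc_wq,
					&pag->pag_blockgc_work, 0);

		/* ...then wait for all of them to finish. */
		for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
			flush_delayed_work(&pag->pag_blockgc_work);
	}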

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 14/14] xfs: throttle inode inactivation queuing on memory reclaim
  2021-08-05  2:07 ` [PATCH 14/14] xfs: throttle inode inactivation queuing on memory reclaim Darrick J. Wong
@ 2021-08-05  5:44   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  5:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 07:07:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we defer inode inactivation, we've decoupled the process of
> unlinking or closing an inode from the process of inactivating it.  In
> theory this should lead to better throughput since we now inactivate the
> queued inodes in batches instead of one at a time.
> 
> Unfortunately, one of the primary risks with this decoupling is the loss
> of rate control feedback between the frontend and background threads.
> In other words, a rm -rf /* thread can run the system out of memory if
> it can queue inodes for inactivation and jump to a new CPU faster than
> the background threads can actually clear the deferred work.  The
> workers can get scheduled off the CPU if they have to do IO, etc.
> 
> To solve this problem, we configure a shrinker so that it will activate
> the /second/ time the shrinkers are called.  The custom shrinker will
> queue all percpu deferred inactivation workers immediately and set a
> flag to force frontend callers who are releasing a vfs inode to wait for
> the inactivation workers.
> 
> On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
> most of the OOMing problem when deleting 10 million inodes.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_icache.h |    1 +
>  fs/xfs/xfs_mount.c  |    9 ++++-
>  fs/xfs/xfs_mount.h  |    3 ++
>  fs/xfs/xfs_trace.h  |   37 ++++++++++++++++++-
>  5 files changed, 147 insertions(+), 5 deletions(-)

I'm still not really convinced this is the right way to go here, but
it doesn't hurt much so let's run with it for now. When I rework the
inode reclaim shrinker hooks I'll revisit this.
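
For reference, the mechanism in question is the usual count/scan
shrinker pattern pointed at the per-cpu inodegc queues - roughly this
shape (sketch only; the per-cpu structure and field names are
illustrative, not lifted from the patch):

	/*
	 * Advertise enough "freeable" objects that the scan callback tends
	 * to fire on the second shrinker pass (see the commit message).
	 */
	#define XFS_INODEGC_SHRINKER_COUNT	(1UL << DEF_PRIORITY)

	static unsigned long
	xfs_inodegc_shrinker_count(
		struct shrinker		*shrink,
		struct shrink_control	*sc)
	{
		struct xfs_mount	*mp = container_of(shrink,
					struct xfs_mount, m_inodegc_shrinker);
		struct xfs_inodegc	*gc;
		int			cpu;

		/* Only report work if something is actually queued. */
		for_each_online_cpu(cpu) {
			gc = per_cpu_ptr(mp->m_inodegc, cpu);
			if (!llist_empty(&gc->list))
				return XFS_INODEGC_SHRINKER_COUNT;
		}
		return 0;
	}

	static unsigned long
	xfs_inodegc_shrinker_scan(
		struct shrinker		*shrink,
		struct shrink_control	*sc)
	{
		struct xfs_mount	*mp = container_of(shrink,
					struct xfs_mount, m_inodegc_shrinker);
		struct xfs_inodegc	*gc;
		int			cpu;

		/*
		 * Kick every per-cpu worker and record a shrinker hit so
		 * that frontend callers start waiting on the workers.
		 */
		for_each_online_cpu(cpu) {
			gc = per_cpu_ptr(mp->m_inodegc, cpu);
			if (!llist_empty(&gc->list)) {
				WRITE_ONCE(gc->shrinker_hits,
						gc->shrinker_hits + 1);
				queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
			}
		}

		/* Nothing is freed directly here. */
		return SHRINK_STOP;
	}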

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues
  2021-08-05  2:06 ` [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues Darrick J. Wong
@ 2021-08-05  6:43   ` Dave Chinner
  2021-08-05  7:00     ` Darrick J. Wong
  2021-08-07  0:21   ` Darrick J. Wong
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2021-08-05  6:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, hch

On Wed, Aug 04, 2021 at 07:06:50PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Move inode inactivation to background work contexts so that it no
> longer runs in the context that releases the final reference to an
> inode. This will allow process work that ends up blocking on
> inactivation to continue doing work while the filesystem processes
> the inactivation in the background.
> 
> A typical demonstration of this is unlinking an inode with lots of
> extents. The extents are removed during inactivation, so this blocks
> the process that unlinked the inode from the directory structure. By
> moving the inactivation to the background process, the userspace
> application can keep working (e.g. unlinking the next inode in the
> directory) while the inactivation work on the previous inode is
> done by a different CPU.
> 
> The implementation of the queue is relatively simple. We use a
> per-cpu lockless linked list (llist) to queue inodes for
> inactivation without requiring serialisation mechanisms, and a work
> item to allow the queue to be processed by a CPU bound worker
> thread. We also keep a count of the queue depth so that we can
> trigger work after a number of deferred inactivations have been
> queued.
> 
> The use of a bound workqueue with a single work depth allows the
> workqueue to run one work item per CPU. We queue the work item on
> the CPU we are currently running on, and so this essentially gives
> us affine per-cpu worker threads for the per-cpu queues. This
> maintains the effective CPU affinity that occurs within XFS at the
> AG level due to all objects in a directory being local to an AG.
> Hence inactivation work tends to run on the same CPU that last
> accessed all the objects that inactivation accesses and this
> maintains hot CPU caches for unlink workloads.
> 
> A depth of 32 inodes was chosen to match the number of inodes in an
> inode cluster buffer. This hopefully allows sequential
> allocation/unlink behaviours to defer inactivation of all the
> inodes in a single cluster buffer at a time, further helping
> maintain hot CPU and buffer cache accesses while running
> inactivations.
> 
> A hard per-cpu queue throttle of 256 inodes has been set to avoid
> runaway queuing when inodes that take a long time to inactivate are
> being processed. For example, when unlinking inodes with large
> numbers of extents that can take a lot of processing to free.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [djwong: tweak comments and tracepoints, convert opflags to state bits]
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
.....
> @@ -740,6 +751,25 @@ xfs_fs_sync_fs(
>  		flush_delayed_work(&mp->m_log->l_work);
>  	}
>  
> +	/*
> +	 * Flush all deferred inode inactivation work so that the free space
> +	 * counters will reflect recent deletions.  Do not force the log again
> +	 * because log recovery can restart the inactivation from the info that
> +	 * we just wrote into the ondisk log.
> +	 *
> +	 * For regular operation this isn't strictly necessary since we aren't
> +	 * required to guarantee that unlinking frees space immediately, but
> +	 * that is how XFS historically behaved.
> +	 *
> +	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
> +	 * last chance to complete the inactivation work before the filesystem
> +	 * freezes and the log is quiesced.  The background worker will not
> +	 * activate again until the fs is thawed because the VFS won't evict
> +	 * any more inodes until freeze_super drops s_umount and we disable the
> +	 * worker in xfs_fs_freeze.
> +	 */
> +	xfs_inodegc_flush(mp);
> +
>  	return 0;
>  }
>  
> @@ -854,6 +884,17 @@ xfs_fs_freeze(
>  	 */
>  	flags = memalloc_nofs_save();
>  	xfs_blockgc_stop(mp);
> +
> +	/*
> +	 * Stop the inodegc background worker.  freeze_super already flushed
> +	 * all pending inodegc work when it sync'd the filesystem after setting
> +	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
> +	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
> +	 * If the filesystem is read-write, inactivated inodes will queue but
> +	 * the worker will not run until the filesystem thaws or unmounts.
> +	 */
> +	xfs_inodegc_stop(mp);
> +
>  	xfs_save_resvblks(mp);
>  	ret = xfs_log_quiesce(mp);
>  	memalloc_nofs_restore(flags);

I still think this freeze handling is problematic. While I can't easily trigger
the problem I saw, I still don't really see what makes the flush in
xfs_fs_sync_fs() prevent races with the final stage of freeze before
inactivation is stopped......

.... and ....

as I write this the xfs/517 loop goes boom on my pmem test setup (but no DAX):

SECTION       -- xfs
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test3 5.14.0-rc4-dgc #506 SMP PREEMPT Thu Aug 5 15:49:49 AEST 2021
MKFS_OPTIONS  -- -f -m rmapbt=1 /dev/pmem1
MOUNT_OPTIONS -- -o dax=never -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch

generic/390 3s ...  3s
xfs/517 43s ... 
Message from syslogd@test3 at Aug  5 15:56:24 ...
kernel:[  162.849634] XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1889

I suspect that we could actually target this better and close the
race by doing something like:

xfs_fs_sync_fs()
{
	....

	/*
	 * If we are called with page faults frozen out, it means we are about
	 * to freeze the transaction subsystem. Take the opportunity to shut
	 * down inodegc because once SB_FREEZE_FS is set it's too late to
	 * prevent inactivation races with freeze. The fs doesn't get called
	 * again by the freezing process until after SB_FREEZE_FS has been set,
	 * so it's now or never.
	 *
	 * We don't care if this is a normal syncfs call that does this or
	 * freeze that does this - we can run this multiple times without issue
	 * and we won't race with a restart because a restart can only occur when
	 * the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
	 */
	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT)
		xfs_inodegc_stop(mp);
}

xfs_fs_freeze()
{
.....
error:
	/*
	 * We need to restart the inodegc on error because we stopped it at
	 * SB_FREEZE_PAGEFAULT level and a thaw is not going to be run to
	 * restart it now. We are at SB_FREEZE_FS level here, so we can restart
	 * safely without racing with a stop in xfs_fs_sync_fs().
	 */
	if (error)
		xfs_inodegc_start(mp);
	return error;
}

The stop and "restart on error" are done under the same s_umount hold, so they
are atomic w.r.t. other freeze operations and so don't have the problems with
nested freeze/thaw (g/390) that my last patch had.

Your thoughts?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues
  2021-08-05  6:43   ` Dave Chinner
@ 2021-08-05  7:00     ` Darrick J. Wong
  2021-08-05 22:15       ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05  7:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Chinner, linux-xfs, hch

On Thu, Aug 05, 2021 at 04:43:24PM +1000, Dave Chinner wrote:
> On Wed, Aug 04, 2021 at 07:06:50PM -0700, Darrick J. Wong wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Move inode inactivation to background work contexts so that it no
> > longer runs in the context that releases the final reference to an
> > inode. This will allow process work that ends up blocking on
> > inactivation to continue doing work while the filesystem processes
> > the inactivation in the background.
> > 
> > A typical demonstration of this is unlinking an inode with lots of
> > extents. The extents are removed during inactivation, so this blocks
> > the process that unlinked the inode from the directory structure. By
> > moving the inactivation to the background process, the userspace
> > application can keep working (e.g. unlinking the next inode in the
> > directory) while the inactivation work on the previous inode is
> > done by a different CPU.
> > 
> > The implementation of the queue is relatively simple. We use a
> > per-cpu lockless linked list (llist) to queue inodes for
> > inactivation without requiring serialisation mechanisms, and a work
> > item to allow the queue to be processed by a CPU bound worker
> > thread. We also keep a count of the queue depth so that we can
> > trigger work after a number of deferred inactivations have been
> > queued.
> > 
> > The use of a bound workqueue with a single work depth allows the
> > workqueue to run one work item per CPU. We queue the work item on
> > the CPU we are currently running on, and so this essentially gives
> > us affine per-cpu worker threads for the per-cpu queues. This
> > maintains the effective CPU affinity that occurs within XFS at the
> > AG level due to all objects in a directory being local to an AG.
> > Hence inactivation work tends to run on the same CPU that last
> > accessed all the objects that inactivation accesses and this
> > maintains hot CPU caches for unlink workloads.
> > 
> > A depth of 32 inodes was chosen to match the number of inodes in an
> > inode cluster buffer. This hopefully allows sequential
> > allocation/unlink behaviours to defer inactivation of all the
> > inodes in a single cluster buffer at a time, further helping
> > maintain hot CPU and buffer cache accesses while running
> > inactivations.
> > 
> > A hard per-cpu queue throttle of 256 inodes has been set to avoid
> > runaway queuing when inodes that take a long time to inactivate are
> > being processed. For example, when unlinking inodes with large
> > numbers of extents that can take a lot of processing to free.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > [djwong: tweak comments and tracepoints, convert opflags to state bits]
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> .....
> > @@ -740,6 +751,25 @@ xfs_fs_sync_fs(
> >  		flush_delayed_work(&mp->m_log->l_work);
> >  	}
> >  
> > +	/*
> > +	 * Flush all deferred inode inactivation work so that the free space
> > +	 * counters will reflect recent deletions.  Do not force the log again
> > +	 * because log recovery can restart the inactivation from the info that
> > +	 * we just wrote into the ondisk log.
> > +	 *
> > +	 * For regular operation this isn't strictly necessary since we aren't
> > +	 * required to guarantee that unlinking frees space immediately, but
> > +	 * that is how XFS historically behaved.
> > +	 *
> > +	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
> > +	 * last chance to complete the inactivation work before the filesystem
> > +	 * freezes and the log is quiesced.  The background worker will not
> > +	 * activate again until the fs is thawed because the VFS won't evict
> > +	 * any more inodes until freeze_super drops s_umount and we disable the
> > +	 * worker in xfs_fs_freeze.
> > +	 */
> > +	xfs_inodegc_flush(mp);
> > +
> >  	return 0;
> >  }
> >  
> > @@ -854,6 +884,17 @@ xfs_fs_freeze(
> >  	 */
> >  	flags = memalloc_nofs_save();
> >  	xfs_blockgc_stop(mp);
> > +
> > +	/*
> > +	 * Stop the inodegc background worker.  freeze_super already flushed
> > +	 * all pending inodegc work when it sync'd the filesystem after setting
> > +	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
> > +	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
> > +	 * If the filesystem is read-write, inactivated inodes will queue but
> > +	 * the worker will not run until the filesystem thaws or unmounts.
> > +	 */
> > +	xfs_inodegc_stop(mp);
> > +
> >  	xfs_save_resvblks(mp);
> >  	ret = xfs_log_quiesce(mp);
> >  	memalloc_nofs_restore(flags);
> 
> I still think this freeze handling is problematic. While I can't easily trigger
> the problem I saw, I still don't really see what makes the flush in
> xfs_fs_sync_fs() prevent races with the final stage of freeze before
> inactivation is stopped......
> 
> .... and ....
> 
> as I write this the xfs/517 loop goes boom on my pmem test setup (but no DAX):
> 
> SECTION       -- xfs
> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 test3 5.14.0-rc4-dgc #506 SMP PREEMPT Thu Aug 5 15:49:49 AEST 2021
> MKFS_OPTIONS  -- -f -m rmapbt=1 /dev/pmem1
> MOUNT_OPTIONS -- -o dax=never -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch
> 
> generic/390 3s ...  3s
> xfs/517 43s ... 
> Message from syslogd@test3 at Aug  5 15:56:24 ...
> kernel:[  162.849634] XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1889
> 
> I suspect that we could actually target this better and close the
> race by doing something like:
> 
> xfs_fs_sync_fs()
> {
> 	....
> 
> 	/*
> 	 * If we are called with page faults frozen out, it means we are about
> 	 * to freeze the transaction subsystem. Take the opportunity to shut
> 	 * down inodegc because once SB_FREEZE_FS is set it's too late to
> 	 * prevent inactivation races with freeze. The fs doesn't get called
> 	 * again by the freezing process until after SB_FREEZE_FS has been set,
> 	 * so it's now or never.
> 	 *
> 	 * We don't care if this is a normal syncfs call that does this or
> 	 * freeze that does this - we can run this multiple times without issue
> 	 * and we won't race with a restart because a restart can only occur when
> 	 * the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> 	 */
> 	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT)
> 		xfs_inodegc_stop(mp);

LOL, a previous version of this series actually did this part this way,
but...

> }
> 
> xfs_fs_freeze()
> {
> .....
> error:
> 	/*
> 	 * We need to restart the inodegc on error because we stopped it at
> 	 * SB_FREEZE_PAGEFAULT level and a thaw is not going to be run to
> 	 * restart it now. We are at SB_FREEZE_FS level here, so we can restart
> 	 * safely without racing with a stop in xfs_fs_sync_fs().
> 	 */
> 	if (error)
> 		xfs_inodegc_start(mp);

...missed this part.  If this fixes x517 and doesn't break g390 for you,
I'll meld it into the series.  I think the reasoning here makes sense.

--D

> 	return error;
> }
> 
> The stop and "restart on error" are done under the same s_umount hold, so they
> are atomic w.r.t. other freeze operations and so don't have the problems with
> nested freeze/thaw (g/390) that my last patch had.
> 
> Your thoughts?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues
  2021-08-05  7:00     ` Darrick J. Wong
@ 2021-08-05 22:15       ` Dave Chinner
  2021-08-05 22:38         ` Darrick J. Wong
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2021-08-05 22:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, hch

On Thu, Aug 05, 2021 at 12:00:32AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 05, 2021 at 04:43:24PM +1000, Dave Chinner wrote:
> > On Wed, Aug 04, 2021 at 07:06:50PM -0700, Darrick J. Wong wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Move inode inactivation to background work contexts so that it no
> > > longer runs in the context that releases the final reference to an
> > > inode. This will allow process work that ends up blocking on
> > > inactivation to continue doing work while the filesystem processes
> > > the inactivation in the background.
....
> > > @@ -854,6 +884,17 @@ xfs_fs_freeze(
> > >  	 */
> > >  	flags = memalloc_nofs_save();
> > >  	xfs_blockgc_stop(mp);
> > > +
> > > +	/*
> > > +	 * Stop the inodegc background worker.  freeze_super already flushed
> > > +	 * all pending inodegc work when it sync'd the filesystem after setting
> > > +	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
> > > +	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
> > > +	 * If the filesystem is read-write, inactivated inodes will queue but
> > > +	 * the worker will not run until the filesystem thaws or unmounts.
> > > +	 */
> > > +	xfs_inodegc_stop(mp);
> > > +
> > >  	xfs_save_resvblks(mp);
> > >  	ret = xfs_log_quiesce(mp);
> > >  	memalloc_nofs_restore(flags);
> > 
> > I still think this freeze handling is problematic. While I can't easily trigger
> > the problem I saw, I still don't really see what makes the flush in
> > xfs_fs_sync_fs() prevent races with the final stage of freeze before
> > inactivation is stopped......
> > 
> > .... and ....
> > 
> > as I write this the xfs/517 loop goes boom on my pmem test setup (but no DAX):
> > 
> > SECTION       -- xfs
> > FSTYP         -- xfs (debug)
> > PLATFORM      -- Linux/x86_64 test3 5.14.0-rc4-dgc #506 SMP PREEMPT Thu Aug 5 15:49:49 AEST 2021
> > MKFS_OPTIONS  -- -f -m rmapbt=1 /dev/pmem1
> > MOUNT_OPTIONS -- -o dax=never -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch
> > 
> > generic/390 3s ...  3s
> > xfs/517 43s ... 
> > Message from syslogd@test3 at Aug  5 15:56:24 ...
> > kernel:[  162.849634] XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1889
> > 
> > I suspect that we could actually target this better and close the
> > race by doing something like:
> > 
> > xfs_fs_sync_fs()
> > {
> > 	....
> > 
> > 	/*
> > 	 * If we are called with page faults frozen out, it means we are about
> > 	 * to freeze the transaction subsystem. Take the opportunity to shut
> > 	 * down inodegc because once SB_FREEZE_FS is set it's too late to
> > 	 * prevent inactivation races with freeze. The fs doesn't get called
> > 	 * again by the freezing process until after SB_FREEZE_FS has been set,
> > 	 * so it's now or never.
> > 	 *
> > 	 * We don't care if this is a normal syncfs call that does this or
> > 	 * freeze that does this - we can run this multiple times without issue
> > 	 * and we won't race with a restart because a restart can only occur when
> > 	 * the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > 	 */
> > 	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT)
> > 		xfs_inodegc_stop(mp);
> 
> LOL, a previous version of this series actually did this part this way,
> but...
> 
> > }
> > 
> > xfs_fs_freeze()
> > {
> > .....
> > error:
> > 	/*
> > 	 * We need to restart the inodegc on error because we stopped it at
> > 	 * SB_FREEZE_PAGEFAULT level and a thaw is not going to be run to
> > 	 * restart it now. We are at SB_FREEZE_FS level here, so we can restart
> > 	 * safely without racing with a stop in xfs_fs_sync_fs().
> > 	 */
> > 	if (error)
> > 		xfs_inodegc_start(mp);
> 
> ...missed this part.  If this fixes x517 and doesn't break g390 for you,
> I'll meld it into the series.  I think the reasoning here makes sense.

Nope, both x517 and g390 still fire this assert, so there's
something else we're missing here.

I keep wondering if we should be wrapping the entire flush mechanism
in a rwsem - read for flush, write for start/stop - so that we
aren't ever still processing a stop while a concurrent start runs or
vice versa...
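
Something like this shape is what I'm picturing (pure sketch; the rwsem
field and workqueue names are made up for illustration):

	/* in struct xfs_mount */
	struct rw_semaphore	m_inodegc_sem;

	void
	xfs_inodegc_flush(
		struct xfs_mount	*mp)
	{
		/* Any number of flushers can run concurrently... */
		down_read(&mp->m_inodegc_sem);
		xfs_inodegc_queue_all(mp);
		flush_workqueue(mp->m_inodegc_wq);
		up_read(&mp->m_inodegc_sem);
	}

	void
	xfs_inodegc_stop(
		struct xfs_mount	*mp)
	{
		/* ...while start/stop exclude flushes and each other. */
		down_write(&mp->m_inodegc_sem);
		clear_bit(XFS_STATE_INODEGC_ENABLED, &mp->m_opstate);
		drain_workqueue(mp->m_inodegc_wq);
		up_write(&mp->m_inodegc_sem);
	}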

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues
  2021-08-05 22:15       ` Dave Chinner
@ 2021-08-05 22:38         ` Darrick J. Wong
  0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-05 22:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Chinner, linux-xfs, hch

On Fri, Aug 06, 2021 at 08:15:02AM +1000, Dave Chinner wrote:
> On Thu, Aug 05, 2021 at 12:00:32AM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 05, 2021 at 04:43:24PM +1000, Dave Chinner wrote:
> > > On Wed, Aug 04, 2021 at 07:06:50PM -0700, Darrick J. Wong wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Move inode inactivation to background work contexts so that it no
> > > > longer runs in the context that releases the final reference to an
> > > > inode. This will allow process work that ends up blocking on
> > > > inactivation to continue doing work while the filesystem processes
> > > > the inactivation in the background.
> ....
> > > > @@ -854,6 +884,17 @@ xfs_fs_freeze(
> > > >  	 */
> > > >  	flags = memalloc_nofs_save();
> > > >  	xfs_blockgc_stop(mp);
> > > > +
> > > > +	/*
> > > > +	 * Stop the inodegc background worker.  freeze_super already flushed
> > > > +	 * all pending inodegc work when it sync'd the filesystem after setting
> > > > +	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
> > > > +	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
> > > > +	 * If the filesystem is read-write, inactivated inodes will queue but
> > > > +	 * the worker will not run until the filesystem thaws or unmounts.
> > > > +	 */
> > > > +	xfs_inodegc_stop(mp);
> > > > +
> > > >  	xfs_save_resvblks(mp);
> > > >  	ret = xfs_log_quiesce(mp);
> > > >  	memalloc_nofs_restore(flags);
> > > 
> > > I still think this freeze handling is problematic. While I can't easily trigger
> > > the problem I saw, I still don't really see what makes the flush in
> > > xfs_fs_sync_fs() prevent races with the final stage of freeze before
> > > inactivation is stopped......
> > > 
> > > .... and ....
> > > 
> > > as I write this the xfs/517 loop goes boom on my pmem test setup (but no DAX):
> > > 
> > > SECTION       -- xfs
> > > FSTYP         -- xfs (debug)
> > > PLATFORM      -- Linux/x86_64 test3 5.14.0-rc4-dgc #506 SMP PREEMPT Thu Aug 5 15:49:49 AEST 2021
> > > MKFS_OPTIONS  -- -f -m rmapbt=1 /dev/pmem1
> > > MOUNT_OPTIONS -- -o dax=never -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch
> > > 
> > > generic/390 3s ...  3s
> > > xfs/517 43s ... 
> > > Message from syslogd@test3 at Aug  5 15:56:24 ...
> > > kernel:[  162.849634] XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1889
> > > 
> > > I suspect that we could actually target this better and close the
> > > race by doing something like:
> > > 
> > > xfs_fs_sync_fs()
> > > {
> > > 	....
> > > 
> > > 	/*
> > > 	 * If we are called with page faults frozen out, it means we are about
> > > 	 * to freeze the transaction subsystem. Take the opportunity to shut
> > > 	 * down inodegc because once SB_FREEZE_FS is set it's too late to
> > > 	 * prevent inactivation races with freeze. The fs doesn't get called
> > > 	 * again by the freezing process until after SB_FREEZE_FS has been set,
> > > 	 * so it's now or never.
> > > 	 *
> > > 	 * We don't care if this is a normal syncfs call that does this or
> > > 	 * freeze that does this - we can run this multiple times without issue
> > > 	 * and we won't race with a restart because a restart can only occur when
> > > 	 * the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > 	 */
> > > 	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT)
> > > 		xfs_inodegc_stop(mp);
> > 
> > LOL, a previous version of this series actually did this part this way,
> > but...
> > 
> > > }
> > > 
> > > xfs_fs_freeze()
> > > {
> > > .....
> > > error:
> > > 	/*
> > > 	 * We need to restart the inodegc on error because we stopped it at
> > > 	 * SB_FREEZE_PAGEFAULT level and a thaw is not going to be run to
> > > 	 * restart it now. We are at SB_FREEZE_FS level here, so we can restart
> > > 	 * safely without racing with a stop in xfs_fs_sync_fs().
> > > 	 */
> > > 	if (error)
> > > 		xfs_inodegc_start(mp);
> > 
> > ...missed this part.  If this fixes x517 and doesn't break g390 for you,
> > I'll meld it into the series.  I think the reasoning here makes sense.
> 
> Nope, both x517 and g390 still fire this assert, so there's
> something else we're missing here.
> 
> I keep wondering if we should be wrapping the entire flush mechanism
> in a rwsem - read for flush, write for start/stop - so that we
> aren't ever still processing a stop while a concurrent start runs or
> vice versa...

Funny you should mention that, I /do/ have a patch in djwong-dev adding
such a rwsem, though for different purposes (permitting scrub to lock
out freeze requests from userspace).

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues
  2021-08-05  2:06 ` [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues Darrick J. Wong
  2021-08-05  6:43   ` Dave Chinner
@ 2021-08-07  0:21   ` Darrick J. Wong
  2021-08-07 21:49     ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-07  0:21 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs, david, hch

On Wed, Aug 04, 2021 at 07:06:50PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>

<megasnip> A couple of minor changes that aren't worth reposting the
entire series:

> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index b9214733d0c3..fedfa40e3cd6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c

<snip>

> @@ -1767,30 +1801,276 @@ xfs_inode_mark_reclaimable(
>  		ASSERT(0);
>  	}
>  
> +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> +	spin_lock(&pag->pag_ici_lock);
> +	spin_lock(&ip->i_flags_lock);
> +
> +	trace_xfs_inode_set_reclaimable(ip);
> +	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> +	ip->i_flags |= XFS_IRECLAIMABLE;
> +	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> +			XFS_ICI_RECLAIM_TAG);
> +
> +	spin_unlock(&ip->i_flags_lock);
> +	spin_unlock(&pag->pag_ici_lock);
> +	xfs_perag_put(pag);
> +}
> +
> +/*
> + * Free all speculative preallocations and possibly even the inode itself.
> + * This is the last chance to make changes to an otherwise unreferenced file
> + * before incore reclamation happens.
> + */
> +static void
> +xfs_inodegc_inactivate(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount        *mp = ip->i_mount;
> +
> +	/*
> +	* Inactivation isn't supposed to run when the fs is frozen because
> +	* we don't want kernel threads to block on transaction allocation.
> +	*/
> +	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
> +

I solved the problems Dave was complaining about (g/390, x/517) by
removing this ASSERT.

> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 19260291ff8b..bd8abb50b33a 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -157,6 +157,48 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
>  DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
>  DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
>  
> +#define XFS_STATE_FLAGS \
> +	{ (1UL << XFS_STATE_INODEGC_ENABLED),		"inodegc" }

I've also changed the name of this to XFS_OPSTATE_STRINGS because we use
_STRINGS everywhere else in this file.

--D

> +
> +DECLARE_EVENT_CLASS(xfs_fs_class,
> +	TP_PROTO(struct xfs_mount *mp, void *caller_ip),
> +	TP_ARGS(mp, caller_ip),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned long long, mflags)
> +		__field(unsigned long, opstate)
> +		__field(unsigned long, sbflags)
> +		__field(void *, caller_ip)
> +	),
> +	TP_fast_assign(
> +		if (mp) {
> +			__entry->dev = mp->m_super->s_dev;
> +			__entry->mflags = mp->m_flags;
> +			__entry->opstate = mp->m_opstate;
> +			__entry->sbflags = mp->m_super->s_flags;
> +		}
> +		__entry->caller_ip = caller_ip;
> +	),
> +	TP_printk("dev %d:%d m_flags 0x%llx opstate (%s) s_flags 0x%lx caller %pS",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->mflags,
> +		  __print_flags(__entry->opstate, "|", XFS_STATE_FLAGS),
> +		  __entry->sbflags,
> +		  __entry->caller_ip)
> +);
> +
> +#define DEFINE_FS_EVENT(name)	\
> +DEFINE_EVENT(xfs_fs_class, name,					\
> +	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
> +	TP_ARGS(mp, caller_ip))
> +DEFINE_FS_EVENT(xfs_inodegc_flush);
> +DEFINE_FS_EVENT(xfs_inodegc_start);
> +DEFINE_FS_EVENT(xfs_inodegc_stop);
> +DEFINE_FS_EVENT(xfs_inodegc_worker);
> +DEFINE_FS_EVENT(xfs_inodegc_queue);
> +DEFINE_FS_EVENT(xfs_inodegc_throttle);
> +DEFINE_FS_EVENT(xfs_fs_sync_fs);
> +
>  DECLARE_EVENT_CLASS(xfs_ag_class,
>  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
>  	TP_ARGS(mp, agno),
> @@ -616,14 +658,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
>  	TP_STRUCT__entry(
>  		__field(dev_t, dev)
>  		__field(xfs_ino_t, ino)
> +		__field(unsigned long, iflags)
>  	),
>  	TP_fast_assign(
>  		__entry->dev = VFS_I(ip)->i_sb->s_dev;
>  		__entry->ino = ip->i_ino;
> +		__entry->iflags = ip->i_flags;
>  	),
> -	TP_printk("dev %d:%d ino 0x%llx",
> +	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  __entry->ino)
> +		  __entry->ino,
> +		  __entry->iflags)
>  )
>  
>  #define DEFINE_INODE_EVENT(name) \
> @@ -667,6 +712,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
>  DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
>  DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
>  DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
> +DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
> +DEFINE_INODE_EVENT(xfs_inode_reclaiming);
> +DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
> +DEFINE_INODE_EVENT(xfs_inode_inactivating);
>  
>  /*
>   * ftrace's __print_symbolic requires that all enum values be wrapped in the
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues
  2021-08-07  0:21   ` Darrick J. Wong
@ 2021-08-07 21:49     ` Dave Chinner
  2021-08-09 23:36       ` Darrick J. Wong
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2021-08-07 21:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, hch

On Fri, Aug 06, 2021 at 05:21:04PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 04, 2021 at 07:06:50PM -0700, Darrick J. Wong wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> 
> <megasnip> A couple of minor changes that aren't worth reposting the
> entire series:
> 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index b9214733d0c3..fedfa40e3cd6 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> 
> <snip>
> 
> > @@ -1767,30 +1801,276 @@ xfs_inode_mark_reclaimable(
> >  		ASSERT(0);
> >  	}
> >  
> > +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> > +	spin_lock(&pag->pag_ici_lock);
> > +	spin_lock(&ip->i_flags_lock);
> > +
> > +	trace_xfs_inode_set_reclaimable(ip);
> > +	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> > +	ip->i_flags |= XFS_IRECLAIMABLE;
> > +	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> > +			XFS_ICI_RECLAIM_TAG);
> > +
> > +	spin_unlock(&ip->i_flags_lock);
> > +	spin_unlock(&pag->pag_ici_lock);
> > +	xfs_perag_put(pag);
> > +}
> > +
> > +/*
> > + * Free all speculative preallocations and possibly even the inode itself.
> > + * This is the last chance to make changes to an otherwise unreferenced file
> > + * before incore reclamation happens.
> > + */
> > +static void
> > +xfs_inodegc_inactivate(
> > +	struct xfs_inode	*ip)
> > +{
> > +	struct xfs_mount        *mp = ip->i_mount;
> > +
> > +	/*
> > +	* Inactivation isn't supposed to run when the fs is frozen because
> > +	* we don't want kernel threads to block on transaction allocation.
> > +	*/
> > +	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
> > +
> 
> I solved the problems Dave was complaining about (g/390, x/517) by
> removing this ASSERT.
> 
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 19260291ff8b..bd8abb50b33a 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -157,6 +157,48 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
> >  DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
> >  DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
> >  
> > +#define XFS_STATE_FLAGS \
> > +	{ (1UL << XFS_STATE_INODEGC_ENABLED),		"inodegc" }
> 
> I've also changed the name of this to XFS_OPSTATE_STRINGS because we use
> _STRINGS everywhere else in this file.

FWIW, can we define this with the definition of the OPSTATE
variables in xfs_mount.h? That makes it much easier to keep up to
date when we add new flags because it's obvious that there are
tracing flags that also need to be updated when we add a new state
flag...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues
  2021-08-07 21:49     ` Dave Chinner
@ 2021-08-09 23:36       ` Darrick J. Wong
  0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-08-09 23:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Chinner, linux-xfs, hch

On Sun, Aug 08, 2021 at 07:49:00AM +1000, Dave Chinner wrote:
> On Fri, Aug 06, 2021 at 05:21:04PM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 04, 2021 at 07:06:50PM -0700, Darrick J. Wong wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > 
> > <megasnip> A couple of minor changes that aren't worth reposting the
> > entire series:
> > 
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index b9214733d0c3..fedfa40e3cd6 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > 
> > <snip>
> > 
> > > @@ -1767,30 +1801,276 @@ xfs_inode_mark_reclaimable(
> > >  		ASSERT(0);
> > >  	}
> > >  
> > > +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> > > +	spin_lock(&pag->pag_ici_lock);
> > > +	spin_lock(&ip->i_flags_lock);
> > > +
> > > +	trace_xfs_inode_set_reclaimable(ip);
> > > +	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> > > +	ip->i_flags |= XFS_IRECLAIMABLE;
> > > +	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> > > +			XFS_ICI_RECLAIM_TAG);
> > > +
> > > +	spin_unlock(&ip->i_flags_lock);
> > > +	spin_unlock(&pag->pag_ici_lock);
> > > +	xfs_perag_put(pag);
> > > +}
> > > +
> > > +/*
> > > + * Free all speculative preallocations and possibly even the inode itself.
> > > + * This is the last chance to make changes to an otherwise unreferenced file
> > > + * before incore reclamation happens.
> > > + */
> > > +static void
> > > +xfs_inodegc_inactivate(
> > > +	struct xfs_inode	*ip)
> > > +{
> > > +	struct xfs_mount        *mp = ip->i_mount;
> > > +
> > > +	/*
> > > +	* Inactivation isn't supposed to run when the fs is frozen because
> > > +	* we don't want kernel threads to block on transaction allocation.
> > > +	*/
> > > +	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
> > > +
> > 
> > I solved the problems Dave was complaining about (g/390, x/517) by
> > removing this ASSERT.
> > 
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index 19260291ff8b..bd8abb50b33a 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -157,6 +157,48 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
> > >  DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
> > >  DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
> > >  
> > > +#define XFS_STATE_FLAGS \
> > > +	{ (1UL << XFS_STATE_INODEGC_ENABLED),		"inodegc" }
> > 
> > I've also changed the name of this to XFS_OPSTATE_STRINGS because we use
> > _STRINGS everywhere else in this file.
> 
> FWIW, can we define this with the definition of the OPSTATE
> variables in xfs_mount.h? THat makes it much easier to keep up to
> date when we add new flags because it's obvious that there are
> tracing flags that also need to be updated when we add a new state
> flag...

Yeah, I'll move the _STRINGS definition and clean up all these names to
have 'OPSTATE' instead of just 'STATE' too.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-08-09 23:36 UTC | newest]

Thread overview: 32+ messages
-- links below jump to the message on this page --
2021-08-05  2:06 [PATCHSET v9 00/14] xfs: deferred inode inactivation Darrick J. Wong
2021-08-05  2:06 ` [PATCH 01/14] xfs: introduce CPU hotplug infrastructure Darrick J. Wong
2021-08-05  2:06 ` [PATCH 02/14] xfs: introduce all-mounts list for cpu hotplug notifications Darrick J. Wong
2021-08-05  2:06 ` [PATCH 03/14] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
2021-08-05  5:29   ` Dave Chinner
2021-08-05  2:06 ` [PATCH 04/14] xfs: detach dquots from inode if we don't need to inactivate it Darrick J. Wong
2021-08-05  5:30   ` Dave Chinner
2021-08-05  2:06 ` [PATCH 05/14] xfs: per-cpu deferred inode inactivation queues Darrick J. Wong
2021-08-05  6:43   ` Dave Chinner
2021-08-05  7:00     ` Darrick J. Wong
2021-08-05 22:15       ` Dave Chinner
2021-08-05 22:38         ` Darrick J. Wong
2021-08-07  0:21   ` Darrick J. Wong
2021-08-07 21:49     ` Dave Chinner
2021-08-09 23:36       ` Darrick J. Wong
2021-08-05  2:06 ` [PATCH 06/14] xfs: queue inactivation immediately when free space is tight Darrick J. Wong
2021-08-05  5:31   ` Dave Chinner
2021-08-05  2:07 ` [PATCH 07/14] xfs: queue inactivation immediately when quota is nearing enforcement Darrick J. Wong
2021-08-05  5:35   ` Dave Chinner
2021-08-05  2:07 ` [PATCH 08/14] xfs: queue inactivation immediately when free realtime extents are tight Darrick J. Wong
2021-08-05  5:36   ` Dave Chinner
2021-08-05  2:07 ` [PATCH 09/14] xfs: inactivate inodes any time we try to free speculative preallocations Darrick J. Wong
2021-08-05  5:36   ` Dave Chinner
2021-08-05  2:07 ` [PATCH 10/14] xfs: flush inode inactivation work when compiling usage statistics Darrick J. Wong
2021-08-05  5:38   ` Dave Chinner
2021-08-05  2:07 ` [PATCH 11/14] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
2021-08-05  5:40   ` Dave Chinner
2021-08-05  2:07 ` [PATCH 12/14] xfs: use background worker pool when transactions can't get free space Darrick J. Wong
2021-08-05  5:42   ` Dave Chinner
2021-08-05  2:07 ` [PATCH 13/14] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
2021-08-05  2:07 ` [PATCH 14/14] xfs: throttle inode inactivation queuing on memory reclaim Darrick J. Wong
2021-08-05  5:44   ` Dave Chinner
