* [PATCHSET v8 00/20] xfs: deferred inode inactivation
@ 2021-07-29 18:43 Darrick J. Wong
  2021-07-29 18:43 ` [PATCH 01/20] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
                   ` (20 more replies)
  0 siblings, 21 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:43 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, david, hch

Hi all,

This patch series implements deferred inode inactivation.  Inactivation
is what happens when an open file loses its last incore reference: if
the file has speculative preallocations, they must be freed; and if the
file is unlinked, all forks must be truncated and the inode marked free
in the inode chunk and the inode btrees.

Currently, all of this activity is performed in frontend threads when
the last in-memory reference is lost and/or the vfs decides to drop the
inode.  Three complaints stem from this behavior: first, that the time
to unlink (in the worst case) depends on both the complexity of the
directory and the number of extents in that file; second,
that deleting a directory tree is inefficient and seeky because we free
the inodes in readdir order, not disk order; and third, the upcoming
online repair feature needs to be able to xfs_irele while scanning a
filesystem in transaction context.  It cannot perform inode inactivation
in this context because xfs does not support nested transactions.

The implementation will be familiar to those who have studied how XFS
scans for reclaimable in-core inodes -- we create a couple more inode
state flags to mark an inode as needing inactivation and being in the
middle of inactivation.  When inodes need inactivation, we set
NEED_INACTIVE in iflags, set the INODEGC radix tree tag, and schedule a
deferred work item.  The deferred worker runs in an unbounded workqueue,
scanning the inode radix tree for tagged inodes to inactivate, and
performing all the on-disk metadata updates.  Once the inode has been
inactivated, it is left in the reclaim state and the background reclaim
worker (or direct reclaim) will get to it eventually.
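
To make that flow concrete, here is a condensed sketch of the queueing
path as it looks by the end of patch 3; tracepoints, dquot handling, the
delalloc checks, and the throttling hook are elided, so read this as an
illustration rather than the literal patch:

	void
	xfs_inode_mark_reclaimable(
		struct xfs_inode	*ip)
	{
		struct xfs_mount	*mp = ip->i_mount;
		struct xfs_perag	*pag;
		unsigned int		tag;
		bool			need_inactive = xfs_inode_needs_inactive(ip);

		pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
		spin_lock(&pag->pag_ici_lock);
		spin_lock(&ip->i_flags_lock);

		if (need_inactive) {
			/* Defer the on-disk updates to the inodegc worker. */
			ip->i_flags |= XFS_NEED_INACTIVE;
			tag = XFS_ICI_INODEGC_TAG;
		} else {
			/* Nothing to do on disk; go straight to reclaim. */
			ip->i_flags |= XFS_IRECLAIMABLE;
			tag = XFS_ICI_RECLAIM_TAG;
		}

		/* Tag the perag tree and kick the background worker. */
		xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);

		spin_unlock(&ip->i_flags_lock);
		spin_unlock(&pag->pag_ici_lock);
		xfs_perag_put(pag);
	}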

Doing the inactivations from kernel threads solves the first problem by
constraining the amount of work done by the unlink() call to removing
the directory entry.  It solves the third problem by moving inactivation
to a separate process.  Because the inactivations are done in order of
inode number, we solve the second problem by performing updates in (we
hope) disk order.  This also decreases the amount of time it takes to
let go of an inode cluster if we're deleting entire directory trees.

There are three big warts I can think of in this series: first, because
the actual freeing of nlink==0 inodes is now done in the background,
the system will be busy making metadata updates for some
time after the unlink() call returns.  This temporarily reduces
available iops.  Second, in order to retain the behavior that deleting
100TB of unshared data should result in a free space gain of 100TB, the
statvfs and quota reporting ioctls wait for inactivation to finish,
which increases the long tail latency of those calls.  This behavior is,
unfortunately, key to not introducing regressions in fstests.  The third
problem is that the deferrals keep memory usage higher for longer and
reduce our opportunities to throttle the frontend when metadata load is
heavy, and that the unbounded workqueues can create transaction storms.
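
For the statvfs/quota wait, the reporting paths end up flushing the
queue before sampling the counters.  A minimal sketch, assuming the hook
is simply a call to xfs_inodegc_flush() at the top of the statfs handler
(the real hook sites are added by later patches in this series):

	STATIC int
	xfs_fs_statfs(
		struct dentry		*dentry,
		struct kstatfs		*statp)
	{
		struct xfs_mount	*mp = XFS_M(dentry->d_sb);

		/*
		 * Wait for pending inactivations so that space freed by
		 * recent unlink() calls shows up in the counters below.
		 * This is where the extra long-tail latency comes from.
		 */
		xfs_inodegc_flush(mp);

		/* ... existing code that fills in *statp ... */
		return 0;
	}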

v1-v2: NYE patchbombs
v3: rebase against 5.12-rc2 for submission.
v4: combine the can/has eofblocks predicates, clean up incore inode tree
    walks, fix inobt deadlock
v5: actually freeze the inode gc threads when we freeze the filesystem,
    consolidate the code that deals with inode tagging, and use
    foreground inactivation during quotaoff to avoid cycling dquots
v6: rebase to 5.13-rc4, fix quotaoff not to require foreground inactivation,
    refactor to use inode walk goals, use atomic bitflags to control the
    scheduling of gc workers
v7: simplify the inodegc worker, which simplifies how flushes work, break
    up the patch into smaller pieces, flush inactive inodes on syncfs to
    simplify freeze/ro-remount handling, separate inode selection filtering
    in iget, refactor inode recycling further, change gc delay to 100ms,
    decrease the gc delay when space or quota are low, move most of the
    destroy_inode logic to mark_reclaimable, get rid of the fallocate flush
    scan thing, get rid of polled flush mode
v8: rebase against 5.14-rc2, hook the memory shrinkers so that we requeue
    inactivation immediately when memory starts to get tight and force
    callers queueing inodes for inactivation to wait for the inactivation
    workers to run (i.e. throttling the frontend) to reduce memory storms,
    add hch's quotaoff removal series as a dependency to shut down arguments
    about quota walks

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15
---
 Documentation/admin-guide/xfs.rst |   10 
 fs/xfs/libxfs/xfs_ag.c            |   13 +
 fs/xfs/libxfs/xfs_ag.h            |   13 +
 fs/xfs/scrub/common.c             |   10 
 fs/xfs/xfs_dquot.h                |   10 
 fs/xfs/xfs_globals.c              |    5 
 fs/xfs/xfs_icache.c               |  848 ++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h               |    7 
 fs/xfs/xfs_inode.c                |   53 ++
 fs/xfs/xfs_inode.h                |   21 +
 fs/xfs/xfs_itable.c               |   42 ++
 fs/xfs/xfs_iwalk.c                |   33 +
 fs/xfs/xfs_linux.h                |    1 
 fs/xfs/xfs_log_recover.c          |    7 
 fs/xfs/xfs_mount.c                |   42 ++
 fs/xfs/xfs_mount.h                |   29 +
 fs/xfs/xfs_qm_syscalls.c          |    8 
 fs/xfs/xfs_super.c                |  110 +++--
 fs/xfs/xfs_sysctl.c               |    9 
 fs/xfs/xfs_sysctl.h               |    1 
 fs/xfs/xfs_trace.h                |  295 +++++++++++++
 fs/xfs/xfs_trans.c                |    5 
 22 files changed, 1463 insertions(+), 109 deletions(-)



* [PATCH 01/20] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-07-29 18:43 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 02/20] xfs: detach dquots from inode if we don't need to inactivate it Darrick J. Wong
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:43 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Move the xfs_inactive call and all the other debugging checks and stats
updates into xfs_inode_mark_reclaimable because most of those are
implementation details of the inode cache.  This is preparation for the
deferred inactivation work that follows.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_super.c  |   50 --------------------------------------------------
 2 files changed, 49 insertions(+), 50 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 086a88b8dfdb..7bc2690da87d 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -292,6 +292,32 @@ xfs_perag_clear_inode_tag(
 	trace_xfs_perag_clear_inode_tag(mp, pag->pag_agno, tag, _RET_IP_);
 }
 
+#ifdef DEBUG
+static void
+xfs_check_delalloc(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	struct xfs_bmbt_irec	got;
+	struct xfs_iext_cursor	icur;
+
+	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
+		return;
+	do {
+		if (isnullstartblock(got.br_startblock)) {
+			xfs_warn(ip->i_mount,
+	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
+				ip->i_ino,
+				whichfork == XFS_DATA_FORK ? "data" : "cow",
+				got.br_startoff, got.br_blockcount);
+		}
+	} while (xfs_iext_next_extent(ifp, &icur, &got));
+}
+#else
+#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
+#endif
+
 /*
  * We set the inode flag atomically with the radix tree tag.
  * Once we get tag lookups on the radix tree, this inode flag
@@ -304,6 +330,29 @@ xfs_inode_mark_reclaimable(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
 
+	xfs_inactive(ip);
+
+	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
+		xfs_check_delalloc(ip, XFS_DATA_FORK);
+		xfs_check_delalloc(ip, XFS_COW_FORK);
+		ASSERT(0);
+	}
+
+	XFS_STATS_INC(mp, vn_reclaim);
+
+	/*
+	 * We should never get here with one of the reclaim flags already set.
+	 */
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+
+	/*
+	 * We always use background reclaim here because even if the inode is
+	 * clean, it still may be under IO and hence we have wait for IO
+	 * completion to occur before we can reclaim the inode. The background
+	 * reclaim path handles this more efficiently than we can here, so
+	 * simply let background reclaim tear down all inodes.
+	 */
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
 	spin_lock(&ip->i_flags_lock);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 36fc81e52dc2..ef89a9a3ba9e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -591,32 +591,6 @@ xfs_fs_alloc_inode(
 	return NULL;
 }
 
-#ifdef DEBUG
-static void
-xfs_check_delalloc(
-	struct xfs_inode	*ip,
-	int			whichfork)
-{
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
-	struct xfs_bmbt_irec	got;
-	struct xfs_iext_cursor	icur;
-
-	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
-		return;
-	do {
-		if (isnullstartblock(got.br_startblock)) {
-			xfs_warn(ip->i_mount,
-	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
-				ip->i_ino,
-				whichfork == XFS_DATA_FORK ? "data" : "cow",
-				got.br_startoff, got.br_blockcount);
-		}
-	} while (xfs_iext_next_extent(ifp, &icur, &got));
-}
-#else
-#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
-#endif
-
 /*
  * Now that the generic code is guaranteed not to be accessing
  * the linux inode, we can inactivate and reclaim the inode.
@@ -632,30 +606,6 @@ xfs_fs_destroy_inode(
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	XFS_STATS_INC(ip->i_mount, vn_rele);
 	XFS_STATS_INC(ip->i_mount, vn_remove);
-
-	xfs_inactive(ip);
-
-	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
-		xfs_check_delalloc(ip, XFS_DATA_FORK);
-		xfs_check_delalloc(ip, XFS_COW_FORK);
-		ASSERT(0);
-	}
-
-	XFS_STATS_INC(ip->i_mount, vn_reclaim);
-
-	/*
-	 * We should never get here with one of the reclaim flags already set.
-	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
-
-	/*
-	 * We always use background reclaim here because even if the inode is
-	 * clean, it still may be under IO and hence we have wait for IO
-	 * completion to occur before we can reclaim the inode. The background
-	 * reclaim path handles this more efficiently than we can here, so
-	 * simply let background reclaim tear down all inodes.
-	 */
 	xfs_inode_mark_reclaimable(ip);
 }
 



* [PATCH 02/20] xfs: detach dquots from inode if we don't need to inactivate it
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
  2021-07-29 18:43 ` [PATCH 01/20] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 03/20] xfs: defer inode inactivation to a workqueue Darrick J. Wong
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

If we don't need to inactivate an inode, we can detach the dquots and
move on to reclamation.  This isn't strictly required here; it's a
preparation patch for deferred inactivation per reviewer request[1] to
move the creation of xfs_inode_needs_inactive into a separate
change.  Eventually this !need_inactive chunk will turn into the code
path for inodes that skip xfs_inactive and go straight to memory
reclaim.

[1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    8 +++++++-
 fs/xfs/xfs_inode.c  |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h  |    2 ++
 3 files changed, 62 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7bc2690da87d..709507cc83ae 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -329,8 +329,14 @@ xfs_inode_mark_reclaimable(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
+	bool			need_inactive = xfs_inode_needs_inactive(ip);
 
-	xfs_inactive(ip);
+	if (!need_inactive) {
+		/* Going straight to reclaim, so drop the dquots. */
+		xfs_qm_dqdetach(ip);
+	} else {
+		xfs_inactive(ip);
+	}
 
 	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 990b72ae3635..3c6ce1f6f643 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1654,6 +1654,59 @@ xfs_inactive_ifree(
 	return 0;
 }
 
+/*
+ * Returns true if we need to update the on-disk metadata before we can free
+ * the memory used by this inode.  Updates include freeing post-eof
+ * preallocations; freeing COW staging extents; and marking the inode free in
+ * the inobt if it is on the unlinked list.
+ */
+bool
+xfs_inode_needs_inactive(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*cow_ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+
+	/*
+	 * If the inode is already free, then there can be nothing
+	 * to clean up here.
+	 */
+	if (VFS_I(ip)->i_mode == 0)
+		return false;
+
+	/* If this is a read-only mount, don't do this (would generate I/O) */
+	if (mp->m_flags & XFS_MOUNT_RDONLY)
+		return false;
+
+	/* If the log isn't running, push inodes straight to reclaim. */
+	if (XFS_FORCED_SHUTDOWN(mp) || (mp->m_flags & XFS_MOUNT_NORECOVERY))
+		return false;
+
+	/* Metadata inodes require explicit resource cleanup. */
+	if (xfs_is_metadata_inode(ip))
+		return false;
+
+	/* Want to clean out the cow blocks if there are any. */
+	if (cow_ifp && cow_ifp->if_bytes > 0)
+		return true;
+
+	/* Unlinked files must be freed. */
+	if (VFS_I(ip)->i_nlink == 0)
+		return true;
+
+	/*
+	 * This file isn't being freed, so check if there are post-eof blocks
+	 * to free.  @force is true because we are evicting an inode from the
+	 * cache.  Post-eof blocks must be freed, lest we end up with broken
+	 * free space accounting.
+	 *
+	 * Note: don't bother with iolock here since lockdep complains about
+	 * acquiring it in reclaim context. We have the only reference to the
+	 * inode at this point anyways.
+	 */
+	return xfs_can_free_eofblocks(ip, true);
+}
+
 /*
  * xfs_inactive
  *
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 4b6703dbffb8..e3137bbc7b14 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -493,6 +493,8 @@ extern struct kmem_zone	*xfs_inode_zone;
 /* The default CoW extent size hint. */
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
+bool xfs_inode_needs_inactive(struct xfs_inode *ip);
+
 int xfs_iunlink_init(struct xfs_perag *pag);
 void xfs_iunlink_destroy(struct xfs_perag *pag);
 



* [PATCH 03/20] xfs: defer inode inactivation to a workqueue
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
  2021-07-29 18:43 ` [PATCH 01/20] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 02/20] xfs: detach dquots from inode if we don't need to inactivate it Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-30  4:24   ` Dave Chinner
  2021-08-03  8:34   ` [PATCH, alternative] xfs: per-cpu deferred inode inactivation queues Dave Chinner
  2021-07-29 18:44 ` [PATCH 04/20] xfs: throttle inode inactivation queuing on memory reclaim Darrick J. Wong
                   ` (17 subsequent siblings)
  20 siblings, 2 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
defer the inactivation phase to a separate workqueue.  With this change,
we can speed up directory tree deletions by reducing the work done
during unlink() calls to just the directory and unlinked list updates.

By moving the inactivation work to the background, we can reduce the
total cost of deleting a lot of files by performing the file deletions
in disk order instead of directory entry order, which can be arbitrary.

We introduce two new inode flags -- NEED_INACTIVE and INACTIVATING.
The first flag helps our worker find inodes needing inactivation, and
the second flag marks inodes that are in the process of being
inactivated.  A concurrent xfs_iget on the inode can still resurrect the
inode by clearing NEED_INACTIVE (or bailing if INACTIVATING is set).
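
Condensed from the xfs_iget_cache_hit() hunk below, the lookup side
handles a queued inode roughly like this (i_flags_lock held; error
handling simplified):

	/* Skip inodes that another thread is already recycling or reaping. */
	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING))
		goto out_skip;

	/* Unlinked inodes cannot be re-grabbed; they are on their way out. */
	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE)) {
		error = -ENOENT;
		goto out_error;
	}

	/* Otherwise recycle the inode, which clears the inactivation state. */
	if (ip->i_flags & (XFS_IRECLAIMABLE | XFS_NEED_INACTIVE))
		error = xfs_iget_recycle(pag, ip);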

Unfortunately, deferring the inactivation has one huge downside --
eventual consistency.  Since all the freeing is deferred to a worker
thread, one can rm a file but the space doesn't come back immediately.
This can cause some odd side effects with quota accounting and statfs,
so we flush inactivation work during syncfs in order to maintain the
existing behaviors, at least for callers that unlink() and sync().

For this patch we'll set the delay to zero to mimic the old timing as
much as possible; in the next patch we'll play with different delay
settings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/admin-guide/xfs.rst |    3 
 fs/xfs/scrub/common.c             |    7 +
 fs/xfs/xfs_icache.c               |  306 ++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_icache.h               |    5 +
 fs/xfs/xfs_inode.h                |   19 ++
 fs/xfs/xfs_log_recover.c          |    7 +
 fs/xfs/xfs_mount.c                |   26 +++
 fs/xfs/xfs_mount.h                |   21 +++
 fs/xfs/xfs_super.c                |   53 ++++++
 fs/xfs/xfs_trace.h                |   68 ++++++++
 10 files changed, 488 insertions(+), 27 deletions(-)


diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index 8de008c0c5ad..f9b109bfc6a6 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -524,7 +524,8 @@ and the short name of the data device.  They all can be found in:
                   mount time quotacheck.
   xfs-gc          Background garbage collection of disk space that have been
                   speculatively allocated beyond EOF or for staging copy on
-                  write operations.
+                  write operations; and files that are no longer linked into
+                  the directory tree.
 ================  ===========
 
 For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 8558ca05e11d..06b697f72f23 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -884,6 +884,7 @@ xchk_stop_reaping(
 {
 	sc->flags |= XCHK_REAPING_DISABLED;
 	xfs_blockgc_stop(sc->mp);
+	xfs_inodegc_stop(sc->mp);
 }
 
 /* Restart background reaping of resources. */
@@ -891,6 +892,12 @@ void
 xchk_start_reaping(
 	struct xfs_scrub	*sc)
 {
+	/*
+	 * Readonly filesystems do not perform inactivation, so there's no
+	 * need to restart the worker.
+	 */
+	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY))
+		xfs_inodegc_start(sc->mp);
 	xfs_blockgc_start(sc->mp);
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 709507cc83ae..e97404d2f63a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -32,6 +32,8 @@
 #define XFS_ICI_RECLAIM_TAG	0
 /* Inode has speculative preallocations (posteof or cow) to clean. */
 #define XFS_ICI_BLOCKGC_TAG	1
+/* Inode can be inactivated. */
+#define XFS_ICI_INODEGC_TAG	2
 
 /*
  * The goal for walking incore inodes.  These can correspond with incore inode
@@ -41,6 +43,7 @@ enum xfs_icwalk_goal {
 	/* Goals directly associated with tagged inodes. */
 	XFS_ICWALK_BLOCKGC	= XFS_ICI_BLOCKGC_TAG,
 	XFS_ICWALK_RECLAIM	= XFS_ICI_RECLAIM_TAG,
+	XFS_ICWALK_INODEGC	= XFS_ICI_INODEGC_TAG,
 };
 
 #define XFS_ICWALK_NULL_TAG	(-1U)
@@ -219,6 +222,26 @@ xfs_blockgc_queue(
 	rcu_read_unlock();
 }
 
+/*
+ * Queue a background inactivation worker if there are inodes that need to be
+ * inactivated and higher level xfs code hasn't disabled the background
+ * workers.
+ */
+static void
+xfs_inodegc_queue(
+	struct xfs_mount        *mp)
+{
+	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
+	rcu_read_lock();
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
+		trace_xfs_inodegc_queue(mp, 0);
+		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
+	}
+	rcu_read_unlock();
+}
+
 /* Set a tag on both the AG incore inode tree and the AG radix tree. */
 static void
 xfs_perag_set_inode_tag(
@@ -253,6 +276,9 @@ xfs_perag_set_inode_tag(
 	case XFS_ICI_BLOCKGC_TAG:
 		xfs_blockgc_queue(pag);
 		break;
+	case XFS_ICI_INODEGC_TAG:
+		xfs_inodegc_queue(mp);
+		break;
 	}
 
 	trace_xfs_perag_set_inode_tag(mp, pag->pag_agno, tag, _RET_IP_);
@@ -329,28 +355,27 @@ xfs_inode_mark_reclaimable(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
-	bool			need_inactive = xfs_inode_needs_inactive(ip);
+	unsigned int		tag;
+	bool			need_inactive;
 
+	need_inactive = xfs_inode_needs_inactive(ip);
 	if (!need_inactive) {
 		/* Going straight to reclaim, so drop the dquots. */
 		xfs_qm_dqdetach(ip);
-	} else {
-		xfs_inactive(ip);
-	}
 
-	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
-		xfs_check_delalloc(ip, XFS_DATA_FORK);
-		xfs_check_delalloc(ip, XFS_COW_FORK);
-		ASSERT(0);
+		if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
+			xfs_check_delalloc(ip, XFS_DATA_FORK);
+			xfs_check_delalloc(ip, XFS_COW_FORK);
+			ASSERT(0);
+		}
 	}
 
 	XFS_STATS_INC(mp, vn_reclaim);
 
 	/*
-	 * We should never get here with one of the reclaim flags already set.
+	 * We should never get here with any of the reclaim flags already set.
 	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
 
 	/*
 	 * We always use background reclaim here because even if the inode is
@@ -363,13 +388,30 @@ xfs_inode_mark_reclaimable(
 	spin_lock(&pag->pag_ici_lock);
 	spin_lock(&ip->i_flags_lock);
 
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+	if (need_inactive) {
+		trace_xfs_inode_set_need_inactive(ip);
+		ip->i_flags |= XFS_NEED_INACTIVE;
+		tag = XFS_ICI_INODEGC_TAG;
+	} else {
+		trace_xfs_inode_set_reclaimable(ip);
+		ip->i_flags |= XFS_IRECLAIMABLE;
+		tag = XFS_ICI_RECLAIM_TAG;
+	}
+
+	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
+
+	/*
+	 * Wait for the background inodegc worker if it's running so that the
+	 * frontend can't overwhelm the background workers with inodes and OOM
+	 * the machine.  We'll improve this with feedback from the rest of the
+	 * system in subsequent patches.
+	 */
+	if (need_inactive && flush_work(&mp->m_inodegc_work.work))
+		trace_xfs_inodegc_throttled(mp, __return_address);
 }
 
 static inline void
@@ -433,6 +475,7 @@ xfs_iget_recycle(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
+	unsigned int		tag;
 	int			error;
 
 	trace_xfs_iget_recycle(ip);
@@ -443,7 +486,16 @@ xfs_iget_recycle(
 	 * the inode.  We can't clear the radix tree tag yet as it requires
 	 * pag_ici_lock to be held exclusive.
 	 */
-	ip->i_flags |= XFS_IRECLAIM;
+	if (ip->i_flags & XFS_IRECLAIMABLE) {
+		tag = XFS_ICI_RECLAIM_TAG;
+		ip->i_flags |= XFS_IRECLAIM;
+	} else if (ip->i_flags & XFS_NEED_INACTIVE) {
+		tag = XFS_ICI_INODEGC_TAG;
+		ip->i_flags |= XFS_INACTIVATING;
+	} else {
+		ASSERT(0);
+		return -EINVAL;
+	}
 
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
@@ -460,10 +512,10 @@ xfs_iget_recycle(
 		rcu_read_lock();
 		spin_lock(&ip->i_flags_lock);
 		wake = !!__xfs_iflags_test(ip, XFS_INEW);
-		ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
+		ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING);
 		if (wake)
 			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
-		ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
+		ASSERT(ip->i_flags & (XFS_IRECLAIMABLE | XFS_NEED_INACTIVE));
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 
@@ -481,8 +533,7 @@ xfs_iget_recycle(
 	 */
 	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 	ip->i_flags |= XFS_INEW;
-	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
+	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
 	inode->i_state = I_NEW;
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -566,9 +617,15 @@ xfs_iget_cache_hit(
 	 *	     wait_on_inode to wait for these flags to be cleared
 	 *	     instead of polling for it.
 	 */
-	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM))
+	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING))
 		goto out_skip;
 
+	/* Unlinked inodes cannot be re-grabbed. */
+	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE)) {
+		error = -ENOENT;
+		goto out_error;
+	}
+
 	/*
 	 * Check the inode free state is valid. This also detects lookup
 	 * racing with unlinks.
@@ -579,11 +636,11 @@ xfs_iget_cache_hit(
 
 	/* Skip inodes that have no vfs state. */
 	if ((flags & XFS_IGET_INCORE) &&
-	    (ip->i_flags & XFS_IRECLAIMABLE))
+	    (ip->i_flags & (XFS_IRECLAIMABLE | XFS_NEED_INACTIVE)))
 		goto out_skip;
 
 	/* The inode fits the selection criteria; process it. */
-	if (ip->i_flags & XFS_IRECLAIMABLE) {
+	if (ip->i_flags & (XFS_IRECLAIMABLE | XFS_NEED_INACTIVE)) {
 		/* Drops i_flags_lock and RCU read lock. */
 		error = xfs_iget_recycle(pag, ip);
 		if (error)
@@ -943,6 +1000,7 @@ xfs_reclaim_inode(
 
 	xfs_iflags_clear(ip, XFS_IFLUSHING);
 reclaim:
+	trace_xfs_inode_reclaiming(ip);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1420,6 +1478,8 @@ xfs_blockgc_start(
 
 /* Don't try to run block gc on an inode that's in any of these states. */
 #define XFS_BLOCKGC_NOGRAB_IFLAGS	(XFS_INEW | \
+					 XFS_NEED_INACTIVE | \
+					 XFS_INACTIVATING | \
 					 XFS_IRECLAIMABLE | \
 					 XFS_IRECLAIM)
 /*
@@ -1580,6 +1640,203 @@ xfs_blockgc_free_quota(
 			xfs_inode_dquot(ip, XFS_DQTYPE_PROJ), iwalk_flags);
 }
 
+/*
+ * Inode Inactivation and Reclamation
+ * ==================================
+ *
+ * Sometimes, inodes need to have work done on them once the last program has
+ * closed the file.  Typically this means cleaning out any leftover speculative
+ * preallocations after EOF or in the CoW fork.  For inodes that have been
+ * totally unlinked, this means unmapping data/attr/cow blocks, removing the
+ * inode from the unlinked buckets, and marking it free in the inobt and inode
+ * table.
+ *
+ * This process can generate many metadata updates, which shows up as close()
+ * and unlink() calls that take a long time.  We defer all that work to a
+ * workqueue which means that we can batch a lot of work and do it in inode
+ * order for better performance.  Furthermore, we can control the workqueue,
+ * which means that we can avoid doing inactivation work at a bad time, such as
+ * when the fs is frozen.
+ *
+ * Deferred inactivation introduces new inode flag states (NEED_INACTIVE and
+ * INACTIVATING) and adds a new INODEGC radix tree tag for fast access.  We
+ * maintain separate perag counters for both types, and move counts as inodes
+ * wander the state machine, which now works as follows:
+ *
+ * If the inode needs inactivation, we:
+ *   - Set the NEED_INACTIVE inode flag
+ *   - Schedule background inode inactivation
+ *
+ * If the inode does not need inactivation, we:
+ *   - Set the IRECLAIMABLE inode flag
+ *   - Schedule background inode reclamation
+ *
+ * When it is time to inactivate the inode, we:
+ *   - Set the INACTIVATING inode flag
+ *   - Make all the on-disk updates
+ *   - Clear the inactive state and set the IRECLAIMABLE inode flag
+ *   - Schedule background inode reclamation
+ *
+ * When it is time to reclaim the inode, we:
+ *   - Set the IRECLAIM inode flag
+ *   - Reclaim the inode and RCU free it
+ *
+ * When these state transitions occur, the caller must have taken the per-AG
+ * incore inode tree lock and then the inode i_flags lock, in that order.
+ */
+
+/*
+ * Decide if the given @ip is eligible for inactivation, and grab it if so.
+ * Returns true if it's ready to go or false if we should just ignore it.
+ *
+ * Skip inodes that don't need inactivation or are being inactivated (or
+ * recycled) by another thread.  Inodes should not be tagged for inactivation
+ * while also in INEW or any reclaim state.
+ *
+ * Otherwise, mark this inode as being inactivated even if the fs is shut down
+ * because we need xfs_inodegc_inactivate to push this inode into the reclaim
+ * state.
+ */
+static bool
+xfs_inodegc_igrab(
+	struct xfs_inode	*ip)
+{
+	bool			ret = false;
+
+	ASSERT(rcu_read_lock_held());
+
+	/* Check for stale RCU freed inode */
+	spin_lock(&ip->i_flags_lock);
+	if (!ip->i_ino)
+		goto out_unlock_noent;
+
+	if ((ip->i_flags & XFS_NEED_INACTIVE) &&
+	    !(ip->i_flags & XFS_INACTIVATING)) {
+		ret = true;
+		ip->i_flags |= XFS_INACTIVATING;
+	}
+
+out_unlock_noent:
+	spin_unlock(&ip->i_flags_lock);
+	return ret;
+}
+
+/*
+ * Free all speculative preallocations and possibly even the inode itself.
+ * This is the last chance to make changes to an otherwise unreferenced file
+ * before incore reclamation happens.
+ */
+static void
+xfs_inodegc_inactivate(
+	struct xfs_inode	*ip,
+	struct xfs_perag	*pag,
+	struct xfs_icwalk	*icw)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+
+	/*
+	 * Inactivation isn't supposed to run when the fs is frozen because
+	 * we don't want kernel threads to block on transaction allocation.
+	 */
+	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
+
+	/*
+	 * Foreground threads that have hit ENOSPC or EDQUOT are allowed to
+	 * pass in a icw structure to look for inodes to inactivate
+	 * immediately to free some resources.  If this inode isn't a match,
+	 * put it back on the shelf and move on.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	if (!xfs_icwalk_match(ip, icw)) {
+		ip->i_flags &= ~XFS_INACTIVATING;
+		spin_unlock(&ip->i_flags_lock);
+		return;
+	}
+	spin_unlock(&ip->i_flags_lock);
+
+	trace_xfs_inode_inactivating(ip);
+
+	xfs_inactive(ip);
+
+	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
+		xfs_check_delalloc(ip, XFS_DATA_FORK);
+		xfs_check_delalloc(ip, XFS_COW_FORK);
+		ASSERT(0);
+	}
+
+	/* Schedule the inactivated inode for reclaim. */
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	trace_xfs_inode_set_reclaimable(ip);
+	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
+	ip->i_flags |= XFS_IRECLAIMABLE;
+
+	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_INODEGC_TAG);
+	xfs_perag_set_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+}
+
+/* Inactivate inodes until we run out. */
+void
+xfs_inodegc_worker(
+	struct work_struct	*work)
+{
+	struct xfs_mount	*mp = container_of(to_delayed_work(work),
+					struct xfs_mount, m_inodegc_work);
+
+	/*
+	 * Inactivation never returns error codes and never fails to push a
+	 * tagged inode to reclaim.  Loop until there's nothing left.
+	 */
+	while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
+		trace_xfs_inodegc_worker(mp, __return_address);
+		xfs_icwalk(mp, XFS_ICWALK_INODEGC, NULL);
+	}
+}
+
+/*
+ * Force all currently queued inode inactivation work to run immediately, and
+ * wait for the work to finish.
+ */
+void
+xfs_inodegc_flush(
+	struct xfs_mount	*mp)
+{
+	trace_xfs_inodegc_flush(mp, __return_address);
+	flush_delayed_work(&mp->m_inodegc_work);
+}
+
+/* Disable the inode inactivation background worker and wait for it to stop. */
+void
+xfs_inodegc_stop(
+	struct xfs_mount	*mp)
+{
+	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
+	cancel_delayed_work_sync(&mp->m_inodegc_work);
+	trace_xfs_inodegc_stop(mp, __return_address);
+}
+
+/*
+ * Enable the inode inactivation background worker and schedule deferred inode
+ * inactivation work if there is any.
+ */
+void
+xfs_inodegc_start(
+	struct xfs_mount	*mp)
+{
+	if (test_and_set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
+	trace_xfs_inodegc_start(mp, __return_address);
+	xfs_inodegc_queue(mp);
+}
+
 /* XFS Inode Cache Walking Code */
 
 /*
@@ -1606,6 +1863,8 @@ xfs_icwalk_igrab(
 		return xfs_blockgc_igrab(ip);
 	case XFS_ICWALK_RECLAIM:
 		return xfs_reclaim_igrab(ip, icw);
+	case XFS_ICWALK_INODEGC:
+		return xfs_inodegc_igrab(ip);
 	default:
 		return false;
 	}
@@ -1631,6 +1890,9 @@ xfs_icwalk_process_inode(
 	case XFS_ICWALK_RECLAIM:
 		xfs_reclaim_inode(ip, pag);
 		break;
+	case XFS_ICWALK_INODEGC:
+		xfs_inodegc_inactivate(ip, pag, icw);
+		break;
 	}
 	return error;
 }
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index d0062ebb3f7a..c1dfc909a5b0 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -74,4 +74,9 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
+void xfs_inodegc_worker(struct work_struct *work);
+void xfs_inodegc_flush(struct xfs_mount *mp);
+void xfs_inodegc_stop(struct xfs_mount *mp);
+void xfs_inodegc_start(struct xfs_mount *mp);
+
 #endif
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e3137bbc7b14..fa5be0d071ad 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -240,6 +240,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
+#define XFS_NEED_INACTIVE	(1 << 10) /* see XFS_INACTIVATING below */
 /*
  * If this unlinked inode is in the middle of recovery, don't let drop_inode
  * truncate and free the inode.  This can happen if we iget the inode during
@@ -248,6 +249,21 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define XFS_IRECOVERY		(1 << 11)
 #define XFS_ICOWBLOCKS		(1 << 12)/* has the cowblocks tag set */
 
+/*
+ * If we need to update on-disk metadata before this IRECLAIMABLE inode can be
+ * freed, then NEED_INACTIVE will be set.  Once we start the updates, the
+ * INACTIVATING bit will be set to keep iget away from this inode.  After the
+ * inactivation completes, both flags will be cleared and the inode is a
+ * plain old IRECLAIMABLE inode.
+ */
+#define XFS_INACTIVATING	(1 << 13)
+
+/* All inode state flags related to inode reclaim. */
+#define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
+				 XFS_IRECLAIM | \
+				 XFS_NEED_INACTIVE | \
+				 XFS_INACTIVATING)
+
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
  * inode lookup. This prevents unintended behaviour on the new inode from
@@ -255,7 +271,8 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
-	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
+	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
+	 XFS_INACTIVATING)
 
 /*
  * Flags for inode locking.
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1721fce2ec94..a98d2429d795 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2786,6 +2786,13 @@ xlog_recover_process_iunlinks(
 		}
 		xfs_buf_rele(agibp);
 	}
+
+	/*
+	 * Flush the pending unlinked inodes to ensure that the inactivations
+	 * are fully completed on disk and the incore inodes can be reclaimed
+	 * before we signal that recovery is complete.
+	 */
+	xfs_inodegc_flush(mp);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index baf7b323cb15..1f7e9a608f38 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -514,7 +514,8 @@ xfs_check_summary_counts(
  * Flush and reclaim dirty inodes in preparation for unmount. Inodes and
  * internal inode structures can be sitting in the CIL and AIL at this point,
  * so we need to unpin them, write them back and/or reclaim them before unmount
- * can proceed.
+ * can proceed.  In other words, callers are required to have inactivated all
+ * inodes.
  *
  * An inode cluster that has been freed can have its buffer still pinned in
  * memory because the transaction is still sitting in a iclog. The stale inodes
@@ -546,6 +547,7 @@ xfs_unmount_flush_inodes(
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
 
 	xfs_ail_push_all_sync(mp->m_ail);
+	xfs_inodegc_stop(mp);
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
@@ -782,6 +784,9 @@ xfs_mountfs(
 	if (error)
 		goto out_log_dealloc;
 
+	/* Enable background inode inactivation workers. */
+	xfs_inodegc_start(mp);
+
 	/*
 	 * Get and sanity-check the root inode.
 	 * Save the pointer to it in the mount structure.
@@ -942,6 +947,15 @@ xfs_mountfs(
 	xfs_irele(rip);
 	/* Clean out dquots that might be in memory after quotacheck. */
 	xfs_qm_unmount(mp);
+
+	/*
+	 * Inactivate all inodes that might still be in memory after a log
+	 * intent recovery failure so that reclaim can free them.  Metadata
+	 * inodes and the root directory shouldn't need inactivation, but the
+	 * mount failed for some reason, so pull down all the state and flee.
+	 */
+	xfs_inodegc_flush(mp);
+
 	/*
 	 * Flush all inode reclamation work and flush the log.
 	 * We have to do this /after/ rtunmount and qm_unmount because those
@@ -989,6 +1003,16 @@ xfs_unmountfs(
 	uint64_t		resblks;
 	int			error;
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes
+	 * that the VFS evicted earlier in the unmount process.  Freeing inodes
+	 * and discarding CoW fork preallocations can cause shape changes to
+	 * the free inode and refcount btrees, respectively, so we must finish
+	 * this before we discard the metadata space reservations.  Metadata
+	 * inodes and the root directory do not require inactivation.
+	 */
+	xfs_inodegc_flush(mp);
+
 	xfs_blockgc_stop(mp);
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c78b63fe779a..dc906b78e24c 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -154,6 +154,13 @@ typedef struct xfs_mount {
 	uint8_t			m_rt_checked;
 	uint8_t			m_rt_sick;
 
+	/*
+	 * This atomic bitset controls flags that alter the behavior of the
+	 * filesystem.  Use only the atomic bit helper functions here; see
+	 * XFS_OPFLAG_* for information about the actual flags.
+	 */
+	unsigned long		m_opflags;
+
 	/*
 	 * End of read-mostly variables. Frequently written variables and locks
 	 * should be placed below this comment from now on. The first variable
@@ -184,6 +191,7 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
+	struct delayed_work	m_inodegc_work; /* background inode inactive */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -258,6 +266,19 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_DAX_ALWAYS	(1ULL << 26)
 #define XFS_MOUNT_DAX_NEVER	(1ULL << 27)
 
+/*
+ * Operation flags -- each entry here is a bit index into m_opflags and is
+ * not itself a flag value.  Use the atomic bit functions to access.
+ */
+enum xfs_opflag_bits {
+	/*
+	 * If set, background inactivation worker threads will be scheduled to
+	 * process queued inodegc work.  If not, queued inodes remain in memory
+	 * waiting to be processed.
+	 */
+	XFS_OPFLAG_INODEGC_RUNNING_BIT	= 0,
+};
+
 /*
  * Max and min values for mount-option defined I/O
  * preallocation sizes.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ef89a9a3ba9e..f8f05d1037d2 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -702,6 +702,8 @@ xfs_fs_sync_fs(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	trace_xfs_fs_sync_fs(mp, __return_address);
+
 	/*
 	 * Doing anything during the async pass would be counterproductive.
 	 */
@@ -718,6 +720,25 @@ xfs_fs_sync_fs(
 		flush_delayed_work(&mp->m_log->l_work);
 	}
 
+	/*
+	 * Flush all deferred inode inactivation work so that the free space
+	 * counters will reflect recent deletions.  Do not force the log again
+	 * because log recovery can restart the inactivation from the info that
+	 * we just wrote into the ondisk log.
+	 *
+	 * For regular operation this isn't strictly necessary since we aren't
+	 * required to guarantee that unlinking frees space immediately, but
+	 * that is how XFS historically behaved.
+	 *
+	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
+	 * last chance to complete the inactivation work before the filesystem
+	 * freezes and the log is quiesced.  The background worker will not
+	 * activate again until the fs is thawed because the VFS won't evict
+	 * any more inodes until freeze_super drops s_umount and we disable the
+	 * worker in xfs_fs_freeze.
+	 */
+	xfs_inodegc_flush(mp);
+
 	return 0;
 }
 
@@ -832,6 +853,17 @@ xfs_fs_freeze(
 	 */
 	flags = memalloc_nofs_save();
 	xfs_blockgc_stop(mp);
+
+	/*
+	 * Stop the inodegc background worker.  freeze_super already flushed
+	 * all pending inodegc work when it sync'd the filesystem after setting
+	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
+	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
+	 * If the filesystem is read-write, inactivated inodes will queue but
+	 * the worker will not run until the filesystem thaws or unmounts.
+	 */
+	xfs_inodegc_stop(mp);
+
 	xfs_save_resvblks(mp);
 	ret = xfs_log_quiesce(mp);
 	memalloc_nofs_restore(flags);
@@ -847,6 +879,14 @@ xfs_fs_unfreeze(
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
 	xfs_blockgc_start(mp);
+
+	/*
+	 * Don't reactivate the inodegc worker on a readonly filesystem because
+	 * inodes are sent directly to reclaim.
+	 */
+	if (!(mp->m_flags & XFS_MOUNT_RDONLY))
+		xfs_inodegc_start(mp);
+
 	return 0;
 }
 
@@ -1649,6 +1689,9 @@ xfs_remount_rw(
 	if (error && error != -ENOSPC)
 		return error;
 
+	/* Re-enable the background inode inactivation worker. */
+	xfs_inodegc_start(mp);
+
 	return 0;
 }
 
@@ -1671,6 +1714,15 @@ xfs_remount_ro(
 		return error;
 	}
 
+	/*
+	 * Stop the inodegc background worker.  xfs_fs_reconfigure already
+	 * flushed all pending inodegc work when it sync'd the filesystem.
+	 * The VFS holds s_umount, so we know that inodes cannot enter
+	 * xfs_fs_destroy_inode during a remount operation.  In readonly mode
+	 * we send inodes straight to reclaim, so no inodes will be queued.
+	 */
+	xfs_inodegc_stop(mp);
+
 	/* Free the per-AG metadata reservation pool. */
 	error = xfs_fs_unreserve_ag_blocks(mp);
 	if (error) {
@@ -1794,6 +1846,7 @@ static int xfs_init_fs_context(
 	mutex_init(&mp->m_growlock);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
+	INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	/*
 	 * We don't create the finobt per-ag space reservation until after log
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index f9d8d605f9b1..12ce47aebaef 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -157,6 +157,63 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
 
+DECLARE_EVENT_CLASS(xfs_fs_class,
+	TP_PROTO(struct xfs_mount *mp, void *caller_ip),
+	TP_ARGS(mp, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, mflags)
+		__field(unsigned long, opflags)
+		__field(unsigned long, sbflags)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->mflags = mp->m_flags;
+		__entry->opflags = mp->m_opflags;
+		__entry->sbflags = mp->m_super->s_flags;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d m_flags 0x%llx m_opflags 0x%lx s_flags 0x%lx caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->mflags,
+		  __entry->opflags,
+		  __entry->sbflags,
+		  __entry->caller_ip)
+);
+
+#define DEFINE_FS_EVENT(name)	\
+DEFINE_EVENT(xfs_fs_class, name,					\
+	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
+	TP_ARGS(mp, caller_ip))
+DEFINE_FS_EVENT(xfs_inodegc_flush);
+DEFINE_FS_EVENT(xfs_inodegc_start);
+DEFINE_FS_EVENT(xfs_inodegc_stop);
+DEFINE_FS_EVENT(xfs_inodegc_worker);
+DEFINE_FS_EVENT(xfs_inodegc_throttled);
+DEFINE_FS_EVENT(xfs_fs_sync_fs);
+
+DECLARE_EVENT_CLASS(xfs_gc_queue_class,
+	TP_PROTO(struct xfs_mount *mp, unsigned int delay_ms),
+	TP_ARGS(mp, delay_ms),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, delay_ms)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->delay_ms = delay_ms;
+	),
+	TP_printk("dev %d:%d delay_ms %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->delay_ms)
+);
+#define DEFINE_GC_QUEUE_EVENT(name)	\
+DEFINE_EVENT(xfs_gc_queue_class, name,	\
+	TP_PROTO(struct xfs_mount *mp, unsigned int delay_ms),	\
+	TP_ARGS(mp, delay_ms))
+DEFINE_GC_QUEUE_EVENT(xfs_inodegc_queue);
+
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
 	TP_ARGS(mp, agno),
@@ -616,14 +673,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
+		__field(unsigned long, iflags)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
+		__entry->iflags = ip->i_flags;
 	),
-	TP_printk("dev %d:%d ino 0x%llx",
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino)
+		  __entry->ino,
+		  __entry->iflags)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -667,6 +727,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
+DEFINE_INODE_EVENT(xfs_inode_reclaiming);
+DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
+DEFINE_INODE_EVENT(xfs_inode_inactivating);
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the



* [PATCH 04/20] xfs: throttle inode inactivation queuing on memory reclaim
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 03/20] xfs: defer inode inactivation to a workqueue Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 05/20] xfs: don't throttle memory reclaim trying to queue inactive inodes Darrick J. Wong
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we defer inode inactivation, we've decoupled the process of
unlinking or closing an inode from the process of inactivating it.  In
theory this should lead to better throughput since we now inactivate the
queued inodes in disk order.

Unfortunately, one of the primary risks with this decoupling is the loss
of rate control feedback between the frontend and background threads.
In other words, if a rm -rf /* thread can load inodes into cache and
schedule them for inactivation faster than we can inactivate them, we
can easily OOM the system.  Currently, we throttle frontend processes
by forcing them to flush_work the background processes.

However, this leaves plenty of performance on the table if we have
enough memory to allow for some caching of inodes.  We can relax the
coupling by only throttling processes that are trying to queue inodes
for inactivation if the system is under memory pressure.  This makes
inactivation more bursty on my system, but raises throughput.

To make this work smoothly, we register a new faux shrinker and
configure it carefully so that its scan function turns on throttling
the /second/ time the shrinker gets called by reclaim while there are
inodes queued for inactivation.

On my test VM with 960M of RAM and a 2TB filesystem, the time to delete
10 million inodes decreases from ~28 minutes to ~23 minutes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |  105 ++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_icache.h |    1 
 fs/xfs/xfs_mount.c  |    9 ++++
 fs/xfs/xfs_mount.h  |    7 +++
 fs/xfs/xfs_super.c  |    1 
 fs/xfs/xfs_trace.h  |   35 +++++++++++++++++
 6 files changed, 150 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e97404d2f63a..3e2302a44c69 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -344,6 +344,26 @@ xfs_check_delalloc(
 #define xfs_check_delalloc(ip, whichfork)	do { } while (0)
 #endif
 
+/*
+ * Decide if we're going to throttle frontend threads that are inactivating
+ * inodes so that we don't overwhelm the background workers with inodes and OOM
+ * the machine.
+ */
+static inline bool
+xfs_inodegc_want_throttle(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+
+	/* Throttle if memory reclaim anywhere has triggered us. */
+	if (atomic_read(&mp->m_inodegc_reclaim) > 0) {
+		trace_xfs_inodegc_throttle_mempressure(mp);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * We set the inode flag atomically with the radix tree tag.
  * Once we get tag lookups on the radix tree, this inode flag
@@ -357,6 +377,7 @@ xfs_inode_mark_reclaimable(
 	struct xfs_perag	*pag;
 	unsigned int		tag;
 	bool			need_inactive;
+	bool			flush_inodegc = false;
 
 	need_inactive = xfs_inode_needs_inactive(ip);
 	if (!need_inactive) {
@@ -392,6 +413,7 @@ xfs_inode_mark_reclaimable(
 		trace_xfs_inode_set_need_inactive(ip);
 		ip->i_flags |= XFS_NEED_INACTIVE;
 		tag = XFS_ICI_INODEGC_TAG;
+		flush_inodegc = xfs_inodegc_want_throttle(pag);
 	} else {
 		trace_xfs_inode_set_reclaimable(ip);
 		ip->i_flags |= XFS_IRECLAIMABLE;
@@ -404,13 +426,7 @@ xfs_inode_mark_reclaimable(
 	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
 
-	/*
-	 * Wait for the background inodegc worker if it's running so that the
-	 * frontend can't overwhelm the background workers with inodes and OOM
-	 * the machine.  We'll improve this with feedback from the rest of the
-	 * system in subsequent patches.
-	 */
-	if (need_inactive && flush_work(&mp->m_inodegc_work.work))
+	if (flush_inodegc && flush_work(&mp->m_inodegc_work.work))
 		trace_xfs_inodegc_throttled(mp, __return_address);
 }
 
@@ -1796,6 +1812,12 @@ xfs_inodegc_worker(
 		trace_xfs_inodegc_worker(mp, __return_address);
 		xfs_icwalk(mp, XFS_ICWALK_INODEGC, NULL);
 	}
+
+	/*
+	 * We inactivated all the inodes we could, so disable the throttling
+	 * of new inactivations that happens when memory gets tight.
+	 */
+	atomic_set(&mp->m_inodegc_reclaim, 0);
 }
 
 /*
@@ -1837,6 +1859,75 @@ xfs_inodegc_start(
 	xfs_inodegc_queue(mp);
 }
 
+/*
+ * Register a phony shrinker so that we can speed up background inodegc and
+ * throttle new inodegc queuing when there's memory pressure.  Inactivation
+ * does not itself free any memory but it does make inodes reclaimable, which
+ * eventually frees memory.  The count function, seek value, and batch value
+ * are crafted to trigger the scan function any time the shrinker is not being
+ * called from a background idle scan (i.e. the second time).
+ */
+#define XFS_INODEGC_SHRINK_COUNT	(1UL << DEF_PRIORITY)
+#define XFS_INODEGC_SHRINK_BATCH	(LONG_MAX)
+
+static unsigned long
+xfs_inodegc_shrink_count(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_mount	*mp;
+
+	mp = container_of(shrink, struct xfs_mount, m_inodegc_shrink);
+
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
+		return XFS_INODEGC_SHRINK_COUNT;
+
+	return 0;
+}
+
+static unsigned long
+xfs_inodegc_shrink_scan(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_mount	*mp;
+
+	/*
+	 * Inode inactivation work requires NOFS allocations, so don't make
+	 * things worse if the caller wanted a NOFS allocation.
+	 */
+	if (!(sc->gfp_mask & __GFP_FS))
+		return SHRINK_STOP;
+
+	mp = container_of(shrink, struct xfs_mount, m_inodegc_shrink);
+
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
+		trace_xfs_inodegc_requeue_mempressure(mp, sc->nr_to_scan,
+				__return_address);
+
+		atomic_inc(&mp->m_inodegc_reclaim);
+		mod_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
+	}
+
+	return 0;
+}
+
+/* Register a shrinker so we can accelerate inodegc and throttle queuing. */
+int
+xfs_inodegc_register_shrinker(
+	struct xfs_mount	*mp)
+{
+	struct shrinker		*shrink = &mp->m_inodegc_shrink;
+
+	shrink->count_objects = xfs_inodegc_shrink_count;
+	shrink->scan_objects = xfs_inodegc_shrink_scan;
+	shrink->seeks = 0;
+	shrink->flags = SHRINKER_NONSLAB;
+	shrink->batch = XFS_INODEGC_SHRINK_BATCH;
+
+	return register_shrinker(shrink);
+}
+
 /* XFS Inode Cache Walking Code */
 
 /*
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index c1dfc909a5b0..e38c8bc5461f 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -78,5 +78,6 @@ void xfs_inodegc_worker(struct work_struct *work);
 void xfs_inodegc_flush(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
+int xfs_inodegc_register_shrinker(struct xfs_mount *mp);
 
 #endif
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 1f7e9a608f38..ac953c486b9f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -766,6 +766,10 @@ xfs_mountfs(
 		goto out_free_perag;
 	}
 
+	error = xfs_inodegc_register_shrinker(mp);
+	if (error)
+		goto out_fail_wait;
+
 	/*
 	 * Log's mount-time initialization. The first part of recovery can place
 	 * some items on the AIL, to be handled when recovery is finished or
@@ -776,7 +780,7 @@ xfs_mountfs(
 			      XFS_FSB_TO_BB(mp, sbp->sb_logblocks));
 	if (error) {
 		xfs_warn(mp, "log mount failed");
-		goto out_fail_wait;
+		goto out_inodegc_shrink;
 	}
 
 	/* Make sure the summary counts are ok. */
@@ -970,6 +974,8 @@ xfs_mountfs(
 	xfs_unmount_flush_inodes(mp);
  out_log_dealloc:
 	xfs_log_mount_cancel(mp);
+ out_inodegc_shrink:
+	unregister_shrinker(&mp->m_inodegc_shrink);
  out_fail_wait:
 	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
 		xfs_buftarg_drain(mp->m_logdev_targp);
@@ -1050,6 +1056,7 @@ xfs_unmountfs(
 #if defined(DEBUG)
 	xfs_errortag_clearall(mp);
 #endif
+	unregister_shrinker(&mp->m_inodegc_shrink);
 	xfs_free_perag(mp);
 
 	xfs_errortag_del(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index dc906b78e24c..7844b44d45ea 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -192,6 +192,7 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_inodegc_work; /* background inode inactive */
+	struct shrinker		m_inodegc_shrink;
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -219,6 +220,12 @@ typedef struct xfs_mount {
 	uint32_t		m_generation;
 	struct mutex		m_growlock;	/* growfs mutex */
 
+	/*
+	 * How many times has the memory shrinker poked us since the last time
+	 * inodegc was queued?
+	 */
+	atomic_t		m_inodegc_reclaim;
+
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected.  Replaces xfs_etest; the
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f8f05d1037d2..c8207da0bb38 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1847,6 +1847,7 @@ static int xfs_init_fs_context(
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker);
+	atomic_set(&mp->m_inodegc_reclaim, 0);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	/*
 	 * We don't create the finobt per-ag space reservation until after log
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 12ce47aebaef..eaebb070d859 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -193,6 +193,25 @@ DEFINE_FS_EVENT(xfs_inodegc_worker);
 DEFINE_FS_EVENT(xfs_inodegc_throttled);
 DEFINE_FS_EVENT(xfs_fs_sync_fs);
 
+TRACE_EVENT(xfs_inodegc_requeue_mempressure,
+	TP_PROTO(struct xfs_mount *mp, unsigned long nr, void *caller_ip),
+	TP_ARGS(mp, nr, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, nr)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->nr = nr;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d nr_to_scan %lu caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nr,
+		  __entry->caller_ip)
+);
+
 DECLARE_EVENT_CLASS(xfs_gc_queue_class,
 	TP_PROTO(struct xfs_mount *mp, unsigned int delay_ms),
 	TP_ARGS(mp, delay_ms),
@@ -214,6 +233,22 @@ DEFINE_EVENT(xfs_gc_queue_class, name,	\
 	TP_ARGS(mp, delay_ms))
 DEFINE_GC_QUEUE_EVENT(xfs_inodegc_queue);
 
+TRACE_EVENT(xfs_inodegc_throttle_mempressure,
+	TP_PROTO(struct xfs_mount *mp),
+	TP_ARGS(mp),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, votes)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->votes = atomic_read(&mp->m_inodegc_reclaim);
+	),
+	TP_printk("dev %d:%d votes %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->votes)
+);
+
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
 	TP_ARGS(mp, agno),



* [PATCH 05/20] xfs: don't throttle memory reclaim trying to queue inactive inodes
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 04/20] xfs: throttle inode inactivation queuing on memory reclaim Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 06/20] xfs: throttle inodegc queuing on backlog Darrick J. Wong
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have the means to throttle queueing of inactive inodes and
push the background workers when memory gets tight, stop forcing tasks
that are evicting inodes from a memory reclaim context to wait for the
inodes to inactivate.

There's not much reason to make reclaimers wait, because it can take
quite a long time to inactivate an inode (particularly a deleted one) and
wait for the metadata updates to push through the log before the incore
inode can be reclaimed.  In other words, memory allocations will no
longer stall on XFS when inode eviction requires metadata updates.
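
The reclaim-context test is simply the task-level reclaim marker; a
minimal illustrative sketch of the idea (the hunk below uses the same
current->reclaim_state test, everything else here is demo scaffolding):

#include <linux/sched.h>

/* Illustration only: true if the current task is running memory reclaim. */
static inline bool demo_in_memory_reclaim(void)
{
	return current->reclaim_state != NULL;
}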

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    9 +++++++++
 1 file changed, 9 insertions(+)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 3e2302a44c69..82f0db311ef9 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -355,6 +355,15 @@ xfs_inodegc_want_throttle(
 {
 	struct xfs_mount	*mp = pag->pag_mount;
 
+	/*
+	 * If we're in memory reclaim context, we don't want to wait for inode
+	 * inactivation to finish because it can take a very long time to
+	 * commit all the metadata updates and push the inodes through memory
+	 * reclamation.  Also, we might be the background inodegc thread.
+	 */
+	if (current->reclaim_state != NULL)
+		return false;
+
 	/* Throttle if memory reclaim anywhere has triggered us. */
 	if (atomic_read(&mp->m_inodegc_reclaim) > 0) {
 		trace_xfs_inodegc_throttle_mempressure(mp);



* [PATCH 06/20] xfs: throttle inodegc queuing on backlog
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 05/20] xfs: don't throttle memory reclaim trying to queue inactive inodes Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-08-02  0:45   ` Dave Chinner
  2021-07-29 18:44 ` [PATCH 07/20] xfs: queue inodegc worker immediately when memory is tight Darrick J. Wong
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Track the number of inodes in each AG that are queued for inactivation,
then use that information to decide whether to make a thread that has
queued an inode for inactivation wait for the background worker.
The purpose of this high water mark is to establish a maximum bound on
the backlog of work that can accumulate on a non-frozen filesystem.
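
For reference, the throttle decision reduces to comparing a per-AG
counter against a fixed high water mark; assuming XFS_INODES_PER_CHUNK
is 64, the default cap works out to 65536 queued inodes per AG.  A
standalone sketch with illustrative names (not the kernel code):

#define DEMO_INODES_PER_CHUNK		64
#define DEMO_INODEGC_MAX_BACKLOG	(1024 * DEMO_INODES_PER_CHUNK)	/* 65536 */

/* Illustration only: throttle queueing tasks once the AG backlog is too big. */
static inline bool demo_backlog_wants_throttle(unsigned int needs_inactive)
{
	return needs_inactive > DEMO_INODEGC_MAX_BACKLOG;
}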

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c |    1 +
 fs/xfs/libxfs/xfs_ag.h |    3 ++-
 fs/xfs/xfs_icache.c    |   16 ++++++++++++++++
 fs/xfs/xfs_trace.h     |   24 ++++++++++++++++++++++++
 4 files changed, 43 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index ee9ec0c50bec..125a4b1f5be5 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -193,6 +193,7 @@ xfs_free_perag(
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
+		ASSERT(pag->pag_ici_needs_inactive == 0);
 
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
 		xfs_iunlink_destroy(pag);
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 4c6f9045baca..ad0d3480a4a2 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -83,7 +83,8 @@ struct xfs_perag {
 
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
-	int		pag_ici_reclaimable;	/* reclaimable inodes */
+	unsigned int	pag_ici_needs_inactive;	/* inodes queued for inactivation */
+	unsigned int	pag_ici_reclaimable;	/* reclaimable inodes */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 82f0db311ef9..abd95f16b697 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -35,6 +35,12 @@
 /* Inode can be inactivated. */
 #define XFS_ICI_INODEGC_TAG	2
 
+/*
+ * Upper bound on the number of inodes in each AG that can be queued for
+ * inactivation at any given time, to avoid monopolizing the workqueue.
+ */
+#define XFS_INODEGC_MAX_BACKLOG	(1024 * XFS_INODES_PER_CHUNK)
+
 /*
  * The goal for walking incore inodes.  These can correspond with incore inode
  * radix tree tags when convenient.  Avoid existing XFS_IWALK namespace.
@@ -259,6 +265,8 @@ xfs_perag_set_inode_tag(
 
 	if (tag == XFS_ICI_RECLAIM_TAG)
 		pag->pag_ici_reclaimable++;
+	else if (tag == XFS_ICI_INODEGC_TAG)
+		pag->pag_ici_needs_inactive++;
 
 	if (was_tagged)
 		return;
@@ -306,6 +314,8 @@ xfs_perag_clear_inode_tag(
 
 	if (tag == XFS_ICI_RECLAIM_TAG)
 		pag->pag_ici_reclaimable--;
+	else if (tag == XFS_ICI_INODEGC_TAG)
+		pag->pag_ici_needs_inactive--;
 
 	if (radix_tree_tagged(&pag->pag_ici_root, tag))
 		return;
@@ -364,6 +374,12 @@ xfs_inodegc_want_throttle(
 	if (current->reclaim_state != NULL)
 		return false;
 
+	/* Enforce an upper bound on how many inodes can queue up. */
+	if (pag->pag_ici_needs_inactive > XFS_INODEGC_MAX_BACKLOG) {
+		trace_xfs_inodegc_throttle_backlog(pag);
+		return true;
+	}
+
 	/* Throttle if memory reclaim anywhere has triggered us. */
 	if (atomic_read(&mp->m_inodegc_reclaim) > 0) {
 		trace_xfs_inodegc_throttle_mempressure(mp);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index eaebb070d859..b4dfa7e7e700 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -249,6 +249,30 @@ TRACE_EVENT(xfs_inodegc_throttle_mempressure,
 		  __entry->votes)
 );
 
+DECLARE_EVENT_CLASS(xfs_inodegc_backlog_class,
+	TP_PROTO(struct xfs_perag *pag),
+	TP_ARGS(pag),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(unsigned int, needs_inactive)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->needs_inactive = pag->pag_ici_needs_inactive;
+	),
+	TP_printk("dev %d:%d agno %u needs_inactive %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->needs_inactive)
+);
+#define DEFINE_INODEGC_BACKLOG_EVENT(name)	\
+DEFINE_EVENT(xfs_inodegc_backlog_class, name,	\
+	TP_PROTO(struct xfs_perag *pag),	\
+	TP_ARGS(pag))
+DEFINE_INODEGC_BACKLOG_EVENT(xfs_inodegc_throttle_backlog);
+
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
 	TP_ARGS(mp, agno),



* [PATCH 07/20] xfs: queue inodegc worker immediately when memory is tight
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 06/20] xfs: throttle inodegc queuing on backlog Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 08/20] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

If there is enough memory pressure that we're scheduling inodes for
inactivation from a shrinker, queue the inactivation worker immediately
to try to facilitate reclaiming inodes.  This patch prepares us for
adding a configurable inodegc delay in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   34 ++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_trace.h  |    1 +
 2 files changed, 33 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index abd95f16b697..e0803544ea19 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -212,6 +212,32 @@ xfs_reclaim_work_queue(
 	rcu_read_unlock();
 }
 
+/*
+ * Compute the lag between scheduling and executing some kind of background
+ * garbage collection work.  Return value is in ms.
+ */
+static inline unsigned int
+xfs_gc_delay_ms(
+	struct xfs_mount	*mp,
+	unsigned int		tag)
+{
+	switch (tag) {
+	case XFS_ICI_INODEGC_TAG:
+		/* If we're in a shrinker, kick off the worker immediately. */
+		if (current->reclaim_state != NULL) {
+			trace_xfs_inodegc_delay_mempressure(mp,
+					__return_address);
+			return 0;
+		}
+		break;
+	default:
+		ASSERT(0);
+		return 0;
+	}
+
+	return 0;
+}
+
 /*
  * Background scanning to trim preallocated space. This is queued based on the
  * 'speculative_prealloc_lifetime' tunable (5m by default).
@@ -242,8 +268,12 @@ xfs_inodegc_queue(
 
 	rcu_read_lock();
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
-		trace_xfs_inodegc_queue(mp, 0);
-		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
+		unsigned int	delay;
+
+		delay = xfs_gc_delay_ms(mp, XFS_ICI_INODEGC_TAG);
+		trace_xfs_inodegc_queue(mp, delay);
+		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
+				msecs_to_jiffies(delay));
 	}
 	rcu_read_unlock();
 }
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b4dfa7e7e700..d3f3f6a32872 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -192,6 +192,7 @@ DEFINE_FS_EVENT(xfs_inodegc_stop);
 DEFINE_FS_EVENT(xfs_inodegc_worker);
 DEFINE_FS_EVENT(xfs_inodegc_throttled);
 DEFINE_FS_EVENT(xfs_fs_sync_fs);
+DEFINE_FS_EVENT(xfs_inodegc_delay_mempressure);
 
 TRACE_EVENT(xfs_inodegc_requeue_mempressure,
 	TP_PROTO(struct xfs_mount *mp, unsigned long nr, void *caller_ip),



* [PATCH 08/20] xfs: expose sysfs knob to control inode inactivation delay
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (6 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 07/20] xfs: queue inodegc worker immediately when memory is tight Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 09/20] xfs: reduce inactivation delay when free space is tight Darrick J. Wong
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Allow administrators to control how long we defer inode
inactivation.  By default we'll set the delay to 5 seconds, as an
arbitrary choice between allowing for some batching of a deltree
operation, and not letting too many inodes pile up in memory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/admin-guide/xfs.rst |    7 +++++++
 fs/xfs/xfs_globals.c              |    5 +++++
 fs/xfs/xfs_icache.c               |    6 +++++-
 fs/xfs/xfs_linux.h                |    1 +
 fs/xfs/xfs_sysctl.c               |    9 +++++++++
 fs/xfs/xfs_sysctl.h               |    1 +
 6 files changed, 28 insertions(+), 1 deletion(-)


diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index f9b109bfc6a6..11d3103890dc 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -277,6 +277,13 @@ The following sysctls are available for the XFS filesystem:
 	references and returns timed-out AGs back to the free stream
 	pool.
 
+  fs.xfs.inode_gc_delay_ms
+	(Units: milliseconds   Min: 0  Default: 5000  Max: 3600000)
+	The amount of time to delay cleanup work that happens after a file is
+	closed by all programs.  This involves clearing speculative
+	preallocations from linked files and freeing unlinked files.  A higher
+	value here increases batching at the risk of background work storms.
+
   fs.xfs.speculative_prealloc_lifetime
 	(Units: seconds   Min: 1  Default: 300  Max: 86400)
 	The interval at which the background scanning for inodes
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index f62fa652c2fd..e81f3a39bebc 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -28,6 +28,11 @@ xfs_param_t xfs_params = {
 	.rotorstep	= {	1,		1,		255	},
 	.inherit_nodfrg	= {	0,		1,		1	},
 	.fstrm_timer	= {	1,		30*100,		3600*100},
+
+	/* Values below here are measured in milliseconds */
+	.inodegc_ms	= {	0,		5000,		3600*1000},
+
+	/* Values below here are measured in seconds */
 	.blockgc_timer	= {	1,		300,		3600*24},
 };
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e0803544ea19..69f7fb048116 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -221,8 +221,12 @@ xfs_gc_delay_ms(
 	struct xfs_mount	*mp,
 	unsigned int		tag)
 {
+	unsigned int		default_ms;
+
 	switch (tag) {
 	case XFS_ICI_INODEGC_TAG:
+		default_ms = xfs_inodegc_ms;
+
 		/* If we're in a shrinker, kick off the worker immediately. */
 		if (current->reclaim_state != NULL) {
 			trace_xfs_inodegc_delay_mempressure(mp,
@@ -235,7 +239,7 @@ xfs_gc_delay_ms(
 		return 0;
 	}
 
-	return 0;
+	return default_ms;
 }
 
 /*
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index c174262a074e..89bafcce3579 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -99,6 +99,7 @@ typedef __u32			xfs_nlink_t;
 #define xfs_inherit_nodefrag	xfs_params.inherit_nodfrg.val
 #define xfs_fstrm_centisecs	xfs_params.fstrm_timer.val
 #define xfs_blockgc_secs	xfs_params.blockgc_timer.val
+#define xfs_inodegc_ms		xfs_params.inodegc_ms.val
 
 #define current_cpu()		(raw_smp_processor_id())
 #define current_set_flags_nested(sp, f)		\
diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c
index 546a6cd96729..6495887f4f00 100644
--- a/fs/xfs/xfs_sysctl.c
+++ b/fs/xfs/xfs_sysctl.c
@@ -176,6 +176,15 @@ static struct ctl_table xfs_table[] = {
 		.extra1		= &xfs_params.fstrm_timer.min,
 		.extra2		= &xfs_params.fstrm_timer.max,
 	},
+	{
+		.procname	= "inode_gc_delay_ms",
+		.data		= &xfs_params.inodegc_ms.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &xfs_params.inodegc_ms.min,
+		.extra2		= &xfs_params.inodegc_ms.max
+	},
 	{
 		.procname	= "speculative_prealloc_lifetime",
 		.data		= &xfs_params.blockgc_timer.val,
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index 7692e76ead33..9a867b379a1f 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -36,6 +36,7 @@ typedef struct xfs_param {
 	xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */
 	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
 	xfs_sysctl_val_t blockgc_timer;	/* Interval between blockgc scans */
+	xfs_sysctl_val_t inodegc_ms;	/* Inode inactivation scan interval */
 } xfs_param_t;
 
 /*



* [PATCH 09/20] xfs: reduce inactivation delay when free space is tight
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (7 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 08/20] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 10/20] xfs: reduce inactivation delay when quota are tight Darrick J. Wong
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have made the inactivation of unlinked inodes a background
task to increase the throughput of file deletions, we need to be a
little more careful about how long a delay we can tolerate.

On a mostly empty filesystem, the risk of the allocator making poor
decisions due to fragmentation of the free space on account of a lengthy
delay in background updates is minimal because there's plenty of space.
However, if free space is tight, we want to deallocate unlinked inodes
as quickly as possible to avoid fallocate ENOSPC and to give the
allocator the best shot at optimal allocations for new writes.

Therefore, use the same free space thresholds that we use to limit
preallocation to scale down the delay between an AG being tagged for
needing inodegc work and the inodegc worker being executed.  This follows
the same principle that XFS becomes less aggressive about allocations
(and more precise about accounting) when nearing full.
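
As a rough illustration of the scaling (a standalone sketch, not the
kernel code): assuming the default 5000 ms delay and the 1%..5% low
space thresholds computed in xfs_set_low_space_thresholds(), the delay
is quartered once free space drops below 5% and halved again for each
lower threshold crossed, so it falls to 1250 ms just under 5% free and
to roughly 78 ms under 1% free.

static unsigned int demo_gc_delay_freesp(unsigned int delay_ms,
					 unsigned long long freesp,
					 unsigned long long dblocks)
{
	unsigned int	shift = 0;
	int		pct;

	/* Shift of 2 below the 5% threshold, plus 1 per lower threshold. */
	if (freesp < dblocks * 5 / 100) {
		shift = 2;
		for (pct = 4; pct >= 1; pct--)
			if (freesp < dblocks * pct / 100)
				shift++;
	}

	return delay_ms >> shift;	/* e.g. 5000 -> 1250 at <5%, 78 at <1% */
}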

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   96 ++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_trace.h  |   38 ++++++++++++++++++++
 2 files changed, 124 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 69f7fb048116..6418e50518f8 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -212,6 +212,39 @@ xfs_reclaim_work_queue(
 	rcu_read_unlock();
 }
 
+/*
+ * Scale down the background work delay if we're low on free space.  Similar to
+ * the way that we throttle preallocations, we halve the delay time for every
+ * low free space threshold that isn't met.  Return value is in ms.
+ */
+static inline unsigned int
+xfs_gc_delay_freesp(
+	struct xfs_mount	*mp,
+	unsigned int		tag,
+	unsigned int		delay_ms)
+{
+	int64_t			freesp;
+	unsigned int		shift = 0;
+
+	freesp = percpu_counter_read_positive(&mp->m_fdblocks);
+	if (freesp < mp->m_low_space[XFS_LOWSP_5_PCNT]) {
+		shift = 2;
+		if (freesp < mp->m_low_space[XFS_LOWSP_4_PCNT])
+			shift++;
+		if (freesp < mp->m_low_space[XFS_LOWSP_3_PCNT])
+			shift++;
+		if (freesp < mp->m_low_space[XFS_LOWSP_2_PCNT])
+			shift++;
+		if (freesp < mp->m_low_space[XFS_LOWSP_1_PCNT])
+			shift++;
+	}
+
+	if (shift)
+		trace_xfs_gc_delay_fdblocks(mp, tag, shift);
+
+	return delay_ms >> shift;
+}
+
 /*
  * Compute the lag between scheduling and executing some kind of background
  * garbage collection work.  Return value is in ms.
@@ -239,7 +272,7 @@ xfs_gc_delay_ms(
 		return 0;
 	}
 
-	return default_ms;
+	return xfs_gc_delay_freesp(mp, tag, default_ms);
 }
 
 /*
@@ -265,7 +298,8 @@ xfs_blockgc_queue(
  */
 static void
 xfs_inodegc_queue(
-	struct xfs_mount        *mp)
+	struct xfs_mount        *mp,
+	struct xfs_inode	*ip)
 {
 	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
 		return;
@@ -282,14 +316,55 @@ xfs_inodegc_queue(
 	rcu_read_unlock();
 }
 
+/*
+ * Reschedule the background inactivation worker immediately if space is
+ * getting tight and the worker hasn't started running yet.
+ */
+static void
+xfs_gc_requeue_now(
+	struct xfs_mount	*mp,
+	unsigned int		tag)
+{
+	struct delayed_work	*dwork;
+	unsigned int		opflag_bit;
+	unsigned int		default_ms;
+
+	switch (tag) {
+	case XFS_ICI_INODEGC_TAG:
+		dwork = &mp->m_inodegc_work;
+		default_ms = xfs_inodegc_ms;
+		opflag_bit = XFS_OPFLAG_INODEGC_RUNNING_BIT;
+		break;
+	default:
+		return;
+	}
+
+	if (!delayed_work_pending(dwork) ||
+	    !test_bit(opflag_bit, &mp->m_opflags))
+		return;
+
+	rcu_read_lock();
+	if (!radix_tree_tagged(&mp->m_perag_tree, tag))
+		goto unlock;
+
+	if (xfs_gc_delay_ms(mp, tag) == default_ms)
+		goto unlock;
+
+	trace_xfs_gc_requeue_now(mp, tag);
+	queue_delayed_work(mp->m_gc_workqueue, dwork, 0);
+unlock:
+	rcu_read_unlock();
+}
+
 /* Set a tag on both the AG incore inode tree and the AG radix tree. */
 static void
 xfs_perag_set_inode_tag(
 	struct xfs_perag	*pag,
-	xfs_agino_t		agino,
+	struct xfs_inode	*ip,
 	unsigned int		tag)
 {
 	struct xfs_mount	*mp = pag->pag_mount;
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	bool			was_tagged;
 
 	lockdep_assert_held(&pag->pag_ici_lock);
@@ -302,8 +377,10 @@ xfs_perag_set_inode_tag(
 	else if (tag == XFS_ICI_INODEGC_TAG)
 		pag->pag_ici_needs_inactive++;
 
-	if (was_tagged)
+	if (was_tagged) {
+		xfs_gc_requeue_now(mp, tag);
 		return;
+	}
 
 	/* propagate the tag up into the perag radix tree */
 	spin_lock(&mp->m_perag_lock);
@@ -319,7 +396,7 @@ xfs_perag_set_inode_tag(
 		xfs_blockgc_queue(pag);
 		break;
 	case XFS_ICI_INODEGC_TAG:
-		xfs_inodegc_queue(mp);
+		xfs_inodegc_queue(mp, ip);
 		break;
 	}
 
@@ -479,7 +556,7 @@ xfs_inode_mark_reclaimable(
 		tag = XFS_ICI_RECLAIM_TAG;
 	}
 
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
+	xfs_perag_set_inode_tag(pag, ip, tag);
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -1367,8 +1444,7 @@ xfs_blockgc_set_iflag(
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
 
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_BLOCKGC_TAG);
+	xfs_perag_set_inode_tag(pag, ip, XFS_ICI_BLOCKGC_TAG);
 
 	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
@@ -1849,7 +1925,7 @@ xfs_inodegc_inactivate(
 	ip->i_flags |= XFS_IRECLAIMABLE;
 
 	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_INODEGC_TAG);
-	xfs_perag_set_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
+	xfs_perag_set_inode_tag(pag, ip, XFS_ICI_RECLAIM_TAG);
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -1915,7 +1991,7 @@ xfs_inodegc_start(
 		return;
 
 	trace_xfs_inodegc_start(mp, __return_address);
-	xfs_inodegc_queue(mp);
+	xfs_inodegc_queue(mp, NULL);
 }
 
 /*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d3f3f6a32872..2092a8542862 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -213,6 +213,28 @@ TRACE_EVENT(xfs_inodegc_requeue_mempressure,
 		  __entry->caller_ip)
 );
 
+TRACE_EVENT(xfs_gc_delay_fdblocks,
+	TP_PROTO(struct xfs_mount *mp, unsigned int tag, unsigned int shift),
+	TP_ARGS(mp, tag, shift),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, fdblocks)
+		__field(unsigned int, tag)
+		__field(unsigned int, shift)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->fdblocks = percpu_counter_read(&mp->m_fdblocks);
+		__entry->tag = tag;
+		__entry->shift = shift;
+	),
+	TP_printk("dev %d:%d tag %u shift %u fdblocks %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tag,
+		  __entry->shift,
+		  __entry->fdblocks)
+);
+
 DECLARE_EVENT_CLASS(xfs_gc_queue_class,
 	TP_PROTO(struct xfs_mount *mp, unsigned int delay_ms),
 	TP_ARGS(mp, delay_ms),
@@ -234,6 +256,22 @@ DEFINE_EVENT(xfs_gc_queue_class, name,	\
 	TP_ARGS(mp, delay_ms))
 DEFINE_GC_QUEUE_EVENT(xfs_inodegc_queue);
 
+TRACE_EVENT(xfs_gc_requeue_now,
+	TP_PROTO(struct xfs_mount *mp, unsigned int tag),
+	TP_ARGS(mp, tag),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, tag)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->tag = tag;
+	),
+	TP_printk("dev %d:%d tag %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tag)
+);
+
 TRACE_EVENT(xfs_inodegc_throttle_mempressure,
 	TP_PROTO(struct xfs_mount *mp),
 	TP_ARGS(mp),



* [PATCH 10/20] xfs: reduce inactivation delay when quota are tight
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (8 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 09/20] xfs: reduce inactivation delay when free space is tight Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 11/20] xfs: reduce inactivation delay when realtime extents " Darrick J. Wong
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Implement the same scaling down of inodegc delays when we're tight on
quota.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_dquot.h  |   10 ++++++
 fs/xfs/xfs_icache.c |   86 ++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_trace.h  |   34 ++++++++++++++++++++
 3 files changed, 125 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index f642884a6834..6b5e3cf40c8b 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -54,6 +54,16 @@ struct xfs_dquot_res {
 	xfs_qwarncnt_t		warnings;
 };
 
+static inline bool
+xfs_dquot_res_over_limits(
+	const struct xfs_dquot_res	*qres)
+{
+	if ((qres->softlimit && qres->softlimit < qres->reserved) ||
+	    (qres->hardlimit && qres->hardlimit < qres->reserved))
+		return true;
+	return false;
+}
+
 /*
  * The incore dquot structure
  */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 6418e50518f8..7ba80d7bff41 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -212,6 +212,73 @@ xfs_reclaim_work_queue(
 	rcu_read_unlock();
 }
 
+/*
+ * Scale down the background work delay if we're close to a quota limit.
+ * Similar to the way that we throttle preallocations, we halve the delay time
+ * for every low free space threshold that isn't met, and we zero it if we're
+ * over the hard limit.  Return value is in ms.
+ */
+static inline unsigned int
+xfs_gc_delay_dquot(
+	struct xfs_inode	*ip,
+	xfs_dqtype_t		type,
+	unsigned int		tag,
+	unsigned int		delay_ms)
+{
+	struct xfs_dquot	*dqp;
+	int64_t			freesp;
+	unsigned int		shift = 0;
+
+	if (!ip)
+		goto out;
+
+	/*
+	 * Leave the delay untouched if there are no quota limits to enforce.
+	 * These comparisons are done locklessly because at worst we schedule
+	 * background work sooner than necessary.
+	 */
+	dqp = xfs_inode_dquot(ip, type);
+	if (!dqp || !xfs_dquot_is_enforced(dqp))
+		goto out;
+
+	if (xfs_dquot_res_over_limits(&dqp->q_ino) ||
+	    xfs_dquot_res_over_limits(&dqp->q_rtb)) {
+		trace_xfs_gc_delay_dquot(dqp, tag, 32);
+		return 0;
+	}
+
+	/* no hi watermark, no throttle */
+	if (!dqp->q_prealloc_hi_wmark)
+		goto out;
+
+	/* under the lo watermark, no throttle */
+	if (dqp->q_blk.reserved < dqp->q_prealloc_lo_wmark)
+		goto out;
+
+	/* If we're over the hard limit, run immediately. */
+	if (dqp->q_blk.reserved >= dqp->q_prealloc_hi_wmark) {
+		trace_xfs_gc_delay_dquot(dqp, tag, 32);
+		return 0;
+	}
+
+	/* Scale down the delay if we're close to the soft limits. */
+	freesp = dqp->q_prealloc_hi_wmark - dqp->q_blk.reserved;
+	if (freesp < dqp->q_low_space[XFS_QLOWSP_5_PCNT]) {
+		shift = 2;
+		if (freesp < dqp->q_low_space[XFS_QLOWSP_3_PCNT])
+			shift += 2;
+		if (freesp < dqp->q_low_space[XFS_QLOWSP_1_PCNT])
+			shift += 2;
+	}
+
+	if (shift)
+		trace_xfs_gc_delay_dquot(dqp, tag, shift);
+
+	delay_ms >>= shift;
+out:
+	return delay_ms;
+}
+
 /*
  * Scale down the background work delay if we're low on free space.  Similar to
  * the way that we throttle preallocations, we halve the delay time for every
@@ -247,14 +314,17 @@ xfs_gc_delay_freesp(
 
 /*
  * Compute the lag between scheduling and executing some kind of background
- * garbage collection work.  Return value is in ms.
+ * garbage collection work.  Return value is in ms.  If an inode is passed in,
+ * its dquots will be considered in the lag computation.
  */
 static inline unsigned int
 xfs_gc_delay_ms(
 	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
 	unsigned int		tag)
 {
 	unsigned int		default_ms;
+	unsigned int		udelay, gdelay, pdelay, fdelay;
 
 	switch (tag) {
 	case XFS_ICI_INODEGC_TAG:
@@ -272,7 +342,12 @@ xfs_gc_delay_ms(
 		return 0;
 	}
 
-	return xfs_gc_delay_freesp(mp, tag, default_ms);
+	udelay = xfs_gc_delay_dquot(ip, XFS_DQTYPE_USER, tag, default_ms);
+	gdelay = xfs_gc_delay_dquot(ip, XFS_DQTYPE_GROUP, tag, default_ms);
+	pdelay = xfs_gc_delay_dquot(ip, XFS_DQTYPE_PROJ, tag, default_ms);
+	fdelay = xfs_gc_delay_freesp(mp, tag, default_ms);
+
+	return min(min(udelay, gdelay), min(pdelay, fdelay));
 }
 
 /*
@@ -308,7 +383,7 @@ xfs_inodegc_queue(
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
 		unsigned int	delay;
 
-		delay = xfs_gc_delay_ms(mp, XFS_ICI_INODEGC_TAG);
+		delay = xfs_gc_delay_ms(mp, ip, XFS_ICI_INODEGC_TAG);
 		trace_xfs_inodegc_queue(mp, delay);
 		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
 				msecs_to_jiffies(delay));
@@ -323,6 +398,7 @@ xfs_inodegc_queue(
 static void
 xfs_gc_requeue_now(
 	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
 	unsigned int		tag)
 {
 	struct delayed_work	*dwork;
@@ -347,7 +423,7 @@ xfs_gc_requeue_now(
 	if (!radix_tree_tagged(&mp->m_perag_tree, tag))
 		goto unlock;
 
-	if (xfs_gc_delay_ms(mp, tag) == default_ms)
+	if (xfs_gc_delay_ms(mp, ip, tag) == default_ms)
 		goto unlock;
 
 	trace_xfs_gc_requeue_now(mp, tag);
@@ -378,7 +454,7 @@ xfs_perag_set_inode_tag(
 		pag->pag_ici_needs_inactive++;
 
 	if (was_tagged) {
-		xfs_gc_requeue_now(mp, tag);
+		xfs_gc_requeue_now(mp, ip, tag);
 		return;
 	}
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 2092a8542862..001fd202dbfb 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -213,6 +213,40 @@ TRACE_EVENT(xfs_inodegc_requeue_mempressure,
 		  __entry->caller_ip)
 );
 
+TRACE_EVENT(xfs_gc_delay_dquot,
+	TP_PROTO(struct xfs_dquot *dqp, unsigned int tag, unsigned int shift),
+	TP_ARGS(dqp, tag, shift),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u32, id)
+		__field(xfs_dqtype_t, type)
+		__field(unsigned int, tag)
+		__field(unsigned int, shift)
+		__field(unsigned long long, reserved)
+		__field(unsigned long long, hi_mark)
+		__field(unsigned long long, lo_mark)
+	),
+	TP_fast_assign(
+		__entry->dev = dqp->q_mount->m_super->s_dev;
+		__entry->id = dqp->q_id;
+		__entry->type = dqp->q_type;
+		__entry->reserved = dqp->q_blk.reserved;
+		__entry->hi_mark = dqp->q_prealloc_hi_wmark;
+		__entry->lo_mark = dqp->q_prealloc_lo_wmark;
+		__entry->tag = tag;
+		__entry->shift = shift;
+	),
+	TP_printk("dev %d:%d tag %u shift %u dqid 0x%x dqtype %s reserved %llu hi %llu lo %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tag,
+		  __entry->shift,
+		  __entry->id,
+		  __print_flags(__entry->type, "|", XFS_DQTYPE_STRINGS),
+		  __entry->reserved,
+		  __entry->hi_mark,
+		  __entry->lo_mark)
+);
+
 TRACE_EVENT(xfs_gc_delay_fdblocks,
 	TP_PROTO(struct xfs_mount *mp, unsigned int tag, unsigned int shift),
 	TP_ARGS(mp, tag, shift),



* [PATCH 11/20] xfs: reduce inactivation delay when realtime extents are tight
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (9 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 10/20] xfs: reduce inactivation delay when quota are tight Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:44 ` [PATCH 12/20] xfs: inactivate inodes any time we try to free speculative preallocations Darrick J. Wong
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Implement the same scaling down of inodegc delays when we're tight on
realtime extents that we do for free blocks on the data device.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_mount.c  |   13 ++++++++-----
 fs/xfs/xfs_mount.h  |    1 +
 fs/xfs/xfs_trace.h  |   22 ++++++++++++++++++++++
 4 files changed, 80 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7ba80d7bff41..91a1dc7eb352 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -279,6 +279,47 @@ xfs_gc_delay_dquot(
 	return delay_ms;
 }
 
+/*
+ * Scale down the background work delay if we're low on free rt extents.
+ * Return value is in ms.
+ */
+static inline unsigned int
+xfs_gc_delay_freertx(
+	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
+	unsigned int		tag,
+	unsigned int		delay_ms)
+{
+	int64_t			freertx;
+	unsigned int		shift = 0;
+
+	if (ip && !XFS_IS_REALTIME_INODE(ip))
+		return delay_ms;
+	if (!xfs_sb_version_hasrealtime(&mp->m_sb))
+		return delay_ms;
+
+	spin_lock(&mp->m_sb_lock);
+	freertx = mp->m_sb.sb_frextents;
+	spin_unlock(&mp->m_sb_lock);
+
+	if (freertx < mp->m_low_rtexts[XFS_LOWSP_5_PCNT]) {
+		shift = 2;
+		if (freertx < mp->m_low_rtexts[XFS_LOWSP_4_PCNT])
+			shift++;
+		if (freertx < mp->m_low_rtexts[XFS_LOWSP_3_PCNT])
+			shift++;
+		if (freertx < mp->m_low_rtexts[XFS_LOWSP_2_PCNT])
+			shift++;
+		if (freertx < mp->m_low_rtexts[XFS_LOWSP_1_PCNT])
+			shift++;
+	}
+
+	if (shift)
+		trace_xfs_gc_delay_frextents(mp, tag, shift);
+
+	return delay_ms >> shift;
+}
+
 /*
  * Scale down the background work delay if we're low on free space.  Similar to
  * the way that we throttle preallocations, we halve the delay time for every
@@ -324,7 +365,7 @@ xfs_gc_delay_ms(
 	unsigned int		tag)
 {
 	unsigned int		default_ms;
-	unsigned int		udelay, gdelay, pdelay, fdelay;
+	unsigned int		udelay, gdelay, pdelay, fdelay, rdelay;
 
 	switch (tag) {
 	case XFS_ICI_INODEGC_TAG:
@@ -346,8 +387,14 @@ xfs_gc_delay_ms(
 	gdelay = xfs_gc_delay_dquot(ip, XFS_DQTYPE_GROUP, tag, default_ms);
 	pdelay = xfs_gc_delay_dquot(ip, XFS_DQTYPE_PROJ, tag, default_ms);
 	fdelay = xfs_gc_delay_freesp(mp, tag, default_ms);
+	rdelay = xfs_gc_delay_freertx(mp, ip, tag, default_ms);
 
-	return min(min(udelay, gdelay), min(pdelay, fdelay));
+	udelay = min(udelay, gdelay);
+	pdelay = min(pdelay, fdelay);
+
+	udelay = min(udelay, pdelay);
+
+	return min(udelay, rdelay);
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index ac953c486b9f..32b46593a169 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -365,13 +365,16 @@ void
 xfs_set_low_space_thresholds(
 	struct xfs_mount	*mp)
 {
-	int i;
+	uint64_t		dblocks = mp->m_sb.sb_dblocks;
+	uint64_t		rtexts = mp->m_sb.sb_rextents;
+	int			i;
+
+	do_div(dblocks, 100);
+	do_div(rtexts, 100);
 
 	for (i = 0; i < XFS_LOWSP_MAX; i++) {
-		uint64_t space = mp->m_sb.sb_dblocks;
-
-		do_div(space, 100);
-		mp->m_low_space[i] = space * (i + 1);
+		mp->m_low_space[i] = dblocks * (i + 1);
+		mp->m_low_rtexts[i] = rtexts * (i + 1);
 	}
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7844b44d45ea..225b3d289336 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -133,6 +133,7 @@ typedef struct xfs_mount {
 	uint			m_qflags;	/* quota status flags */
 	uint64_t		m_flags;	/* global mount flags */
 	int64_t			m_low_space[XFS_LOWSP_MAX];
+	int64_t			m_low_rtexts[XFS_LOWSP_MAX];
 	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
 						/* low free space thresholds */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 001fd202dbfb..0579775e1e15 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -269,6 +269,28 @@ TRACE_EVENT(xfs_gc_delay_fdblocks,
 		  __entry->fdblocks)
 );
 
+TRACE_EVENT(xfs_gc_delay_frextents,
+	TP_PROTO(struct xfs_mount *mp, unsigned int tag, unsigned int shift),
+	TP_ARGS(mp, tag, shift),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, frextents)
+		__field(unsigned int, tag)
+		__field(unsigned int, shift)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->frextents = mp->m_sb.sb_frextents;
+		__entry->tag = tag;
+		__entry->shift = shift;
+	),
+	TP_printk("dev %d:%d tag %u shift %u frextents %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tag,
+		  __entry->shift,
+		  __entry->frextents)
+);
+
 DECLARE_EVENT_CLASS(xfs_gc_queue_class,
 	TP_PROTO(struct xfs_mount *mp, unsigned int delay_ms),
 	TP_ARGS(mp, delay_ms),



* [PATCH 12/20] xfs: inactivate inodes any time we try to free speculative preallocations
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (10 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 11/20] xfs: reduce inactivation delay when realtime extents " Darrick J. Wong
@ 2021-07-29 18:44 ` Darrick J. Wong
  2021-07-29 18:45 ` [PATCH 13/20] xfs: flush inode inactivation work when compiling usage statistics Darrick J. Wong
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:44 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Other parts of XFS have learned to call xfs_blockgc_free_{space,quota}
to try to free speculative preallocations when space is tight.  This
means that file writes, transaction reservation failures, quota limit
enforcement, and the EOFBLOCKS ioctl all call this function to free
space when things are tight.

Since inode inactivation is now a background task, the filesystem can be
holding on to space that has been unlinked but not yet freed.  Add
this to the list of things that xfs_blockgc_free_* makes writer threads
scan for when they cannot reserve space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 91a1dc7eb352..3501f04d0914 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1836,16 +1836,23 @@ xfs_blockgc_worker(
 }
 
 /*
- * Try to free space in the filesystem by purging eofblocks and cowblocks.
+ * Try to free space in the filesystem by purging inactive inodes, eofblocks
+ * and cowblocks.
  */
 int
 xfs_blockgc_free_space(
 	struct xfs_mount	*mp,
 	struct xfs_icwalk	*icw)
 {
+	int			error;
+
 	trace_xfs_blockgc_free_space(mp, icw, _RET_IP_);
 
-	return xfs_icwalk(mp, XFS_ICWALK_BLOCKGC, icw);
+	error = xfs_icwalk(mp, XFS_ICWALK_BLOCKGC, icw);
+	if (error)
+		return error;
+
+	return xfs_icwalk(mp, XFS_ICWALK_INODEGC, icw);
 }
 
 /*



* [PATCH 13/20] xfs: flush inode inactivation work when compiling usage statistics
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (11 preceding siblings ...)
  2021-07-29 18:44 ` [PATCH 12/20] xfs: inactivate inodes any time we try to free speculative preallocations Darrick J. Wong
@ 2021-07-29 18:45 ` Darrick J. Wong
  2021-07-29 18:45 ` [PATCH 14/20] xfs: parallelize inode inactivation Darrick J. Wong
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:45 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Users have come to expect that the space accounting information in
statfs and getquota reports is fairly accurate.  Now that we inactivate
inodes from a background queue, these numbers can be thrown off by
whatever resources are singly-owned by the inodes in the queue.  Flush
the pending inactivations when userspace asks for a space usage report.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_qm_syscalls.c |    8 ++++++++
 fs/xfs/xfs_super.c       |    3 +++
 2 files changed, 11 insertions(+)


diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 982cd6613a4c..c6902f9d064c 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -481,6 +481,10 @@ xfs_qm_scall_getquota(
 	struct xfs_dquot	*dqp;
 	int			error;
 
+	/* Flush inodegc work at the start of a quota reporting scan. */
+	if (id == 0)
+		xfs_inodegc_flush(mp);
+
 	/*
 	 * Try to get the dquot. We don't want it allocated on disk, so don't
 	 * set doalloc. If it doesn't exist, we'll get ENOENT back.
@@ -519,6 +523,10 @@ xfs_qm_scall_getquota_next(
 	struct xfs_dquot	*dqp;
 	int			error;
 
+	/* Flush inodegc work at the start of a quota reporting scan. */
+	if (*id == 0)
+		xfs_inodegc_flush(mp);
+
 	error = xfs_qm_dqget_next(mp, *id, type, &dqp);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c8207da0bb38..1f82726d6265 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -757,6 +757,9 @@ xfs_fs_statfs(
 	xfs_extlen_t		lsize;
 	int64_t			ffree;
 
+	/* Wait for whatever inactivations are in progress. */
+	xfs_inodegc_flush(mp);
+
 	statp->f_type = XFS_SUPER_MAGIC;
 	statp->f_namelen = MAXNAMELEN - 1;
 



* [PATCH 14/20] xfs: parallelize inode inactivation
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (12 preceding siblings ...)
  2021-07-29 18:45 ` [PATCH 13/20] xfs: flush inode inactivation work when compiling usage statistics Darrick J. Wong
@ 2021-07-29 18:45 ` Darrick J. Wong
  2021-08-02  0:55   ` Dave Chinner
  2021-07-29 18:45 ` [PATCH 15/20] xfs: reduce inactivation delay when AG free space are tight Darrick J. Wong
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:45 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Split the inode inactivation work into per-AG work items so that we can
take advantage of parallelization.
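
The heart of the change is the data structure move: the single
per-mount delayed_work becomes one delayed_work per xfs_perag, so the
workqueue can process different AGs concurrently.  A minimal sketch of
the pattern with illustrative names (not the kernel code):

#include <linux/workqueue.h>

struct demo_ag {
	unsigned int		agno;
	struct delayed_work	inodegc_work;	/* one work item per AG */
};

/* Queue every AG's work item; busy AGs no longer serialize each other. */
static void demo_queue_all_ags(struct workqueue_struct *wq,
			       struct demo_ag *ags, unsigned int agcount,
			       unsigned long delay)
{
	unsigned int	i;

	for (i = 0; i < agcount; i++)
		queue_delayed_work(wq, &ags[i].inodegc_work, delay);
}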

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c |   12 ++++++-
 fs/xfs/libxfs/xfs_ag.h |   10 +++++
 fs/xfs/xfs_icache.c    |   88 ++++++++++++++++++++++++++++--------------------
 fs/xfs/xfs_icache.h    |    2 +
 fs/xfs/xfs_mount.c     |    9 +----
 fs/xfs/xfs_mount.h     |    8 ----
 fs/xfs/xfs_super.c     |    2 -
 fs/xfs/xfs_trace.h     |   82 ++++++++++++++++++++++++++++++++-------------
 8 files changed, 134 insertions(+), 79 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 125a4b1f5be5..f000644e5da3 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -173,6 +173,7 @@ __xfs_free_perag(
 	struct xfs_perag *pag = container_of(head, struct xfs_perag, rcu_head);
 
 	ASSERT(!delayed_work_pending(&pag->pag_blockgc_work));
+	ASSERT(!delayed_work_pending(&pag->pag_inodegc_work));
 	ASSERT(atomic_read(&pag->pag_ref) == 0);
 	kmem_free(pag);
 }
@@ -195,7 +196,9 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		ASSERT(pag->pag_ici_needs_inactive == 0);
 
+		unregister_shrinker(&pag->pag_inodegc_shrink);
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
+		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 
@@ -254,15 +257,20 @@ xfs_initialize_perag(
 		spin_lock_init(&pag->pagb_lock);
 		spin_lock_init(&pag->pag_state_lock);
 		INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker);
+		INIT_DELAYED_WORK(&pag->pag_inodegc_work, xfs_inodegc_worker);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		init_waitqueue_head(&pag->pagb_wait);
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
 
-		error = xfs_buf_hash_init(pag);
+		error = xfs_inodegc_register_shrinker(pag);
 		if (error)
 			goto out_remove_pag;
 
+		error = xfs_buf_hash_init(pag);
+		if (error)
+			goto out_inodegc_shrink;
+
 		error = xfs_iunlink_init(pag);
 		if (error)
 			goto out_hash_destroy;
@@ -282,6 +290,8 @@ xfs_initialize_perag(
 
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
+out_inodegc_shrink:
+	unregister_shrinker(&pag->pag_inodegc_shrink);
 out_remove_pag:
 	radix_tree_delete(&mp->m_perag_tree, index);
 out_free_pag:
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index ad0d3480a4a2..28db7fc4ebc0 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -81,6 +81,12 @@ struct xfs_perag {
 
 	atomic_t        pagf_fstrms;    /* # of filestreams active in this AG */
 
+	/*
+	 * How many times has the memory shrinker poked us since the last time
+	 * inodegc was queued?
+	 */
+	atomic_t	pag_inodegc_reclaim;
+
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	unsigned int	pag_ici_needs_inactive;	/* inodes queued for inactivation */
@@ -97,6 +103,10 @@ struct xfs_perag {
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;
 
+	/* background inode inactivation */
+	struct delayed_work	pag_inodegc_work;
+	struct shrinker		pag_inodegc_shrink;
+
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 3501f04d0914..6e9ca483c100 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -420,9 +420,11 @@ xfs_blockgc_queue(
  */
 static void
 xfs_inodegc_queue(
-	struct xfs_mount        *mp,
+	struct xfs_perag	*pag,
 	struct xfs_inode	*ip)
 {
+	struct xfs_mount        *mp = pag->pag_mount;
+
 	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
 		return;
 
@@ -431,8 +433,8 @@ xfs_inodegc_queue(
 		unsigned int	delay;
 
 		delay = xfs_gc_delay_ms(mp, ip, XFS_ICI_INODEGC_TAG);
-		trace_xfs_inodegc_queue(mp, delay);
-		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
+		trace_xfs_inodegc_queue(pag, delay);
+		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work,
 				msecs_to_jiffies(delay));
 	}
 	rcu_read_unlock();
@@ -444,17 +446,18 @@ xfs_inodegc_queue(
  */
 static void
 xfs_gc_requeue_now(
-	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
 	struct xfs_inode	*ip,
 	unsigned int		tag)
 {
 	struct delayed_work	*dwork;
+	struct xfs_mount	*mp = pag->pag_mount;
 	unsigned int		opflag_bit;
 	unsigned int		default_ms;
 
 	switch (tag) {
 	case XFS_ICI_INODEGC_TAG:
-		dwork = &mp->m_inodegc_work;
+		dwork = &pag->pag_inodegc_work;
 		default_ms = xfs_inodegc_ms;
 		opflag_bit = XFS_OPFLAG_INODEGC_RUNNING_BIT;
 		break;
@@ -473,7 +476,7 @@ xfs_gc_requeue_now(
 	if (xfs_gc_delay_ms(mp, ip, tag) == default_ms)
 		goto unlock;
 
-	trace_xfs_gc_requeue_now(mp, tag);
+	trace_xfs_gc_requeue_now(pag, tag);
 	queue_delayed_work(mp->m_gc_workqueue, dwork, 0);
 unlock:
 	rcu_read_unlock();
@@ -501,7 +504,7 @@ xfs_perag_set_inode_tag(
 		pag->pag_ici_needs_inactive++;
 
 	if (was_tagged) {
-		xfs_gc_requeue_now(mp, ip, tag);
+		xfs_gc_requeue_now(pag, ip, tag);
 		return;
 	}
 
@@ -519,7 +522,7 @@ xfs_perag_set_inode_tag(
 		xfs_blockgc_queue(pag);
 		break;
 	case XFS_ICI_INODEGC_TAG:
-		xfs_inodegc_queue(mp, ip);
+		xfs_inodegc_queue(pag, ip);
 		break;
 	}
 
@@ -597,8 +600,6 @@ static inline bool
 xfs_inodegc_want_throttle(
 	struct xfs_perag	*pag)
 {
-	struct xfs_mount	*mp = pag->pag_mount;
-
 	/*
 	 * If we're in memory reclaim context, we don't want to wait for inode
 	 * inactivation to finish because it can take a very long time to
@@ -615,8 +616,8 @@ xfs_inodegc_want_throttle(
 	}
 
 	/* Throttle if memory reclaim anywhere has triggered us. */
-	if (atomic_read(&mp->m_inodegc_reclaim) > 0) {
-		trace_xfs_inodegc_throttle_mempressure(mp);
+	if (atomic_read(&pag->pag_inodegc_reclaim) > 0) {
+		trace_xfs_inodegc_throttle_mempressure(pag);
 		return true;
 	}
 
@@ -683,10 +684,11 @@ xfs_inode_mark_reclaimable(
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
+
+	if (flush_inodegc && flush_work(&pag->pag_inodegc_work.work))
+		trace_xfs_inodegc_throttled(pag, __return_address);
+
 	xfs_perag_put(pag);
-
-	if (flush_inodegc && flush_work(&mp->m_inodegc_work.work))
-		trace_xfs_inodegc_throttled(mp, __return_address);
 }
 
 static inline void
@@ -2066,23 +2068,23 @@ void
 xfs_inodegc_worker(
 	struct work_struct	*work)
 {
-	struct xfs_mount	*mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_inodegc_work);
+	struct xfs_perag	*pag = container_of(to_delayed_work(work),
+					struct xfs_perag, pag_inodegc_work);
 
 	/*
 	 * Inactivation never returns error codes and never fails to push a
 	 * tagged inode to reclaim.  Loop until there's nothing left.
 	 */
-	while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
-		trace_xfs_inodegc_worker(mp, __return_address);
-		xfs_icwalk(mp, XFS_ICWALK_INODEGC, NULL);
+	while (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INODEGC_TAG)) {
+		trace_xfs_inodegc_worker(pag, __return_address);
+		xfs_icwalk_ag(pag, XFS_ICWALK_INODEGC, NULL);
 	}
 
 	/*
 	 * We inactivated all the inodes we could, so disable the throttling
 	 * of new inactivations that happens when memory gets tight.
 	 */
-	atomic_set(&mp->m_inodegc_reclaim, 0);
+	atomic_set(&pag->pag_inodegc_reclaim, 0);
 }
 
 /*
@@ -2093,8 +2095,13 @@ void
 xfs_inodegc_flush(
 	struct xfs_mount	*mp)
 {
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
 	trace_xfs_inodegc_flush(mp, __return_address);
-	flush_delayed_work(&mp->m_inodegc_work);
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INODEGC_TAG)
+		flush_delayed_work(&pag->pag_inodegc_work);
 }
 
 /* Disable the inode inactivation background worker and wait for it to stop. */
@@ -2102,10 +2109,14 @@ void
 xfs_inodegc_stop(
 	struct xfs_mount	*mp)
 {
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
 	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
 		return;
 
-	cancel_delayed_work_sync(&mp->m_inodegc_work);
+	for_each_perag(mp, agno, pag)
+		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 	trace_xfs_inodegc_stop(mp, __return_address);
 }
 
@@ -2117,11 +2128,15 @@ void
 xfs_inodegc_start(
 	struct xfs_mount	*mp)
 {
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
 	if (test_and_set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
 		return;
 
 	trace_xfs_inodegc_start(mp, __return_address);
-	xfs_inodegc_queue(mp, NULL);
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INODEGC_TAG)
+		xfs_inodegc_queue(pag, NULL);
 }
 
 /*
@@ -2140,11 +2155,11 @@ xfs_inodegc_shrink_count(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
-	struct xfs_mount	*mp;
+	struct xfs_perag	*pag;
 
-	mp = container_of(shrink, struct xfs_mount, m_inodegc_shrink);
+	pag = container_of(shrink, struct xfs_perag, pag_inodegc_shrink);
 
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
+	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INODEGC_TAG))
 		return XFS_INODEGC_SHRINK_COUNT;
 
 	return 0;
@@ -2155,7 +2170,7 @@ xfs_inodegc_shrink_scan(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
-	struct xfs_mount	*mp;
+	struct xfs_perag	*pag;
 
 	/*
 	 * Inode inactivation work requires NOFS allocations, so don't make
@@ -2164,14 +2179,15 @@ xfs_inodegc_shrink_scan(
 	if (!(sc->gfp_mask & __GFP_FS))
 		return SHRINK_STOP;
 
-	mp = container_of(shrink, struct xfs_mount, m_inodegc_shrink);
+	pag = container_of(shrink, struct xfs_perag, pag_inodegc_shrink);
 
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
-		trace_xfs_inodegc_requeue_mempressure(mp, sc->nr_to_scan,
+	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INODEGC_TAG)) {
+		struct xfs_mount *mp = pag->pag_mount;
+
+		trace_xfs_inodegc_requeue_mempressure(pag, sc->nr_to_scan,
 				__return_address);
-
-		atomic_inc(&mp->m_inodegc_reclaim);
-		mod_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
+		atomic_inc(&pag->pag_inodegc_reclaim);
+		mod_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, 0);
 	}
 
 	return 0;
@@ -2180,9 +2196,9 @@ xfs_inodegc_shrink_scan(
 /* Register a shrinker so we can accelerate inodegc and throttle queuing. */
 int
 xfs_inodegc_register_shrinker(
-	struct xfs_mount	*mp)
+	struct xfs_perag	*pag)
 {
-	struct shrinker		*shrink = &mp->m_inodegc_shrink;
+	struct shrinker		*shrink = &pag->pag_inodegc_shrink;
 
 	shrink->count_objects = xfs_inodegc_shrink_count;
 	shrink->scan_objects = xfs_inodegc_shrink_scan;
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index e38c8bc5461f..7622efe6fd58 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -78,6 +78,6 @@ void xfs_inodegc_worker(struct work_struct *work);
 void xfs_inodegc_flush(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
-int xfs_inodegc_register_shrinker(struct xfs_mount *mp);
+int xfs_inodegc_register_shrinker(struct xfs_perag *pag);
 
 #endif
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 32b46593a169..37afb0e0d879 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -769,10 +769,6 @@ xfs_mountfs(
 		goto out_free_perag;
 	}
 
-	error = xfs_inodegc_register_shrinker(mp);
-	if (error)
-		goto out_fail_wait;
-
 	/*
 	 * Log's mount-time initialization. The first part of recovery can place
 	 * some items on the AIL, to be handled when recovery is finished or
@@ -783,7 +779,7 @@ xfs_mountfs(
 			      XFS_FSB_TO_BB(mp, sbp->sb_logblocks));
 	if (error) {
 		xfs_warn(mp, "log mount failed");
-		goto out_inodegc_shrink;
+		goto out_fail_wait;
 	}
 
 	/* Make sure the summary counts are ok. */
@@ -977,8 +973,6 @@ xfs_mountfs(
 	xfs_unmount_flush_inodes(mp);
  out_log_dealloc:
 	xfs_log_mount_cancel(mp);
- out_inodegc_shrink:
-	unregister_shrinker(&mp->m_inodegc_shrink);
  out_fail_wait:
 	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
 		xfs_buftarg_drain(mp->m_logdev_targp);
@@ -1059,7 +1053,6 @@ xfs_unmountfs(
 #if defined(DEBUG)
 	xfs_errortag_clearall(mp);
 #endif
-	unregister_shrinker(&mp->m_inodegc_shrink);
 	xfs_free_perag(mp);
 
 	xfs_errortag_del(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 225b3d289336..edd5c4fd6533 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -192,8 +192,6 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
-	struct delayed_work	m_inodegc_work; /* background inode inactive */
-	struct shrinker		m_inodegc_shrink;
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -221,12 +219,6 @@ typedef struct xfs_mount {
 	uint32_t		m_generation;
 	struct mutex		m_growlock;	/* growfs mutex */
 
-	/*
-	 * How many times has the memory shrinker poked us since the last time
-	 * inodegc was queued?
-	 */
-	atomic_t		m_inodegc_reclaim;
-
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected.  Replaces xfs_etest; the
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1f82726d6265..2451f6d1690f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1849,8 +1849,6 @@ static int xfs_init_fs_context(
 	mutex_init(&mp->m_growlock);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
-	INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker);
-	atomic_set(&mp->m_inodegc_reclaim, 0);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	/*
 	 * We don't create the finobt per-ag space reservation until after log
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0579775e1e15..2c504c3e63e6 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -123,7 +123,7 @@ TRACE_EVENT(xlog_intent_recovery_failed,
 		  __entry->error, __entry->function)
 );
 
-DECLARE_EVENT_CLASS(xfs_perag_class,
+DECLARE_EVENT_CLASS(xfs_perag_ref_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, int refcount,
 		 unsigned long caller_ip),
 	TP_ARGS(mp, agno, refcount, caller_ip),
@@ -147,7 +147,7 @@ DECLARE_EVENT_CLASS(xfs_perag_class,
 );
 
 #define DEFINE_PERAG_REF_EVENT(name)	\
-DEFINE_EVENT(xfs_perag_class, name,	\
+DEFINE_EVENT(xfs_perag_ref_class, name,	\
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, int refcount,	\
 		 unsigned long caller_ip),					\
 	TP_ARGS(mp, agno, refcount, caller_ip))
@@ -189,30 +189,57 @@ DEFINE_EVENT(xfs_fs_class, name,					\
 DEFINE_FS_EVENT(xfs_inodegc_flush);
 DEFINE_FS_EVENT(xfs_inodegc_start);
 DEFINE_FS_EVENT(xfs_inodegc_stop);
-DEFINE_FS_EVENT(xfs_inodegc_worker);
-DEFINE_FS_EVENT(xfs_inodegc_throttled);
 DEFINE_FS_EVENT(xfs_fs_sync_fs);
 DEFINE_FS_EVENT(xfs_inodegc_delay_mempressure);
 
 TRACE_EVENT(xfs_inodegc_requeue_mempressure,
-	TP_PROTO(struct xfs_mount *mp, unsigned long nr, void *caller_ip),
-	TP_ARGS(mp, nr, caller_ip),
+	TP_PROTO(struct xfs_perag *pag, unsigned long nr, void *caller_ip),
+	TP_ARGS(pag, nr, caller_ip),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
 		__field(unsigned long, nr)
 		__field(void *, caller_ip)
 	),
 	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
 		__entry->nr = nr;
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d nr_to_scan %lu caller %pS",
+	TP_printk("dev %d:%d agno %u nr_to_scan %lu caller %pS",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
 		  __entry->nr,
 		  __entry->caller_ip)
 );
 
+DECLARE_EVENT_CLASS(xfs_perag_class,
+	TP_PROTO(struct xfs_perag *pag, void *caller_ip),
+	TP_ARGS(pag, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d agno %u caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->caller_ip)
+);
+
+#define DEFINE_PERAG_EVENT(name)	\
+DEFINE_EVENT(xfs_perag_class, name,					\
+	TP_PROTO(struct xfs_perag *pag, void *caller_ip), \
+	TP_ARGS(pag, caller_ip))
+DEFINE_PERAG_EVENT(xfs_inodegc_throttled);
+DEFINE_PERAG_EVENT(xfs_inodegc_worker);
+
 TRACE_EVENT(xfs_gc_delay_dquot,
 	TP_PROTO(struct xfs_dquot *dqp, unsigned int tag, unsigned int shift),
 	TP_ARGS(dqp, tag, shift),
@@ -292,55 +319,64 @@ TRACE_EVENT(xfs_gc_delay_frextents,
 );
 
 DECLARE_EVENT_CLASS(xfs_gc_queue_class,
-	TP_PROTO(struct xfs_mount *mp, unsigned int delay_ms),
-	TP_ARGS(mp, delay_ms),
+	TP_PROTO(struct xfs_perag *pag, unsigned int delay_ms),
+	TP_ARGS(pag, delay_ms),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
 		__field(unsigned int, delay_ms)
 	),
 	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
 		__entry->delay_ms = delay_ms;
 	),
-	TP_printk("dev %d:%d delay_ms %u",
+	TP_printk("dev %d:%d agno %u delay_ms %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
 		  __entry->delay_ms)
 );
 #define DEFINE_GC_QUEUE_EVENT(name)	\
 DEFINE_EVENT(xfs_gc_queue_class, name,	\
-	TP_PROTO(struct xfs_mount *mp, unsigned int delay_ms),	\
-	TP_ARGS(mp, delay_ms))
+	TP_PROTO(struct xfs_perag *pag, unsigned int delay_ms),	\
+	TP_ARGS(pag, delay_ms))
 DEFINE_GC_QUEUE_EVENT(xfs_inodegc_queue);
 
 TRACE_EVENT(xfs_gc_requeue_now,
-	TP_PROTO(struct xfs_mount *mp, unsigned int tag),
-	TP_ARGS(mp, tag),
+	TP_PROTO(struct xfs_perag *pag, unsigned int tag),
+	TP_ARGS(pag, tag),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
 		__field(unsigned int, tag)
 	),
 	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
 		__entry->tag = tag;
 	),
-	TP_printk("dev %d:%d tag %u",
+	TP_printk("dev %d:%d agno %u tag %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
 		  __entry->tag)
 );
 
 TRACE_EVENT(xfs_inodegc_throttle_mempressure,
-	TP_PROTO(struct xfs_mount *mp),
-	TP_ARGS(mp),
+	TP_PROTO(struct xfs_perag *pag),
+	TP_ARGS(pag),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
 		__field(int, votes)
 	),
 	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
-		__entry->votes = atomic_read(&mp->m_inodegc_reclaim);
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->votes = atomic_read(&pag->pag_inodegc_reclaim);
 	),
-	TP_printk("dev %d:%d votes %d",
+	TP_printk("dev %d:%d agno %u votes %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
 		  __entry->votes)
 );
 


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 15/20] xfs: reduce inactivation delay when AG free space is tight
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (13 preceding siblings ...)
  2021-07-29 18:45 ` [PATCH 14/20] xfs: parallelize inode inactivation Darrick J. Wong
@ 2021-07-29 18:45 ` Darrick J. Wong
  2021-07-29 18:45 ` [PATCH 16/20] xfs: queue inodegc worker immediately on backlog Darrick J. Wong
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:45 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Implement the same scaling down of inodegc delays when we're tight on
free space in an AG.
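
In concrete terms, an AG that has fallen below the 5% free space
threshold gets its inodegc delay cut to a quarter of the usual value,
and an AG below the 1% threshold gets it cut to 1/64th of it.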

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_mount.c  |    2 ++
 fs/xfs/xfs_mount.h  |    1 +
 fs/xfs/xfs_trace.h  |   27 ++++++++++++++++++++++++++
 4 files changed, 78 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 6e9ca483c100..17cc2ac76809 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -353,6 +353,47 @@ xfs_gc_delay_freesp(
 	return delay_ms >> shift;
 }
 
+/*
+ * Scale down the background work delay if we're low on free space in this AG.
+ * Similar to the way that we throttle preallocations, we halve the delay time
+ * for every low free space threshold that isn't met.  Return value is in ms.
+ */
+static inline unsigned int
+xfs_gc_delay_perag(
+	struct xfs_perag	*pag,
+	unsigned int		tag,
+	unsigned int		delay_ms)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+	xfs_extlen_t		freesp;
+	unsigned int		shift = 0;
+
+	if (!pag->pagf_init)
+		return delay_ms;
+
+	/* Free space in this AG that can be allocated to file data */
+	freesp = pag->pagf_freeblks + pag->pagf_flcount;
+	freesp -= (pag->pag_meta_resv.ar_reserved +
+		   pag->pag_rmapbt_resv.ar_reserved);
+
+	if (freesp < mp->m_ag_low_space[XFS_LOWSP_5_PCNT]) {
+		shift = 2;
+		if (freesp < mp->m_ag_low_space[XFS_LOWSP_4_PCNT])
+			shift++;
+		if (freesp < mp->m_ag_low_space[XFS_LOWSP_3_PCNT])
+			shift++;
+		if (freesp < mp->m_ag_low_space[XFS_LOWSP_2_PCNT])
+			shift++;
+		if (freesp < mp->m_ag_low_space[XFS_LOWSP_1_PCNT])
+			shift++;
+	}
+
+	if (shift)
+		trace_xfs_gc_delay_agfreeblks(pag, tag, shift);
+
+	return delay_ms >> shift;
+}
+
 /*
  * Compute the lag between scheduling and executing some kind of background
  * garbage collection work.  Return value is in ms.  If an inode is passed in,
@@ -360,12 +401,13 @@ xfs_gc_delay_freesp(
  */
 static inline unsigned int
 xfs_gc_delay_ms(
-	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
 	struct xfs_inode	*ip,
 	unsigned int		tag)
 {
+	struct xfs_mount	*mp = pag->pag_mount;
 	unsigned int		default_ms;
-	unsigned int		udelay, gdelay, pdelay, fdelay, rdelay;
+	unsigned int		udelay, gdelay, pdelay, fdelay, rdelay, adelay;
 
 	switch (tag) {
 	case XFS_ICI_INODEGC_TAG:
@@ -388,9 +430,11 @@ xfs_gc_delay_ms(
 	pdelay = xfs_gc_delay_dquot(ip, XFS_DQTYPE_PROJ, tag, default_ms);
 	fdelay = xfs_gc_delay_freesp(mp, tag, default_ms);
 	rdelay = xfs_gc_delay_freertx(mp, ip, tag, default_ms);
+	adelay = xfs_gc_delay_perag(pag, tag, default_ms);
 
 	udelay = min(udelay, gdelay);
 	pdelay = min(pdelay, fdelay);
+	rdelay = min(rdelay, adelay);
 
 	udelay = min(udelay, pdelay);
 
@@ -432,7 +476,7 @@ xfs_inodegc_queue(
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
 		unsigned int	delay;
 
-		delay = xfs_gc_delay_ms(mp, ip, XFS_ICI_INODEGC_TAG);
+		delay = xfs_gc_delay_ms(pag, ip, XFS_ICI_INODEGC_TAG);
 		trace_xfs_inodegc_queue(pag, delay);
 		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work,
 				msecs_to_jiffies(delay));
@@ -473,7 +517,7 @@ xfs_gc_requeue_now(
 	if (!radix_tree_tagged(&mp->m_perag_tree, tag))
 		goto unlock;
 
-	if (xfs_gc_delay_ms(mp, ip, tag) == default_ms)
+	if (xfs_gc_delay_ms(pag, ip, tag) == default_ms)
 		goto unlock;
 
 	trace_xfs_gc_requeue_now(pag, tag);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 37afb0e0d879..811ce8e9310e 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -367,6 +367,7 @@ xfs_set_low_space_thresholds(
 {
 	uint64_t		dblocks = mp->m_sb.sb_dblocks;
 	uint64_t		rtexts = mp->m_sb.sb_rextents;
+	uint32_t		agblocks = mp->m_sb.sb_agblocks / 100;
 	int			i;
 
 	do_div(dblocks, 100);
@@ -375,6 +376,7 @@ xfs_set_low_space_thresholds(
 	for (i = 0; i < XFS_LOWSP_MAX; i++) {
 		mp->m_low_space[i] = dblocks * (i + 1);
 		mp->m_low_rtexts[i] = rtexts * (i + 1);
+		mp->m_ag_low_space[i] = agblocks * (i + 1);
 	}
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index edd5c4fd6533..74ca2a458b14 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -131,6 +131,7 @@ typedef struct xfs_mount {
 	uint			m_rsumsize;	/* size of rt summary, bytes */
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
 	uint			m_qflags;	/* quota status flags */
+	int32_t			m_ag_low_space[XFS_LOWSP_MAX];
 	uint64_t		m_flags;	/* global mount flags */
 	int64_t			m_low_space[XFS_LOWSP_MAX];
 	int64_t			m_low_rtexts[XFS_LOWSP_MAX];
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 2c504c3e63e6..43fb699e6aaf 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -318,6 +318,33 @@ TRACE_EVENT(xfs_gc_delay_frextents,
 		  __entry->frextents)
 );
 
+TRACE_EVENT(xfs_gc_delay_agfreeblks,
+	TP_PROTO(struct xfs_perag *pag, unsigned int tag, unsigned int shift),
+	TP_ARGS(pag, tag, shift),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(unsigned int, freeblks)
+		__field(unsigned int, tag)
+		__field(unsigned int, shift)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->freeblks = pag->pagf_freeblks + pag->pagf_flcount;
+		__entry->freeblks -= (pag->pag_meta_resv.ar_reserved +
+				      pag->pag_rmapbt_resv.ar_reserved);
+		__entry->tag = tag;
+		__entry->shift = shift;
+	),
+	TP_printk("dev %d:%d tag %u shift %u agno %u freeblks %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tag,
+		  __entry->shift,
+		  __entry->agno,
+		  __entry->freeblks)
+);
+
 DECLARE_EVENT_CLASS(xfs_gc_queue_class,
 	TP_PROTO(struct xfs_perag *pag, unsigned int delay_ms),
 	TP_ARGS(pag, delay_ms),


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 16/20] xfs: queue inodegc worker immediately on backlog
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (14 preceding siblings ...)
  2021-07-29 18:45 ` [PATCH 15/20] xfs: reduce inactivation delay when AG free space is tight Darrick J. Wong
@ 2021-07-29 18:45 ` Darrick J. Wong
  2021-07-29 18:45 ` [PATCH 17/20] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:45 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

If an AG has hit the maximum number of inodes that it can queue for
inactivation, schedule the worker immediately.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    6 ++++++
 fs/xfs/xfs_trace.h  |    1 +
 2 files changed, 7 insertions(+)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 17cc2ac76809..513d380b8b55 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -419,6 +419,12 @@ xfs_gc_delay_ms(
 					__return_address);
 			return 0;
 		}
+
+		/* Kick the worker immediately if we've hit the max backlog. */
+		if (pag->pag_ici_needs_inactive > XFS_INODEGC_MAX_BACKLOG) {
+			trace_xfs_inodegc_delay_backlog(pag);
+			return 0;
+		}
 		break;
 	default:
 		ASSERT(0);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 43fb699e6aaf..26fc5cf08d5b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -430,6 +430,7 @@ DEFINE_EVENT(xfs_inodegc_backlog_class, name,	\
 	TP_PROTO(struct xfs_perag *pag),	\
 	TP_ARGS(pag))
 DEFINE_INODEGC_BACKLOG_EVENT(xfs_inodegc_throttle_backlog);
+DEFINE_INODEGC_BACKLOG_EVENT(xfs_inodegc_delay_backlog);
 
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 17/20] xfs: don't run speculative preallocation gc when fs is frozen
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (15 preceding siblings ...)
  2021-07-29 18:45 ` [PATCH 16/20] xfs: queue inodegc worker immediately on backlog Darrick J. Wong
@ 2021-07-29 18:45 ` Darrick J. Wong
  2021-07-29 18:45 ` [PATCH 18/20] xfs: scale speculative preallocation gc delay based on free space Darrick J. Wong
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:45 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have the infrastructure to switch background workers on and
off at will, fix the block gc worker code so that we don't actually run
the worker when the filesystem is frozen, same as we do for deferred
inactivation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |    9 +++++----
 fs/xfs/xfs_icache.c   |   38 ++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_mount.c    |    1 +
 fs/xfs/xfs_mount.h    |    7 +++++++
 fs/xfs/xfs_super.c    |    9 ++++++---
 fs/xfs/xfs_trace.h    |    4 ++++
 6 files changed, 53 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 06b697f72f23..e86854171b0c 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -893,11 +893,12 @@ xchk_start_reaping(
 	struct xfs_scrub	*sc)
 {
 	/*
-	 * Readonly filesystems do not perform inactivation, so there's no
-	 * need to restart the worker.
+	 * Readonly filesystems do not perform inactivation or speculative
+	 * preallocation, so there's no need to restart the workers.
 	 */
-	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY))
+	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY)) {
 		xfs_inodegc_start(sc->mp);
-	xfs_blockgc_start(sc->mp);
+		xfs_blockgc_start(sc->mp);
+	}
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 513d380b8b55..9b1274f25ed0 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -455,11 +455,19 @@ static inline void
 xfs_blockgc_queue(
 	struct xfs_perag	*pag)
 {
+	struct xfs_mount        *mp = pag->pag_mount;
+
+	if (!test_bit(XFS_OPFLAG_BLOCKGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
 	rcu_read_lock();
-	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG))
-		queue_delayed_work(pag->pag_mount->m_gc_workqueue,
-				   &pag->pag_blockgc_work,
-				   msecs_to_jiffies(xfs_blockgc_secs * 1000));
+	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG)) {
+		unsigned int	delay = xfs_blockgc_secs * 1000;
+
+		trace_xfs_blockgc_queue(pag, delay);
+		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_blockgc_work,
+				msecs_to_jiffies(delay));
+	}
 	rcu_read_unlock();
 }
 
@@ -1786,8 +1794,12 @@ xfs_blockgc_stop(
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		agno;
 
-	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
+	if (!test_and_clear_bit(XFS_OPFLAG_BLOCKGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
+	for_each_perag(mp, agno, pag)
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
+	trace_xfs_blockgc_stop(mp, __return_address);
 }
 
 /* Enable post-EOF and CoW block auto-reclamation. */
@@ -1798,6 +1810,10 @@ xfs_blockgc_start(
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		agno;
 
+	if (!test_and_set_bit(XFS_OPFLAG_BLOCKGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
+	trace_xfs_blockgc_start(mp, __return_address);
 	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
 		xfs_blockgc_queue(pag);
 }
@@ -1855,6 +1871,13 @@ xfs_blockgc_scan_inode(
 	unsigned int		lockflags = 0;
 	int			error;
 
+	/*
+	 * Speculative preallocation gc isn't supposed to run when the fs is
+	 * frozen because we don't want kernel threads to block on transaction
+	 * allocation.
+	 */
+	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);
+
 	error = xfs_inode_free_eofblocks(ip, icw, &lockflags);
 	if (error)
 		goto unlock;
@@ -1877,13 +1900,12 @@ xfs_blockgc_worker(
 	struct xfs_mount	*mp = pag->pag_mount;
 	int			error;
 
-	if (!sb_start_write_trylock(mp->m_super))
-		return;
+	trace_xfs_blockgc_worker(pag, __return_address);
+
 	error = xfs_icwalk_ag(pag, XFS_ICWALK_BLOCKGC, NULL);
 	if (error)
 		xfs_info(mp, "AG %u preallocation gc worker failed, err=%d",
 				pag->pag_agno, error);
-	sb_end_write(mp->m_super);
 	xfs_blockgc_queue(pag);
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 811ce8e9310e..acb1ebacdf8f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -791,6 +791,7 @@ xfs_mountfs(
 
 	/* Enable background inode inactivation workers. */
 	xfs_inodegc_start(mp);
+	xfs_blockgc_start(mp);
 
 	/*
 	 * Get and sanity-check the root inode.
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 74ca2a458b14..446c3a8f57c4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -278,6 +278,13 @@ enum xfs_opflag_bits {
 	 * waiting to be processed.
 	 */
 	XFS_OPFLAG_INODEGC_RUNNING_BIT	= 0,
+
+	/*
+	 * If set, background speculative prealloc gc worker threads will be
+	 * scheduled to process queued blockgc work.  If not, inodes retain
+	 * their preallocations until explicitly deleted.
+	 */
+	XFS_OPFLAG_BLOCKGC_RUNNING_BIT	= 1,
 };
 
 /*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2451f6d1690f..7e2df0170e51 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -881,14 +881,17 @@ xfs_fs_unfreeze(
 
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
-	xfs_blockgc_start(mp);
 
 	/*
 	 * Don't reactivate the inodegc worker on a readonly filesystem because
-	 * inodes are sent directly to reclaim.
+	 * inodes are sent directly to reclaim.  Don't reactivate the blockgc
+	 * worker because there are no speculative preallocations on a readonly
+	 * filesystem.
 	 */
-	if (!(mp->m_flags & XFS_MOUNT_RDONLY))
+	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
+		xfs_blockgc_start(mp);
 		xfs_inodegc_start(mp);
+	}
 
 	return 0;
 }
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 26fc5cf08d5b..80dc15a14173 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -191,6 +191,8 @@ DEFINE_FS_EVENT(xfs_inodegc_start);
 DEFINE_FS_EVENT(xfs_inodegc_stop);
 DEFINE_FS_EVENT(xfs_fs_sync_fs);
 DEFINE_FS_EVENT(xfs_inodegc_delay_mempressure);
+DEFINE_FS_EVENT(xfs_blockgc_start);
+DEFINE_FS_EVENT(xfs_blockgc_stop);
 
 TRACE_EVENT(xfs_inodegc_requeue_mempressure,
 	TP_PROTO(struct xfs_perag *pag, unsigned long nr, void *caller_ip),
@@ -239,6 +241,7 @@ DEFINE_EVENT(xfs_perag_class, name,					\
 	TP_ARGS(pag, caller_ip))
 DEFINE_PERAG_EVENT(xfs_inodegc_throttled);
 DEFINE_PERAG_EVENT(xfs_inodegc_worker);
+DEFINE_PERAG_EVENT(xfs_blockgc_worker);
 
 TRACE_EVENT(xfs_gc_delay_dquot,
 	TP_PROTO(struct xfs_dquot *dqp, unsigned int tag, unsigned int shift),
@@ -368,6 +371,7 @@ DEFINE_EVENT(xfs_gc_queue_class, name,	\
 	TP_PROTO(struct xfs_perag *pag, unsigned int delay_ms),	\
 	TP_ARGS(pag, delay_ms))
 DEFINE_GC_QUEUE_EVENT(xfs_inodegc_queue);
+DEFINE_GC_QUEUE_EVENT(xfs_blockgc_queue);
 
 TRACE_EVENT(xfs_gc_requeue_now,
 	TP_PROTO(struct xfs_perag *pag, unsigned int tag),


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 18/20] xfs: scale speculative preallocation gc delay based on free space
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (16 preceding siblings ...)
  2021-07-29 18:45 ` [PATCH 17/20] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
@ 2021-07-29 18:45 ` Darrick J. Wong
  2021-07-29 18:45 ` [PATCH 19/20] xfs: use background worker pool when transactions can't get " Darrick J. Wong
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:45 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have the ability to scale down the lag between scheduling
and executing background cleanup work for inode inactivations, apply the
same logic to speculative preallocation gc.  In other words, be more
proactive about trimming unused speculative preallocations if space is
low.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9b1274f25ed0..59a9526a25ff 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -426,6 +426,9 @@ xfs_gc_delay_ms(
 			return 0;
 		}
 		break;
+	case XFS_ICI_BLOCKGC_TAG:
+		default_ms = xfs_blockgc_secs * 1000;
+		break;
 	default:
 		ASSERT(0);
 		return 0;
@@ -453,7 +456,8 @@ xfs_gc_delay_ms(
  */
 static inline void
 xfs_blockgc_queue(
-	struct xfs_perag	*pag)
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip)
 {
 	struct xfs_mount        *mp = pag->pag_mount;
 
@@ -462,8 +466,9 @@ xfs_blockgc_queue(
 
 	rcu_read_lock();
 	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG)) {
-		unsigned int	delay = xfs_blockgc_secs * 1000;
+		unsigned int	delay;
 
+		delay = xfs_gc_delay_ms(pag, ip, XFS_ICI_BLOCKGC_TAG);
 		trace_xfs_blockgc_queue(pag, delay);
 		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_blockgc_work,
 				msecs_to_jiffies(delay));
@@ -519,6 +524,11 @@ xfs_gc_requeue_now(
 		default_ms = xfs_inodegc_ms;
 		opflag_bit = XFS_OPFLAG_INODEGC_RUNNING_BIT;
 		break;
+	case XFS_ICI_BLOCKGC_TAG:
+		dwork = &pag->pag_blockgc_work;
+		default_ms = xfs_blockgc_secs * 1000;
+		opflag_bit = XFS_OPFLAG_BLOCKGC_RUNNING_BIT;
+		break;
 	default:
 		return;
 	}
@@ -577,7 +587,7 @@ xfs_perag_set_inode_tag(
 		xfs_reclaim_work_queue(mp);
 		break;
 	case XFS_ICI_BLOCKGC_TAG:
-		xfs_blockgc_queue(pag);
+		xfs_blockgc_queue(pag, ip);
 		break;
 	case XFS_ICI_INODEGC_TAG:
 		xfs_inodegc_queue(pag, ip);
@@ -1815,7 +1825,7 @@ xfs_blockgc_start(
 
 	trace_xfs_blockgc_start(mp, __return_address);
 	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
-		xfs_blockgc_queue(pag);
+		xfs_blockgc_queue(pag, NULL);
 }
 
 /* Don't try to run block gc on an inode that's in any of these states. */
@@ -1906,7 +1916,7 @@ xfs_blockgc_worker(
 	if (error)
 		xfs_info(mp, "AG %u preallocation gc worker failed, err=%d",
 				pag->pag_agno, error);
-	xfs_blockgc_queue(pag);
+	xfs_blockgc_queue(pag, NULL);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 19/20] xfs: use background worker pool when transactions can't get free space
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (17 preceding siblings ...)
  2021-07-29 18:45 ` [PATCH 18/20] xfs: scale speculative preallocation gc delay based on free space Darrick J. Wong
@ 2021-07-29 18:45 ` Darrick J. Wong
  2021-07-29 18:45 ` [PATCH 20/20] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
  2021-08-02 10:35 ` [PATCHSET v8 00/20] xfs: deferred inode inactivation Dave Chinner
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:45 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
call xfs_blockgc_free_space with a NULL icwalk structure to try to free
space.  Each frontend thread that encounters this situation starts its
own walk of the inode cache to see if it can find anything, which is
wasteful since we don't have any additional selection criteria.  For
this one common case, create a function that reschedules all pending
background work immediately and flushes the workqueue so that the scan
can run in parallel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   19 +++++++++++++++++++
 fs/xfs/xfs_icache.h |    1 +
 fs/xfs/xfs_trace.h  |    1 +
 fs/xfs/xfs_trans.c  |    5 +----
 4 files changed, 22 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 59a9526a25ff..b21d1d37bcb0 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1939,6 +1939,25 @@ xfs_blockgc_free_space(
 	return xfs_icwalk(mp, XFS_ICWALK_INODEGC, icw);
 }
 
+/*
+ * Reclaim all the free space that we can by scheduling the background blockgc
+ * and inodegc workers immediately and waiting for them all to clear.
+ */
+void
+xfs_blockgc_flush_all(
+	struct xfs_mount	*mp)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	trace_xfs_blockgc_flush_all(mp, __return_address);
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
+		flush_delayed_work(&pag->pag_blockgc_work);
+
+	xfs_inodegc_flush(mp);
+}
+
 /*
  * Run cow/eofblocks scans on the supplied dquots.  We don't know exactly which
  * quota caused an allocation failure, so we make a best effort by including
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 7622efe6fd58..e256af3ed9b2 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -59,6 +59,7 @@ int xfs_blockgc_free_dquots(struct xfs_mount *mp, struct xfs_dquot *udqp,
 		unsigned int iwalk_flags);
 int xfs_blockgc_free_quota(struct xfs_inode *ip, unsigned int iwalk_flags);
 int xfs_blockgc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icm);
+void xfs_blockgc_flush_all(struct xfs_mount *mp);
 
 void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip);
 void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 80dc15a14173..b6dd25f19d24 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -193,6 +193,7 @@ DEFINE_FS_EVENT(xfs_fs_sync_fs);
 DEFINE_FS_EVENT(xfs_inodegc_delay_mempressure);
 DEFINE_FS_EVENT(xfs_blockgc_start);
 DEFINE_FS_EVENT(xfs_blockgc_stop);
+DEFINE_FS_EVENT(xfs_blockgc_flush_all);
 
 TRACE_EVENT(xfs_inodegc_requeue_mempressure,
 	TP_PROTO(struct xfs_perag *pag, unsigned long nr, void *caller_ip),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 87bffd12c20c..83abaa219616 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -295,10 +295,7 @@ xfs_trans_alloc(
 		 * Do not perform a synchronous scan because callers can hold
 		 * other locks.
 		 */
-		error = xfs_blockgc_free_space(mp, NULL);
-		if (error)
-			return error;
-
+		xfs_blockgc_flush_all(mp);
 		want_retry = false;
 		goto retry;
 	}


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 20/20] xfs: avoid buffer deadlocks when walking fs inodes
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (18 preceding siblings ...)
  2021-07-29 18:45 ` [PATCH 19/20] xfs: use background worker pool when transactions can't get " Darrick J. Wong
@ 2021-07-29 18:45 ` Darrick J. Wong
  2021-08-02 10:35 ` [PATCHSET v8 00/20] xfs: deferred inode inactivation Dave Chinner
  20 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-29 18:45 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

When we're servicing an INUMBERS or BULKSTAT request or running
quotacheck, grab an empty transaction so that we can use its inherent
recursive buffer locking abilities to detect inode btree cycles without
hitting ABBA buffer deadlocks.  This patch requires the deferred inode
inactivation patchset because xfs_irele cannot directly call
xfs_inactive when the iwalk itself has an (empty) transaction.

Found by fuzzing an inode btree pointer to introduce a cycle into the
tree (xfs/365).

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_itable.c |   42 +++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_iwalk.c  |   33 ++++++++++++++++++++++++++++-----
 2 files changed, 65 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index f331975a16de..84c17a9f9869 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -19,6 +19,7 @@
 #include "xfs_error.h"
 #include "xfs_icache.h"
 #include "xfs_health.h"
+#include "xfs_trans.h"
 
 /*
  * Bulk Stat
@@ -163,6 +164,7 @@ xfs_bulkstat_one(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error;
 
 	if (breq->mnt_userns != &init_user_ns) {
@@ -178,9 +180,18 @@ xfs_bulkstat_one(
 	if (!bc.buf)
 		return -ENOMEM;
 
-	error = xfs_bulkstat_one_int(breq->mp, breq->mnt_userns, NULL,
-				     breq->startino, &bc);
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
 
+	error = xfs_bulkstat_one_int(breq->mp, breq->mnt_userns, tp,
+			breq->startino, &bc);
+	xfs_trans_cancel(tp);
+out:
 	kmem_free(bc.buf);
 
 	/*
@@ -244,6 +255,7 @@ xfs_bulkstat(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error;
 
 	if (breq->mnt_userns != &init_user_ns) {
@@ -259,9 +271,18 @@ xfs_bulkstat(
 	if (!bc.buf)
 		return -ENOMEM;
 
-	error = xfs_iwalk(breq->mp, NULL, breq->startino, breq->flags,
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
+
+	error = xfs_iwalk(breq->mp, tp, breq->startino, breq->flags,
 			xfs_bulkstat_iwalk, breq->icount, &bc);
-
+	xfs_trans_cancel(tp);
+out:
 	kmem_free(bc.buf);
 
 	/*
@@ -374,13 +395,24 @@ xfs_inumbers(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error = 0;
 
 	if (xfs_bulkstat_already_done(breq->mp, breq->startino))
 		return 0;
 
-	error = xfs_inobt_walk(breq->mp, NULL, breq->startino, breq->flags,
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
+
+	error = xfs_inobt_walk(breq->mp, tp, breq->startino, breq->flags,
 			xfs_inumbers_walk, breq->icount, &ic);
+	xfs_trans_cancel(tp);
+out:
 
 	/*
 	 * We found some inode groups, so clear the error status and return
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 917d51eefee3..7558486f4937 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -83,6 +83,9 @@ struct xfs_iwalk_ag {
 
 	/* Skip empty inobt records? */
 	unsigned int			skip_empty:1;
+
+	/* Drop the (hopefully empty) transaction when calling iwalk_fn. */
+	unsigned int			drop_trans:1;
 };
 
 /*
@@ -352,7 +355,6 @@ xfs_iwalk_run_callbacks(
 	int				*has_more)
 {
 	struct xfs_mount		*mp = iwag->mp;
-	struct xfs_trans		*tp = iwag->tp;
 	struct xfs_inobt_rec_incore	*irec;
 	xfs_agino_t			next_agino;
 	int				error;
@@ -362,10 +364,15 @@ xfs_iwalk_run_callbacks(
 	ASSERT(iwag->nr_recs > 0);
 
 	/* Delete cursor but remember the last record we cached... */
-	xfs_iwalk_del_inobt(tp, curpp, agi_bpp, 0);
+	xfs_iwalk_del_inobt(iwag->tp, curpp, agi_bpp, 0);
 	irec = &iwag->recs[iwag->nr_recs - 1];
 	ASSERT(next_agino >= irec->ir_startino + XFS_INODES_PER_CHUNK);
 
+	if (iwag->drop_trans) {
+		xfs_trans_cancel(iwag->tp);
+		iwag->tp = NULL;
+	}
+
 	error = xfs_iwalk_ag_recs(iwag);
 	if (error)
 		return error;
@@ -376,8 +383,15 @@ xfs_iwalk_run_callbacks(
 	if (!has_more)
 		return 0;
 
+	if (iwag->drop_trans) {
+		error = xfs_trans_alloc_empty(mp, &iwag->tp);
+		if (error)
+			return error;
+	}
+
 	/* ...and recreate the cursor just past where we left off. */
-	error = xfs_inobt_cur(mp, tp, iwag->pag, XFS_BTNUM_INO, curpp, agi_bpp);
+	error = xfs_inobt_cur(mp, iwag->tp, iwag->pag, XFS_BTNUM_INO, curpp,
+			agi_bpp);
 	if (error)
 		return error;
 
@@ -390,7 +404,6 @@ xfs_iwalk_ag(
 	struct xfs_iwalk_ag		*iwag)
 {
 	struct xfs_mount		*mp = iwag->mp;
-	struct xfs_trans		*tp = iwag->tp;
 	struct xfs_perag		*pag = iwag->pag;
 	struct xfs_buf			*agi_bp = NULL;
 	struct xfs_btree_cur		*cur = NULL;
@@ -469,7 +482,7 @@ xfs_iwalk_ag(
 	error = xfs_iwalk_run_callbacks(iwag, &cur, &agi_bp, &has_more);
 
 out:
-	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
+	xfs_iwalk_del_inobt(iwag->tp, &cur, &agi_bp, error);
 	return error;
 }
 
@@ -599,8 +612,18 @@ xfs_iwalk_ag_work(
 	error = xfs_iwalk_alloc(iwag);
 	if (error)
 		goto out;
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(mp, &iwag->tp);
+	if (error)
+		goto out;
+	iwag->drop_trans = 1;
 
 	error = xfs_iwalk_ag(iwag);
+	if (iwag->tp)
+		xfs_trans_cancel(iwag->tp);
 	xfs_iwalk_free(iwag);
 out:
 	xfs_perag_put(iwag->pag);


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/20] xfs: defer inode inactivation to a workqueue
  2021-07-29 18:44 ` [PATCH 03/20] xfs: defer inode inactivation to a workqueue Darrick J. Wong
@ 2021-07-30  4:24   ` Dave Chinner
  2021-07-31  4:21     ` Darrick J. Wong
  2021-08-03  8:34   ` [PATCH, alternative] xfs: per-cpu deferred inode inactivation queues Dave Chinner
  1 sibling, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-07-30  4:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Thu, Jul 29, 2021 at 11:44:10AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> defer the inactivation phase to a separate workqueue.  With this change,
> we can speed up directory tree deletions by reducing the duration of
> unlink() calls to the directory and unlinked list updates.
> 
> By moving the inactivation work to the background, we can reduce the
> total cost of deleting a lot of files by performing the file deletions
> in disk order instead of directory entry order, which can be arbitrary.
> 
> We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
> The first flag helps our worker find inodes needing inactivation, and
> the second flag marks inodes that are in the process of being
> inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
> 
> Unfortunately, deferring the inactivation has one huge downside --
> eventual consistency.  Since all the freeing is deferred to a worker
> thread, one can rm a file but the space doesn't come back immediately.
> This can cause some odd side effects with quota accounting and statfs,
> so we flush inactivation work during syncfs in order to maintain the
> existing behaviors, at least for callers that unlink() and sync().
> 
> For this patch we'll set the delay to zero to mimic the old timing as
> much as possible; in the next patch we'll play with different delay
> settings.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
.....
> +
> +/* Disable the inode inactivation background worker and wait for it to stop. */
> +void
> +xfs_inodegc_stop(
> +	struct xfs_mount	*mp)
> +{
> +	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
> +		return;
> +
> +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> +	trace_xfs_inodegc_stop(mp, __return_address);
> +}

FWIW, this introduces a new mount field that does the same thing as the
m_opstate field I added in my feature flag cleanup series (i.e.
atomic operational state changes).  Personally I much prefer my
opstate stuff because this is state, not flags, and the namespace is
much less verbose...

There's also conflicts all over the place because of that. All the
RO checks are busted, lots of the quota mods in your tree conflict
with the sb_version_hasfeat -> has_feat conversion, etc.

We're going to have to reconcile this at some point soon...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/20] xfs: defer inode inactivation to a workqueue
  2021-07-30  4:24   ` Dave Chinner
@ 2021-07-31  4:21     ` Darrick J. Wong
  2021-08-01 21:49       ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2021-07-31  4:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Fri, Jul 30, 2021 at 02:24:00PM +1000, Dave Chinner wrote:
> On Thu, Jul 29, 2021 at 11:44:10AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> > defer the inactivation phase to a separate workqueue.  With this change,
> > we can speed up directory tree deletions by reducing the duration of
> > unlink() calls to the directory and unlinked list updates.
> > 
> > By moving the inactivation work to the background, we can reduce the
> > total cost of deleting a lot of files by performing the file deletions
> > in disk order instead of directory entry order, which can be arbitrary.
> > 
> > We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
> > The first flag helps our worker find inodes needing inactivation, and
> > the second flag marks inodes that are in the process of being
> > inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> > inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
> > 
> > Unfortunately, deferring the inactivation has one huge downside --
> > eventual consistency.  Since all the freeing is deferred to a worker
> > thread, one can rm a file but the space doesn't come back immediately.
> > This can cause some odd side effects with quota accounting and statfs,
> > so we flush inactivation work during syncfs in order to maintain the
> > existing behaviors, at least for callers that unlink() and sync().
> > 
> > For this patch we'll set the delay to zero to mimic the old timing as
> > much as possible; in the next patch we'll play with different delay
> > settings.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> .....
> > +
> > +/* Disable the inode inactivation background worker and wait for it to stop. */
> > +void
> > +xfs_inodegc_stop(
> > +	struct xfs_mount	*mp)
> > +{
> > +	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
> > +		return;
> > +
> > +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> > +	trace_xfs_inodegc_stop(mp, __return_address);
> > +}
> 
> FWIW, this introduces a new mount field that does the same thing as the
> m_opstate field I added in my feature flag cleanup series (i.e.
> atomic operational state changes).  Personally I much prefer my
> opstate stuff because this is state, not flags, and the namespace is
> much less verbose...

Yes, well, is that ready to go?  Like, right /now/?  I already bolted
the quotaoff scrapping patchset on the front, after reworking the ENOSPC
retry loops and reworking quota apis before that...

> There's also conflicts all over the place because of that. All the
> RO checks are busted,

Can we focus on /this/ patchset, then?  What specifically is broken
about the ro checking in it?

And since the shrinkers are always a source of amusement, what /is/ up
with it?  I don't really like having to feed it magic numbers just to
get it to do what I want, which is ... let it free some memory in the
first round, then we'll kick the background workers when the priority
bumps (er, decreases), and hope that's enough not to OOM the box.

--D

> lots of the quota mods in your tree conflict
> with the sb_version_hasfeat -> has_feat conversion, etc.
> 
> We're going to have to reconcile this at some point soon...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/20] xfs: defer inode inactivation to a workqueue
  2021-07-31  4:21     ` Darrick J. Wong
@ 2021-08-01 21:49       ` Dave Chinner
  2021-08-01 23:47         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-08-01 21:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Fri, Jul 30, 2021 at 09:21:12PM -0700, Darrick J. Wong wrote:
> On Fri, Jul 30, 2021 at 02:24:00PM +1000, Dave Chinner wrote:
> > On Thu, Jul 29, 2021 at 11:44:10AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> > > defer the inactivation phase to a separate workqueue.  With this change,
> > > we can speed up directory tree deletions by reducing the duration of
> > > unlink() calls to the directory and unlinked list updates.
> > > 
> > > By moving the inactivation work to the background, we can reduce the
> > > total cost of deleting a lot of files by performing the file deletions
> > > in disk order instead of directory entry order, which can be arbitrary.
> > > 
> > > We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
> > > The first flag helps our worker find inodes needing inactivation, and
> > > the second flag marks inodes that are in the process of being
> > > inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> > > inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
> > > 
> > > Unfortunately, deferring the inactivation has one huge downside --
> > > eventual consistency.  Since all the freeing is deferred to a worker
> > > thread, one can rm a file but the space doesn't come back immediately.
> > > This can cause some odd side effects with quota accounting and statfs,
> > > so we flush inactivation work during syncfs in order to maintain the
> > > existing behaviors, at least for callers that unlink() and sync().
> > > 
> > > For this patch we'll set the delay to zero to mimic the old timing as
> > > much as possible; in the next patch we'll play with different delay
> > > settings.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > .....
> > > +
> > > +/* Disable the inode inactivation background worker and wait for it to stop. */
> > > +void
> > > +xfs_inodegc_stop(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
> > > +		return;
> > > +
> > > +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> > > +	trace_xfs_inodegc_stop(mp, __return_address);
> > > +}
> > 
> > FWIW, this introduces a new mount field that does the same thing as the
> > m_opstate field I added in my feature flag cleanup series (i.e.
> > atomic operational state changes).  Personally I much prefer my
> > opstate stuff because this is state, not flags, and the namespace is
> > much less verbose...
> 
> Yes, well, is that ready to go?  Like, right /now/?  I already bolted
> the quotaoff scrapping patchset on the front, after reworking the ENOSPC
> retry loops and reworking quota apis before that...

Should be - that's why it's in my patch stack getting tested. But I
wasn't suggesting that you need to put it in first, just trying to
give you the heads up that there's a substantial conflict between
that and this patchset.

> > There's also conflicts all over the place because of that. All the
> > RO checks are busted,
> 
> Can we focus on /this/ patchset, then?  What specifically is broken
> about the ro checking in it?

Sorry, I wasn't particularly clear about that. What I meant was that
stuff like all the new RO and shutdown checks in this patchset don't
give conflicts but instead cause compilation failures. So the merge
isn't just a case of fixing conflicts, the code doesn't compile
(i.e. it is busted) after fixing all the reported merge conflicts.

> And since the shrinkers are always a source of amusement, what /is/ up
> with it?  I don't really like having to feed it magic numbers just to
> get it to do what I want, which is ... let it free some memory in the
> first round, then we'll kick the background workers when the priority
> bumps (er, decreases), and hope that's enough not to OOM the box.

Well, the shrinkers are not intended to be used as a one-shot memory
pressure notification, which is how you are trying to use them here.
They are intended to be told the amount of work that needs to be done
to free memory, and they calculate how much of that work should be
done based on their idea of the current level of memory pressure.

One-shot shrinker triggers never tend to work well because they
treat all memory pressure the same - very light memory pressure is
dealt with by the same big hammer that deals with OOM levels of
memory pressure.
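
As a sketch of the graded model (not the code in this patchset --
xfs_inodegc_inactivate_some() is an invented helper standing in for
"do a bounded amount of inactivation work now", while
pag_inodegc_shrink and pag_ici_needs_inactive are fields this series
already adds), the callbacks would report the real backlog and then
only process the slice the VM asks for:

/*
 * Sketch only.  Tell the VM how much work actually exists, then do
 * just the proportion of it that sc->nr_to_scan asks for.  In real
 * code the scan side would probably hand that quota to the background
 * worker rather than run transactions from reclaim context.
 */
static unsigned long
xfs_inodegc_shrink_count(
	struct shrinker		*shrink,
	struct shrink_control	*sc)
{
	struct xfs_perag	*pag;

	pag = container_of(shrink, struct xfs_perag, pag_inodegc_shrink);

	return READ_ONCE(pag->pag_ici_needs_inactive);
}

static unsigned long
xfs_inodegc_shrink_scan(
	struct shrinker		*shrink,
	struct shrink_control	*sc)
{
	struct xfs_perag	*pag;

	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;

	pag = container_of(shrink, struct xfs_perag, pag_inodegc_shrink);

	return xfs_inodegc_inactivate_some(pag, sc->nr_to_scan);
}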

As it is, I'm more concerned right now with finding out why there's
such large performance regressions in highly concurrent recursive
chmod/unlink workloads. I spend most of friday looking at it trying
to work out what behaviour was causing the regression, but I haven't
isolated it yet. I suspect that it is lockstepping between user
processes and background inactivation for the unlink - I'm seeing
the backlink rhashtable show up in the profiles which indicates the
unlinked list lengths are an issue and we're lockstepping the AGI.
It also may simply be that there is too much parallelism hammering
the transaction subsystem now....

IOWs, I'm basically going to have to pull this apart patch by patch
to tease out where the behaviours go wrong and see if there are ways
to avoid and mitigate those behaviours.  Hence I haven't even got to
the shrinker/oom considerations yet; there's a bigger performance
issue that needs to be understood first. It may be that they are
related, but right now we need to know why recursive chmod is
saw-toothing (it's not a lack of log space!) and concurrent unlinks
throughput has dropped by half...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/20] xfs: defer inode inactivation to a workqueue
  2021-08-01 21:49       ` Dave Chinner
@ 2021-08-01 23:47         ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-01 23:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Mon, Aug 02, 2021 at 07:49:10AM +1000, Dave Chinner wrote:
> On Fri, Jul 30, 2021 at 09:21:12PM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 30, 2021 at 02:24:00PM +1000, Dave Chinner wrote:
> > > On Thu, Jul 29, 2021 at 11:44:10AM -0700, Darrick J. Wong wrote:
> > And since the shrinkers are always a source of amusement, what /is/ up
> > with it?  I don't really like having to feed it magic numbers just to
> > get it to do what I want, which is ... let it free some memory in the
> > first round, then we'll kick the background workers when the priority
> > bumps (er, decreases), and hope that's enough not to OOM the box.
> 
> Well, the shrinkers are not intended to be used as a one-shot memory
> pressure notification, which is how you are trying to use them here.
> They are intended to be told the amount of work that needs to be done
> to free memory, and they calculate how much of that work should be
> done based on their idea of the current level of memory pressure.
> 
> One-shot shrinker triggers never tend to work well because they
> treat all memory pressure the same - very light memory pressure is
> dealt with by the same big hammer that deals with OOM levels of
> memory pressure.
> 
> As it is, I'm more concerned right now with finding out why there's
> such large performance regressions in highly concurrent recursive
> chmod/unlink workloads. I spent most of Friday looking at it trying
> to work out what behaviour was causing the regression, but I haven't
> isolated it yet.

So I pulled all of my patchsets out and I'm just looking at the
deferred inactivation changes now. rm -rf triggers a profile like:

-   68.94%     3.24%  [kernel]            [k] xlog_cil_commit
   - 65.70% xlog_cil_commit
      - 55.01% _raw_spin_lock
         - do_raw_spin_lock
              54.14% __pv_queued_spin_lock_slowpath
        2.26% memcpy_erms
      - 1.60% xfs_buf_item_committing
         - 1.57% xfs_buf_item_release
            - 0.70% xfs_buf_unlock
                 0.67% up
              0.57% xfs_buf_rele
        1.09% xfs_buf_item_format
      - 0.90% _raw_spin_unlock
         - 0.80% do_raw_spin_unlock
              0.61% __raw_callee_save___pv_queued_spin_unlock
      - 0.81% xfs_buf_item_size
           0.57% xfs_buf_item_size_segment.isra.0
        0.67% xfs_inode_item_format

And the path into here is split almost exactly 50/50 between
xfs_remove() (process context) and xfs_inactive (deferred context).

+   40.85%     0.02%  [kernel]            [k] xfs_remove
+   40.61%     0.00%  [kernel]            [k] xfs_inodegc_worker

rm -rf runtime without the patchset is 2m30s, but 3m41s with it.

So, essentially, the background inactivation increases the
concurrency through the transaction commit path and causes a massive
increase in CIL push lock contention (i.e. catastrophic lock
breakdown) which makes performance go backwards.

Nothing can be done about this except, well, merge the CIL
scalability patches in my patch stack that address this exact lock
contention problem.

> I suspect that it is lockstepping between user
> processes and background inactivation for the unlink - I'm seeing
> the backlink rhashtable show up in the profiles which indicates the
> unlinked list lengths are an issue and we're lockstepping the AGI.
> It also may simply be that there is too much parallelism hammering
> the transaction subsystem now....

The reason I've been seeing different symptoms is that my CIL
scalability patchset entirely eliminates this spinlock contention
and something else less obvious becomes the performance limiting
factor. i.e. the CPU profile looks like this instead:

+   33.33%     0.00%  [kernel]            [k] xfs_inodegc_worker
+   26.45%     5.49%  [kernel]            [k] xlog_cil_commit
+   23.13%     0.04%  [kernel]            [k] xfs_remove

And runtime is a little lower (3m20s) but still not "fast". The fact
that CPU time has gone down so much results in idle time and
indicates we are now contending on a sleeping lock, as the context
switch profile indicates:

+   36.23%     0.00%  [kernel]            [k] xfs_buf_read_map
....
+   22.99%     0.00%  [kernel]            [k] down
+   22.99%     0.00%  [kernel]            [k] __down
+   22.99%     0.00%  [kernel]            [k] xfs_read_agi
....
+   11.83%     0.00%  [kernel]            [k] xfs_imap_to_bp

Over a third of the context switches are on locked buffers, 2/3s of
tehm on AGI buffers, and the rest are mostly on inode inode cluster
buffers likely doing unlinked list modifications.

IOWs, we traded CIL push lock contention for AGI and inode cluster
buffer lock stepping between the unlink() syscall context and the
background inactivation processes.

This isn't a new problem - I've known about it for years, and I have
been working towards solving it (e.g. the single unlinked list per
AGI patches).  In effect, the problem is that we always add newly
unlinked to the head - the AGI end - of the unlinked lists, so we
must always take the AGI lock in unlink() context just to update the
unlinked list. On the inactivation side, we always have to take the
AGI because we are freeing inodes.

With deferred inactivation, the unlinked lists grow large, so we
could avoid needing to modify the AGI by adding inodes to the tail
of the unlinked list instead of the head. Unfortunately, to do this
we currently need to store 64 agino tail pointers per AGI in memory.
I might try modifying my patches to do this so I can re-order the
unlinked list without needing to change on disk format. It's not
very data cache or memory efficient, but it likely will avoid most
of the AGI contention on these workloads.
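
For illustration only, the in-memory tail tracking might look like
the sketch below.  The pagi_unlink_tails field and the
xfs_iunlink_update_next() helper are invented names, and the sketch
glosses over error handling and the ordering of the in-memory update
against the transaction:

/*
 * Sketch only.  Remember the tail inode of each of the 64 on-disk
 * unlinked buckets so new unlinked inodes can be appended to the tail
 * by logging only the previous tail's inode cluster buffer, leaving
 * the AGI bucket heads untouched.
 */
struct xfs_unlinked_tails {
	spinlock_t	lock;
	xfs_agino_t	tails[XFS_AGI_UNLINKED_BUCKETS];
};

static int
xfs_iunlink_append(
	struct xfs_trans	*tp,
	struct xfs_perag	*pag,
	struct xfs_inode	*ip)
{
	xfs_agino_t		agino = XFS_INO_TO_AGINO(pag->pag_mount,
							 ip->i_ino);
	unsigned int		bucket = agino % XFS_AGI_UNLINKED_BUCKETS;
	xfs_agino_t		prev;

	spin_lock(&pag->pagi_unlink_tails.lock);
	prev = pag->pagi_unlink_tails.tails[bucket];
	pag->pagi_unlink_tails.tails[bucket] = agino;
	spin_unlock(&pag->pagi_unlink_tails.lock);

	/* Empty bucket: fall back to the existing AGI head insert. */
	if (prev == NULLAGINO)
		return xfs_iunlink(tp, ip);

	/* Point the old tail at the new inode; no AGI update needed. */
	return xfs_iunlink_update_next(tp, pag, prev, agino);
}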

However, none of this is ready for prime time, so the next thing
I'll look at is why both foreground and background are running at
the same time instead of batching effectively. i.e. unlink context
runs a bunch of unlinks adding inodes to the unlinked lists, then
background runs doing the deferred work while foreground stays out
of the way (e.g. is throttled). This will involve looking at traces,
so I suspect it's going to take me a day or two to extract repeating
patterns and understand them well enough to determine what to
do/look at next.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/20] xfs: throttle inodegc queuing on backlog
  2021-07-29 18:44 ` [PATCH 06/20] xfs: throttle inodegc queuing on backlog Darrick J. Wong
@ 2021-08-02  0:45   ` Dave Chinner
  2021-08-02  1:30     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-08-02  0:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Thu, Jul 29, 2021 at 11:44:26AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Track the number of inodes in each AG that are queued for inactivation,
> then use that information to decide if we're going to make threads that
> have queued an inode for inactivation wait for the background thread.
> The purpose of this high water mark is to establish a maximum bound on
> the backlog of work that can accumulate on a non-frozen filesystem.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_ag.c |    1 +
>  fs/xfs/libxfs/xfs_ag.h |    3 ++-
>  fs/xfs/xfs_icache.c    |   16 ++++++++++++++++
>  fs/xfs/xfs_trace.h     |   24 ++++++++++++++++++++++++
>  4 files changed, 43 insertions(+), 1 deletion(-)

Ok, this appears to cause fairly long latencies in unlink. I see it
overrun the throttle threshold and not throttle for some time:

rm-16440 [016]  5391.083568: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 65537
rm-16440 [016]  5391.083622: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 65538
rm-16440 [016]  5391.083689: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 65539
.....
rm-16440 [016]  5391.216007: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 67193
rm-16440 [016]  5391.216069: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 67194
rm-16440 [016]  5391.216179: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 67195
rm-16440 [016]  5391.231293: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 66807

You can see from the traces above that a typical unlink() runs in
about 60-70 microseconds. Notably, when background inactivation
kicks in, that blows out to 15ms for a single unlink. We can also
see that it runs on for about 150ms after first hitting the
throttle threshold before background inactivation kicks in (we can
see the inactive count come down). The next trace from this
process is:

rm-16440 [016]  5394.335940: xfs_inodegc_throttled: dev 251:0 agno 3 caller xfs_fs_destroy_inode+0xbb

Because it now waits on flush_work() to complete the background
inactivation before it can run again. IOWs, this user process just
got blocked for over 3 seconds waiting for internal GC to do its
stuff.

This blows out the long tail latencies that userspace sees and this
will really hurt random processes that drop the last reference to
files that are going to be reclaimed immediately. (e.g. any
unlink() that is run).

There is no reason for waiting for the entire backlog to be
processed here. This really needs to be watermarked, so that when we
hit the high watermark we immediately sleep until the background
reclaim brings it back down below the low watermark.

In this case, we run about 20,000 inactivations/s, so inactivations
take about 50us to run. We want to limit the blocking of any given
process that is throttled to something controllable and practical,
e.g. 100ms, which indicates that the high and low watermarks should
be somewhere around 5000 operations apart.

So, when something hits the high watermark, it sets a "queue
throttling" bit, forces the perag gc work to run immediately, and
goes to sleep on the throttle bit. Any new operations that hit that
perag also sleep on the "queue throttle" bit. When the GC work
brings the queue down below the low watermark, it wakes all the
waiters and keeps running, allowing user processes to add to the
queue again while it is draining it.

With this sort of setup, we shouldn't need really deep queues -
maybe a few thousand inodes at most - and we guarantee that the
background GC has a period of time where it largely has exclusive
access to the AGI and inode cluster buffers to run batched
inactivation as quickly as possible. We also largely bound the length
of time that user processes block on the background GC work, and
that will be good for keeping long tail latencies under control.
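
To make the shape of that concrete, here's a rough sketch of the
wait/wake protocol. The pag_opflags field, the THROTTLED bit and the
watermark values are all made up for illustration - this is not code
from the series:

/* Hypothetical watermarks, ~5000 operations apart per the 100ms target. */
#define XFS_INODEGC_HIGH_WM	8192
#define XFS_INODEGC_LOW_WM	(XFS_INODEGC_HIGH_WM - 5000)

/* Hypothetical bit in a hypothetical unsigned long pag->pag_opflags. */
#define XFS_PAGF_INODEGC_THROTTLED	0

/* Called from unlink/irele context after queueing an inode. */
static void
xfs_inodegc_throttle(
	struct xfs_perag	*pag)
{
	if (pag->pag_ici_needs_inactive < XFS_INODEGC_HIGH_WM &&
	    !test_bit(XFS_PAGF_INODEGC_THROTTLED, &pag->pag_opflags))
		return;

	/* Kick the GC worker immediately and sleep until it drains. */
	set_bit(XFS_PAGF_INODEGC_THROTTLED, &pag->pag_opflags);
	mod_delayed_work(pag->pag_mount->m_gc_workqueue,
			&pag->pag_inodegc_work, 0);
	wait_on_bit(&pag->pag_opflags, XFS_PAGF_INODEGC_THROTTLED,
			TASK_UNINTERRUPTIBLE);
}

/* Called by the GC worker as it works down the backlog. */
static void
xfs_inodegc_unthrottle(
	struct xfs_perag	*pag)
{
	if (pag->pag_ici_needs_inactive > XFS_INODEGC_LOW_WM)
		return;
	if (test_and_clear_bit(XFS_PAGF_INODEGC_THROTTLED,
			&pag->pag_opflags))
		wake_up_bit(&pag->pag_opflags, XFS_PAGF_INODEGC_THROTTLED);
}

The exact values would obviously need tuning against the measured
inactivation rate.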

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 14/20] xfs: parallelize inode inactivation
  2021-07-29 18:45 ` [PATCH 14/20] xfs: parallelize inode inactivation Darrick J. Wong
@ 2021-08-02  0:55   ` Dave Chinner
  2021-08-02 21:33     ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-08-02  0:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Thu, Jul 29, 2021 at 11:45:10AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Split the inode inactivation work into per-AG work items so that we can
> take advantage of parallelization.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_ag.c |   12 ++++++-
>  fs/xfs/libxfs/xfs_ag.h |   10 +++++
>  fs/xfs/xfs_icache.c    |   88 ++++++++++++++++++++++++++++--------------------
>  fs/xfs/xfs_icache.h    |    2 +
>  fs/xfs/xfs_mount.c     |    9 +----
>  fs/xfs/xfs_mount.h     |    8 ----
>  fs/xfs/xfs_super.c     |    2 -
>  fs/xfs/xfs_trace.h     |   82 ++++++++++++++++++++++++++++++++-------------
>  8 files changed, 134 insertions(+), 79 deletions(-)

....

> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -420,9 +420,11 @@ xfs_blockgc_queue(
>   */
>  static void
>  xfs_inodegc_queue(
> -	struct xfs_mount        *mp,
> +	struct xfs_perag	*pag,
>  	struct xfs_inode	*ip)
>  {
> +	struct xfs_mount        *mp = pag->pag_mount;
> +
>  	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
>  		return;
>  
> @@ -431,8 +433,8 @@ xfs_inodegc_queue(
>  		unsigned int	delay;
>  
>  		delay = xfs_gc_delay_ms(mp, ip, XFS_ICI_INODEGC_TAG);
> -		trace_xfs_inodegc_queue(mp, delay);
> -		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
> +		trace_xfs_inodegc_queue(pag, delay);
> +		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work,
>  				msecs_to_jiffies(delay));
>  	}
>  	rcu_read_unlock();

I think you missed this change in xfs_inodegc_queue():

@@ -492,7 +492,7 @@ xfs_inodegc_queue(
 		return;
 
 	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
+	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INODEGC_TAG)) {
 		unsigned int    delay;
 
 		delay = xfs_gc_delay_ms(pag, ip, XFS_ICI_INODEGC_TAG);

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/20] xfs: throttle inodegc queuing on backlog
  2021-08-02  0:45   ` Dave Chinner
@ 2021-08-02  1:30     ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-02  1:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Mon, Aug 02, 2021 at 10:45:59AM +1000, Dave Chinner wrote:
> On Thu, Jul 29, 2021 at 11:44:26AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Track the number of inodes in each AG that are queued for inactivation,
> > then use that information to decide if we're going to make threads that
> > have queued an inode for inactivation wait for the background thread.
> > The purpose of this high water mark is to establish a maximum bound on
> > the backlog of work that can accumulate on a non-frozen filesystem.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_ag.c |    1 +
> >  fs/xfs/libxfs/xfs_ag.h |    3 ++-
> >  fs/xfs/xfs_icache.c    |   16 ++++++++++++++++
> >  fs/xfs/xfs_trace.h     |   24 ++++++++++++++++++++++++
> >  4 files changed, 43 insertions(+), 1 deletion(-)
> 
> Ok, this appears to cause fairly long latencies in unlink. I see it
> overrun the throttle threshold and not throttle for some time:
> 
> rm-16440 [016]  5391.083568: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 65537
> rm-16440 [016]  5391.083622: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 65538
> rm-16440 [016]  5391.083689: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 65539
> .....
> rm-16440 [016]  5391.216007: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 67193
> rm-16440 [016]  5391.216069: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 67194
> rm-16440 [016]  5391.216179: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 67195
> rm-16440 [016]  5391.231293: xfs_inodegc_throttle_backlog: dev 251:0 agno 3 needs_inactive 66807
> 
> You can see from the traces above that a typical unlink() runs in
> about 60-70 microseconds. Notably, when background inactivation
> kicks in, that blows out to 15ms for a single unlink. We can also
> see that it runs on for about 150ms after first hitting the
> throttle threshold before background inactivation kicks in (we can
> see the inactive count come down). The next trace from this
> process is:
> 
> rm-16440 [016]  5394.335940: xfs_inodegc_throttled: dev 251:0 agno 3 caller xfs_fs_destroy_inode+0xbb
> 
> Because it now waits on flush_work() to complete the background
> inactivation before it can run again. IOWs, this user process just
> got blocked for over 3 seconds waiting for internal GC to do its
> stuff.
> 
> This blows out the long tail latencies that userspace sees and this
> will really hurt random processes that drop the last reference to
> files that are going to be reclaimed immediately. (e.g. any
> unlink() that is run).
> 
> There is no reason for waiting for the entire backlog to be
> processed here. This really needs to be watermarked, so that when we
> hit the high watermark we immediately sleep until the background
> reclaim brings it back down below the low watermark.
> 
> In this case, we run about 20,000 inactivations/s, so inactivations
> take about 50us to run. We want to limit the blocking of any given
> process that is throttled to something controllable and practical,
> e.g. 100ms, which indicates that the high and low watermarks should
> be somewhere around 5000 operations apart.
> 
> So, when something hits the high watermark, it sets a "queue
> throttling" bit, forces the perag gc work to run immediately, and
> goes to sleep on the throttle bit. Any new operations that hit that
> perag also sleep on the "queue throttle" bit. When the GC work
> brings the queue down below the low watermark, it wakes all the
> waiters and keeps running, allowing user processes to add to the
> queue again while it is draining it.
> 
> With this sort of setup, we shouldn't need really deep queues -
> maybe a few thousand inodes at most - and we guarantee that the
> background GC has a period of time where it largely has exclusive
> access to the AGI and inode cluster buffers to run batched
> inactivation as quickly as possible. We also largely bound the length
> of time that user processes block on the background GC work, and
> that will be good for keeping long tail latencies under control.

So this:

@@ -753,7 +753,13 @@ xfs_inode_mark_reclaimable(
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
 
-	if (flush_inodegc && flush_work(&pag->pag_inodegc_work.work))
+	/*
+	 * XXX: throttling doesn't kick in until work is actually running.
+	 * Seeing overruns in the thousands of queued inodes, then taking
+	 * seconds to flush the entire work. Looks like this needs watermarks,
+	 * not a big workqueue flush hammer.
+	 */
+	if (flush_inodegc && flush_delayed_work(&pag->pag_inodegc_work))
 		trace_xfs_inodegc_throttled(pag, __return_address);
 
 	xfs_perag_put(pag);

Brings the unlink workload runtime down from 3m40s to 3m25s,
indicating that the throttling earlier does seem to have some
effect. It's kinda hard to really measure effectively because of all
the spinlock contention in the CIL, but it does also reduce the
userspace latencies to about 2.5-2.7s.

Dropping the backlog to 8192 (from 65536) gets rid of all the
visible stuttering in the rm -rf workload, and brings the runtime
down to 3m15s. So it definitely looks to me like smaller backlog
queue depths are more efficient but not enough by themselves to
erase the perf regression caused by added lock contention...

I'll keep digging on this - I might, at this point, just work from
the base of my CIL scalability patchset just to take the CIL lock
contention out of the picture altogether....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCHSET v8 00/20] xfs: deferred inode inactivation
  2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
                   ` (19 preceding siblings ...)
  2021-07-29 18:45 ` [PATCH 20/20] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
@ 2021-08-02 10:35 ` Dave Chinner
  20 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-02 10:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, hch

On Thu, Jul 29, 2021 at 11:43:53AM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> This patch series implements deferred inode inactivation.  Inactivation
> is what happens when an open file loses its last incore reference: if
> the file has speculative preallocations, they must be freed, and if the
> file is unlinked, all forks must be truncated, and the inode marked
> freed in the inode chunk and the inode btrees.
> 
> Currently, all of this activity is performed in frontend threads when
> the last in-memory reference is lost and/or the vfs decides to drop the
> inode.  Three complaints stem from this behavior: first, that the time
> to unlink (in the worst case) depends on both the complexity of the
> directory as well as the the number of extents in that file; second,
> that deleting a directory tree is inefficient and seeky because we free
> the inodes in readdir order, not disk order; and third, the upcoming
> online repair feature needs to be able to xfs_irele while scanning a
> filesystem in transaction context.  It cannot perform inode inactivation
> in this context because xfs does not support nested transactions.
> 
> The implementation will be familiar to those who have studied how XFS
> scans for reclaimable in-core inodes -- we create a couple more inode
> state flags to mark an inode as needing inactivation and being in the
> middle of inactivation.  When inodes need inactivation, we set
> NEED_INACTIVE in iflags, set the INACTIVE radix tree tag, and schedule a
> deferred work item.  The deferred worker runs in an unbounded workqueue,
> scanning the inode radix tree for tagged inodes to inactivate, and
> performing all the on-disk metadata updates.  Once the inode has been
> inactivated, it is left in the reclaim state and the background reclaim
> worker (or direct reclaim) will get to it eventually.
> 
> Doing the inactivations from kernel threads solves the first problem by
> constraining the amount of work done by the unlink() call to removing
> the directory entry.  It solves the third problem by moving inactivation
> to a separate process.  Because the inactivations are done in order of
> inode number, we solve the second problem by performing updates in (we
> hope) disk order.  This also decreases the amount of time it takes to
> let go of an inode cluster if we're deleting entire directory trees.
> 
> There are three big warts I can think of in this series: first, because
> the actual freeing of nlink==0 inodes is now done in the background,
> this means that the system will be busy making metadata updates for some
> time after the unlink() call returns.  This temporarily reduces
> available iops.  Second, in order to retain the behavior that deleting
> 100TB of unshared data should result in a free space gain of 100TB, the
> statvfs and quota reporting ioctls wait for inactivation to finish,
> which increases the long tail latency of those calls.  This behavior is,
> unfortunately, key to not introducing regressions in fstests.  The third
> problem is that the deferrals keep memory usage higher for longer,
> reduce opportunities to throttle the frontend when metadata load is
> heavy, and the unbounded workqueues can create transaction storms.

Yup, those transaction storms are a problem (where all the CIL
contention is coming from, I think), but there's a bigger
issue....

The problem I see is that the way the background work has been
structured means we lose all the implicit CPU affinity we end up
with by having all the objects in a directory be in the same AG.
This means that when we do work across different directories in
different processes, the filesystem objects they operate on are
largely CPU, node and AG affine. The only contention point is
typically the transaction commit (i.e. the CIL).

However, the use of the mark-and-sweep AG work technique for
inactivation breaks this implicit CPU affinity for the inactivation
work that is usually done in process context when it drops the last
reference to the file. We now just mark the inode for inactivation,
and move on to the next thing. Some time later, an unbound workqueue
thread walks across all the AGs and processes all those
inactivations. It's not affine to the process that placed the inode
in the queue, so the inactivation takes place on a mostly cold data
cache.  Hence we end up loading the data set into CPU caches a
second time to do the inactivation, rather than running on a hot
local cache.

The other problem here is that with the really deep queues (65536
inodes before throttling) we grow the unlinked lists a lot and so
there is more overhead managing them. i.e. We end up with more cold
cache misses and buffer lookups...

So, really, now that I've dug into the workings of the perag based
deferred mechanism and tried to help it regain all the lost
performance, I'm not sure that mark-and-sweep works for this
infrastructure.

To put my money where my mouth is, I just wrote a small patch on top
this series that rips out the mark-and-sweep stuff and replaces it
with a small, simple, lockless per-cpu inactivation queue. I set it
up to trigger the work to be run at 32 queued inodes, and converted
the workqueue to be bound to the CPU and have no concurrency on the
CPU. IOWs, we get a per-cpu worker thread to process the per-cpu
inactivation queue. I gutted all the throttling, all the enospc
and quota stuff, and neutered recycling of inodes that need
inactivation. Yeah, I ripped the guts out of it. But with that, the
unlink performance is largely back to the pre-inactivation rate.

I see AGI buffer lockstepping where I'd expect to see it, and
there's no additional contention on other locks because it mostly
keeps the cpu affinity of the data set being inactivated intact.
Being bound to the same CPU and only running for a short period of
time means that it pre-empts the process that released the inodes
and doesn't trigger it to be scheduled away onto a different CPU.
Hence the workload still largely runs on hot(-ish) caches and so
runs at nearly the same efficiency/IPC as the pre-inactivation code.

Hacky patch below, let me know what you think.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

xfs: hacks on deferred inodes.

From: Dave Chinner <dchinner@redhat.com>

A couple of minor bug fixes.

The real change in this is per-cpu deferred inode gc queues. It's a
fast hack to demonstrate the principle, and I'm worried I've screwed
something major up because it largely worked first go and behaved
as I expected it to.

The unlink workload is now demonstrating clear AGI
lockstepping, with a very large proportion of the inode freeing
operations hitting the AGI lock like so:

-   61.61%     0.00%  [kernel]            [k] xfs_inodegc_worker
   - xfs_inodegc_worker
      - 61.56% xfs_inodegc_inactivate
         - 61.54% xfs_inactive
            - 61.50% xfs_inactive_ifree.isra.0
               - 60.04% xfs_ifree
                  - 57.49% xfs_iunlink_remove
                     - 57.17% xfs_read_agi
                        - 57.15% xfs_trans_read_buf_map
                           - xfs_buf_read_map
                              - 57.14% xfs_buf_get_map
                                 - xfs_buf_find
                                    - 57.09% xfs_buf_lock

The context switch rate is running between 120-150k/s and unlinks
are saw-toothing between 0 and 300k/s. There is no noise in the
behaviour, the CPU usage is saw-toothing in exactly the same pattern
as the unlink rate, etc.

I've stripped out all the enospc, quota, and memory pressure trims -
it's just a small, lockless per-cpu deferred work queue that caps
runs with a depth of ~32 inodes. The queues use llist to provide a
lockless list for the multiple producer, single consumer pattern. I
probably need to change the work queues to be bound so that they run
on the same CPU the objects were queued on, and that might actually
reduce the lockstepping problems somewhat.

Ok, I forgot to bind the workqueue again. Ok, now I've done that and
limited it to one work per cpu (because there's one queue per CPU)
and all the lockstepping largely disappears....

Old code on 32-way unlink took 3m40s. This version comes in at 3m2s
for the last process to complete, with the majority of rm's
completing in 2m45s (28 of 32 processes under 2m50s). The unlink
rate is up where I'd expect it to be for most of the test
(~350,000/sec); it only tails off in the last 40-50s as the context switch
rate sky-rockets to 250-300k/s....

Overall, very promising performance and so far is much simpler. It
may need a backlog throttle, but I won't know that until I've run it
on other workloads...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ag.c |  12 +-
 fs/xfs/libxfs/xfs_ag.h |   4 -
 fs/xfs/xfs_icache.c    | 469 ++++++++++++++-----------------------------------
 fs/xfs/xfs_inode.h     |   1 +
 fs/xfs/xfs_mount.h     |  13 ++
 fs/xfs/xfs_super.c     |  42 ++++-
 6 files changed, 189 insertions(+), 352 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index f000644e5da3..125a4b1f5be5 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -173,7 +173,6 @@ __xfs_free_perag(
 	struct xfs_perag *pag = container_of(head, struct xfs_perag, rcu_head);
 
 	ASSERT(!delayed_work_pending(&pag->pag_blockgc_work));
-	ASSERT(!delayed_work_pending(&pag->pag_inodegc_work));
 	ASSERT(atomic_read(&pag->pag_ref) == 0);
 	kmem_free(pag);
 }
@@ -196,9 +195,7 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		ASSERT(pag->pag_ici_needs_inactive == 0);
 
-		unregister_shrinker(&pag->pag_inodegc_shrink);
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
-		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 
@@ -257,19 +254,14 @@ xfs_initialize_perag(
 		spin_lock_init(&pag->pagb_lock);
 		spin_lock_init(&pag->pag_state_lock);
 		INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker);
-		INIT_DELAYED_WORK(&pag->pag_inodegc_work, xfs_inodegc_worker);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		init_waitqueue_head(&pag->pagb_wait);
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
 
-		error = xfs_inodegc_register_shrinker(pag);
-		if (error)
-			goto out_remove_pag;
-
 		error = xfs_buf_hash_init(pag);
 		if (error)
-			goto out_inodegc_shrink;
+			goto out_remove_pag;
 
 		error = xfs_iunlink_init(pag);
 		if (error)
@@ -290,8 +282,6 @@ xfs_initialize_perag(
 
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
-out_inodegc_shrink:
-	unregister_shrinker(&pag->pag_inodegc_shrink);
 out_remove_pag:
 	radix_tree_delete(&mp->m_perag_tree, index);
 out_free_pag:
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 28db7fc4ebc0..b49a8757cca2 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -103,10 +103,6 @@ struct xfs_perag {
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;
 
-	/* background inode inactivation */
-	struct delayed_work	pag_inodegc_work;
-	struct shrinker		pag_inodegc_shrink;
-
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index b21d1d37bcb0..17d0289414f5 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -32,14 +32,6 @@
 #define XFS_ICI_RECLAIM_TAG	0
 /* Inode has speculative preallocations (posteof or cow) to clean. */
 #define XFS_ICI_BLOCKGC_TAG	1
-/* Inode can be inactivated. */
-#define XFS_ICI_INODEGC_TAG	2
-
-/*
- * Upper bound on the number of inodes in each AG that can be queued for
- * inactivation at any given time, to avoid monopolizing the workqueue.
- */
-#define XFS_INODEGC_MAX_BACKLOG	(1024 * XFS_INODES_PER_CHUNK)
 
 /*
  * The goal for walking incore inodes.  These can correspond with incore inode
@@ -49,7 +41,6 @@ enum xfs_icwalk_goal {
 	/* Goals directly associated with tagged inodes. */
 	XFS_ICWALK_BLOCKGC	= XFS_ICI_BLOCKGC_TAG,
 	XFS_ICWALK_RECLAIM	= XFS_ICI_RECLAIM_TAG,
-	XFS_ICWALK_INODEGC	= XFS_ICI_INODEGC_TAG,
 };
 
 #define XFS_ICWALK_NULL_TAG	(-1U)
@@ -410,22 +401,6 @@ xfs_gc_delay_ms(
 	unsigned int		udelay, gdelay, pdelay, fdelay, rdelay, adelay;
 
 	switch (tag) {
-	case XFS_ICI_INODEGC_TAG:
-		default_ms = xfs_inodegc_ms;
-
-		/* If we're in a shrinker, kick off the worker immediately. */
-		if (current->reclaim_state != NULL) {
-			trace_xfs_inodegc_delay_mempressure(mp,
-					__return_address);
-			return 0;
-		}
-
-		/* Kick the worker immediately if we've hit the max backlog. */
-		if (pag->pag_ici_needs_inactive > XFS_INODEGC_MAX_BACKLOG) {
-			trace_xfs_inodegc_delay_backlog(pag);
-			return 0;
-		}
-		break;
 	case XFS_ICI_BLOCKGC_TAG:
 		default_ms = xfs_blockgc_secs * 1000;
 		break;
@@ -476,33 +451,6 @@ xfs_blockgc_queue(
 	rcu_read_unlock();
 }
 
-/*
- * Queue a background inactivation worker if there are inodes that need to be
- * inactivated and higher level xfs code hasn't disabled the background
- * workers.
- */
-static void
-xfs_inodegc_queue(
-	struct xfs_perag	*pag,
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount        *mp = pag->pag_mount;
-
-	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
-		return;
-
-	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
-		unsigned int	delay;
-
-		delay = xfs_gc_delay_ms(pag, ip, XFS_ICI_INODEGC_TAG);
-		trace_xfs_inodegc_queue(pag, delay);
-		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work,
-				msecs_to_jiffies(delay));
-	}
-	rcu_read_unlock();
-}
-
 /*
  * Reschedule the background inactivation worker immediately if space is
  * getting tight and the worker hasn't started running yet.
@@ -519,11 +467,6 @@ xfs_gc_requeue_now(
 	unsigned int		default_ms;
 
 	switch (tag) {
-	case XFS_ICI_INODEGC_TAG:
-		dwork = &pag->pag_inodegc_work;
-		default_ms = xfs_inodegc_ms;
-		opflag_bit = XFS_OPFLAG_INODEGC_RUNNING_BIT;
-		break;
 	case XFS_ICI_BLOCKGC_TAG:
 		dwork = &pag->pag_blockgc_work;
 		default_ms = xfs_blockgc_secs * 1000;
@@ -568,8 +511,6 @@ xfs_perag_set_inode_tag(
 
 	if (tag == XFS_ICI_RECLAIM_TAG)
 		pag->pag_ici_reclaimable++;
-	else if (tag == XFS_ICI_INODEGC_TAG)
-		pag->pag_ici_needs_inactive++;
 
 	if (was_tagged) {
 		xfs_gc_requeue_now(pag, ip, tag);
@@ -589,9 +530,6 @@ xfs_perag_set_inode_tag(
 	case XFS_ICI_BLOCKGC_TAG:
 		xfs_blockgc_queue(pag, ip);
 		break;
-	case XFS_ICI_INODEGC_TAG:
-		xfs_inodegc_queue(pag, ip);
-		break;
 	}
 
 	trace_xfs_perag_set_inode_tag(mp, pag->pag_agno, tag, _RET_IP_);
@@ -619,8 +557,6 @@ xfs_perag_clear_inode_tag(
 
 	if (tag == XFS_ICI_RECLAIM_TAG)
 		pag->pag_ici_reclaimable--;
-	else if (tag == XFS_ICI_INODEGC_TAG)
-		pag->pag_ici_needs_inactive--;
 
 	if (radix_tree_tagged(&pag->pag_ici_root, tag))
 		return;
@@ -633,132 +569,6 @@ xfs_perag_clear_inode_tag(
 	trace_xfs_perag_clear_inode_tag(mp, pag->pag_agno, tag, _RET_IP_);
 }
 
-#ifdef DEBUG
-static void
-xfs_check_delalloc(
-	struct xfs_inode	*ip,
-	int			whichfork)
-{
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
-	struct xfs_bmbt_irec	got;
-	struct xfs_iext_cursor	icur;
-
-	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
-		return;
-	do {
-		if (isnullstartblock(got.br_startblock)) {
-			xfs_warn(ip->i_mount,
-	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
-				ip->i_ino,
-				whichfork == XFS_DATA_FORK ? "data" : "cow",
-				got.br_startoff, got.br_blockcount);
-		}
-	} while (xfs_iext_next_extent(ifp, &icur, &got));
-}
-#else
-#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
-#endif
-
-/*
- * Decide if we're going to throttle frontend threads that are inactivating
- * inodes so that we don't overwhelm the background workers with inodes and OOM
- * the machine.
- */
-static inline bool
-xfs_inodegc_want_throttle(
-	struct xfs_perag	*pag)
-{
-	/*
-	 * If we're in memory reclaim context, we don't want to wait for inode
-	 * inactivation to finish because it can take a very long time to
-	 * commit all the metadata updates and push the inodes through memory
-	 * reclamation.  Also, we might be the background inodegc thread.
-	 */
-	if (current->reclaim_state != NULL)
-		return false;
-
-	/* Enforce an upper bound on how many inodes can queue up. */
-	if (pag->pag_ici_needs_inactive > XFS_INODEGC_MAX_BACKLOG) {
-		trace_xfs_inodegc_throttle_backlog(pag);
-		return true;
-	}
-
-	/* Throttle if memory reclaim anywhere has triggered us. */
-	if (atomic_read(&pag->pag_inodegc_reclaim) > 0) {
-		trace_xfs_inodegc_throttle_mempressure(pag);
-		return true;
-	}
-
-	return false;
-}
-
-/*
- * We set the inode flag atomically with the radix tree tag.
- * Once we get tag lookups on the radix tree, this inode flag
- * can go away.
- */
-void
-xfs_inode_mark_reclaimable(
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	unsigned int		tag;
-	bool			need_inactive;
-	bool			flush_inodegc = false;
-
-	need_inactive = xfs_inode_needs_inactive(ip);
-	if (!need_inactive) {
-		/* Going straight to reclaim, so drop the dquots. */
-		xfs_qm_dqdetach(ip);
-
-		if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
-			xfs_check_delalloc(ip, XFS_DATA_FORK);
-			xfs_check_delalloc(ip, XFS_COW_FORK);
-			ASSERT(0);
-		}
-	}
-
-	XFS_STATS_INC(mp, vn_reclaim);
-
-	/*
-	 * We should never get here with any of the reclaim flags already set.
-	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
-
-	/*
-	 * We always use background reclaim here because even if the inode is
-	 * clean, it still may be under IO and hence we have wait for IO
-	 * completion to occur before we can reclaim the inode. The background
-	 * reclaim path handles this more efficiently than we can here, so
-	 * simply let background reclaim tear down all inodes.
-	 */
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	spin_lock(&pag->pag_ici_lock);
-	spin_lock(&ip->i_flags_lock);
-
-	if (need_inactive) {
-		trace_xfs_inode_set_need_inactive(ip);
-		ip->i_flags |= XFS_NEED_INACTIVE;
-		tag = XFS_ICI_INODEGC_TAG;
-		flush_inodegc = xfs_inodegc_want_throttle(pag);
-	} else {
-		trace_xfs_inode_set_reclaimable(ip);
-		ip->i_flags |= XFS_IRECLAIMABLE;
-		tag = XFS_ICI_RECLAIM_TAG;
-	}
-
-	xfs_perag_set_inode_tag(pag, ip, tag);
-
-	spin_unlock(&ip->i_flags_lock);
-	spin_unlock(&pag->pag_ici_lock);
-
-	if (flush_inodegc && flush_work(&pag->pag_inodegc_work.work))
-		trace_xfs_inodegc_throttled(pag, __return_address);
-
-	xfs_perag_put(pag);
-}
-
 static inline void
 xfs_inew_wait(
 	struct xfs_inode	*ip)
@@ -820,7 +630,7 @@ xfs_iget_recycle(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
-	unsigned int		tag;
+	unsigned int		tag = UINT_MAX;
 	int			error;
 
 	trace_xfs_iget_recycle(ip);
@@ -835,7 +645,6 @@ xfs_iget_recycle(
 		tag = XFS_ICI_RECLAIM_TAG;
 		ip->i_flags |= XFS_IRECLAIM;
 	} else if (ip->i_flags & XFS_NEED_INACTIVE) {
-		tag = XFS_ICI_INODEGC_TAG;
 		ip->i_flags |= XFS_INACTIVATING;
 	} else {
 		ASSERT(0);
@@ -878,7 +687,8 @@ xfs_iget_recycle(
 	 */
 	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 	ip->i_flags |= XFS_INEW;
-	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
+	if (tag != UINT_MAX)
+		xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
 	inode->i_state = I_NEW;
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -965,10 +775,17 @@ xfs_iget_cache_hit(
 	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING))
 		goto out_skip;
 
-	/* Unlinked inodes cannot be re-grabbed. */
-	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE)) {
-		error = -ENOENT;
-		goto out_error;
+	if (ip->i_flags & XFS_NEED_INACTIVE) {
+		/* Unlinked inodes cannot be re-grabbed. */
+		if (VFS_I(ip)->i_nlink == 0) {
+			error = -ENOENT;
+			goto out_error;
+		}
+		/*
+		 * XXX: need to trigger a gc list flush before we can allow
+		 * inactivated inodes past here.
+		 */
+		goto out_skip;
 	}
 
 	/*
@@ -1936,7 +1753,8 @@ xfs_blockgc_free_space(
 	if (error)
 		return error;
 
-	return xfs_icwalk(mp, XFS_ICWALK_INODEGC, icw);
+	xfs_inodegc_flush(mp);
+	return 0;
 }
 
 /*
@@ -2024,6 +1842,33 @@ xfs_blockgc_free_quota(
 			xfs_inode_dquot(ip, XFS_DQTYPE_PROJ), iwalk_flags);
 }
 
+
+#ifdef DEBUG
+static void
+xfs_check_delalloc(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	struct xfs_bmbt_irec	got;
+	struct xfs_iext_cursor	icur;
+
+	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
+		return;
+	do {
+		if (isnullstartblock(got.br_startblock)) {
+			xfs_warn(ip->i_mount,
+	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
+				ip->i_ino,
+				whichfork == XFS_DATA_FORK ? "data" : "cow",
+				got.br_startoff, got.br_blockcount);
+		}
+	} while (xfs_iext_next_extent(ifp, &icur, &got));
+}
+#else
+#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
+#endif
+
 /*
  * Inode Inactivation and Reclaimation
  * ===================================
@@ -2069,42 +1914,6 @@ xfs_blockgc_free_quota(
  * incore inode tree lock and then the inode i_flags lock, in that order.
  */
 
-/*
- * Decide if the given @ip is eligible for inactivation, and grab it if so.
- * Returns true if it's ready to go or false if we should just ignore it.
- *
- * Skip inodes that don't need inactivation or are being inactivated (or
- * recycled) by another thread.  Inodes should not be tagged for inactivation
- * while also in INEW or any reclaim state.
- *
- * Otherwise, mark this inode as being inactivated even if the fs is shut down
- * because we need xfs_inodegc_inactivate to push this inode into the reclaim
- * state.
- */
-static bool
-xfs_inodegc_igrab(
-	struct xfs_inode	*ip)
-{
-	bool			ret = false;
-
-	ASSERT(rcu_read_lock_held());
-
-	/* Check for stale RCU freed inode */
-	spin_lock(&ip->i_flags_lock);
-	if (!ip->i_ino)
-		goto out_unlock_noent;
-
-	if ((ip->i_flags & XFS_NEED_INACTIVE) &&
-	    !(ip->i_flags & XFS_INACTIVATING)) {
-		ret = true;
-		ip->i_flags |= XFS_INACTIVATING;
-	}
-
-out_unlock_noent:
-	spin_unlock(&ip->i_flags_lock);
-	return ret;
-}
-
 /*
  * Free all speculative preallocations and possibly even the inode itself.
  * This is the last chance to make changes to an otherwise unreferenced file
@@ -2112,12 +1921,10 @@ xfs_inodegc_igrab(
  */
 static void
 xfs_inodegc_inactivate(
-	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	struct xfs_icwalk	*icw)
+	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	struct xfs_perag	*pag;
 
 	/*
 	 * Inactivation isn't supposed to run when the fs is frozen because
@@ -2125,20 +1932,6 @@ xfs_inodegc_inactivate(
 	 */
 	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
 
-	/*
-	 * Foreground threads that have hit ENOSPC or EDQUOT are allowed to
-	 * pass in a icw structure to look for inodes to inactivate
-	 * immediately to free some resources.  If this inode isn't a match,
-	 * put it back on the shelf and move on.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	if (!xfs_icwalk_match(ip, icw)) {
-		ip->i_flags &= ~XFS_INACTIVATING;
-		spin_unlock(&ip->i_flags_lock);
-		return;
-	}
-	spin_unlock(&ip->i_flags_lock);
-
 	trace_xfs_inode_inactivating(ip);
 
 	xfs_inactive(ip);
@@ -2150,6 +1943,7 @@ xfs_inodegc_inactivate(
 	}
 
 	/* Schedule the inactivated inode for reclaim. */
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
 	spin_lock(&ip->i_flags_lock);
 
@@ -2157,11 +1951,11 @@ xfs_inodegc_inactivate(
 	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
 	ip->i_flags |= XFS_IRECLAIMABLE;
 
-	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_INODEGC_TAG);
 	xfs_perag_set_inode_tag(pag, ip, XFS_ICI_RECLAIM_TAG);
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
 }
 
 /* Inactivate inodes until we run out. */
@@ -2169,23 +1963,16 @@ void
 xfs_inodegc_worker(
 	struct work_struct	*work)
 {
-	struct xfs_perag	*pag = container_of(to_delayed_work(work),
-					struct xfs_perag, pag_inodegc_work);
+	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
+							work);
+	struct llist_node	*node = llist_del_all(&gc->list);
+	struct xfs_inode	*ip, *n;
 
-	/*
-	 * Inactivation never returns error codes and never fails to push a
-	 * tagged inode to reclaim.  Loop until there there's nothing left.
-	 */
-	while (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INODEGC_TAG)) {
-		trace_xfs_inodegc_worker(pag, __return_address);
-		xfs_icwalk_ag(pag, XFS_ICWALK_INODEGC, NULL);
+	gc->items = 0;
+	llist_for_each_entry_safe(ip, n, node, i_gclist) {
+		xfs_iflags_set(ip, XFS_INACTIVATING);
+		xfs_inodegc_inactivate(ip);
 	}
-
-	/*
-	 * We inactivated all the inodes we could, so disable the throttling
-	 * of new inactivations that happens when memory gets tight.
-	 */
-	atomic_set(&pag->pag_inodegc_reclaim, 0);
 }
 
 /*
@@ -2196,13 +1983,15 @@ void
 xfs_inodegc_flush(
 	struct xfs_mount	*mp)
 {
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		agno;
+	struct xfs_inodegc	*gc;
+	int			cpu;
 
 	trace_xfs_inodegc_flush(mp, __return_address);
 
-	for_each_perag_tag(mp, agno, pag, XFS_ICI_INODEGC_TAG)
-		flush_delayed_work(&pag->pag_inodegc_work);
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		flush_work(&gc->work);
+	}
 }
 
 /* Disable the inode inactivation background worker and wait for it to stop. */
@@ -2210,14 +1999,16 @@ void
 xfs_inodegc_stop(
 	struct xfs_mount	*mp)
 {
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		agno;
+	struct xfs_inodegc	*gc;
+	int			cpu;
 
 	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
 		return;
 
-	for_each_perag(mp, agno, pag)
-		cancel_delayed_work_sync(&pag->pag_inodegc_work);
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		cancel_work_sync(&gc->work);
+	}
 	trace_xfs_inodegc_stop(mp, __return_address);
 }
 
@@ -2229,85 +2020,100 @@ void
 xfs_inodegc_start(
 	struct xfs_mount	*mp)
 {
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		agno;
+	struct xfs_inodegc	*gc;
+	int			cpu;
 
 	if (test_and_set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
 		return;
 
 	trace_xfs_inodegc_start(mp, __return_address);
-	for_each_perag_tag(mp, agno, pag, XFS_ICI_INODEGC_TAG)
-		xfs_inodegc_queue(pag, NULL);
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		if (!llist_empty(&gc->list))
+			queue_work(mp->m_gc_workqueue, &gc->work);
+	}
 }
 
-/*
- * Register a phony shrinker so that we can speed up background inodegc and
- * throttle new inodegc queuing when there's memory pressure.  Inactivation
- * does not itself free any memory but it does make inodes reclaimable, which
- * eventually frees memory.  The count function, seek value, and batch value
- * are crafted to trigger the scan function any time the shrinker is not being
- * called from a background idle scan (i.e. the second time).
- */
-#define XFS_INODEGC_SHRINK_COUNT	(1UL << DEF_PRIORITY)
-#define XFS_INODEGC_SHRINK_BATCH	(LONG_MAX)
 
-static unsigned long
-xfs_inodegc_shrink_count(
-	struct shrinker		*shrink,
-	struct shrink_control	*sc)
+static void
+xfs_inodegc_queue(
+	struct xfs_inode	*ip)
 {
-	struct xfs_perag	*pag;
-
-	pag = container_of(shrink, struct xfs_perag, pag_inodegc_shrink);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_inodegc	*gc;
 
-	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INODEGC_TAG))
-		return XFS_INODEGC_SHRINK_COUNT;
+	spin_lock(&ip->i_flags_lock);
+	ip->i_flags |= XFS_NEED_INACTIVE;
+	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags)) {
+		ip->i_flags |= XFS_INACTIVATING;
+		spin_unlock(&ip->i_flags_lock);
+		xfs_inodegc_inactivate(ip);
+		return;
+	}
+	spin_unlock(&ip->i_flags_lock);
 
-	return 0;
+	gc = get_cpu_ptr(mp->m_inodegc);
+	llist_add(&ip->i_gclist, &gc->list);
+	if (++gc->items > 32)
+		queue_work(mp->m_gc_workqueue, &gc->work);
+	put_cpu_ptr(gc);
 }
 
-static unsigned long
-xfs_inodegc_shrink_scan(
-	struct shrinker		*shrink,
-	struct shrink_control	*sc)
+/*
+ * We set the inode flag atomically with the radix tree tag.
+ * Once we get tag lookups on the radix tree, this inode flag
+ * can go away.
+ */
+void
+xfs_inode_mark_reclaimable(
+	struct xfs_inode	*ip)
 {
+	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
+	bool			need_inactive;
+
+	XFS_STATS_INC(mp, vn_reclaim);
 
 	/*
-	 * Inode inactivation work requires NOFS allocations, so don't make
-	 * things worse if the caller wanted a NOFS allocation.
+	 * We should never get here with any of the reclaim flags already set.
 	 */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
 
-	pag = container_of(shrink, struct xfs_perag, pag_inodegc_shrink);
+	need_inactive = xfs_inode_needs_inactive(ip);
+	if (need_inactive) {
+		xfs_inodegc_queue(ip);
+		return;
+	}
 
-	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INODEGC_TAG)) {
-		struct xfs_mount *mp = pag->pag_mount;
+	/* Going straight to reclaim, so drop the dquots. */
+	xfs_qm_dqdetach(ip);
 
-		trace_xfs_inodegc_requeue_mempressure(pag, sc->nr_to_scan,
-				__return_address);
-		atomic_inc(&pag->pag_inodegc_reclaim);
-		mod_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, 0);
+	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
+		xfs_check_delalloc(ip, XFS_DATA_FORK);
+		xfs_check_delalloc(ip, XFS_COW_FORK);
+		ASSERT(0);
 	}
 
-	return 0;
-}
 
-/* Register a shrinker so we can accelerate inodegc and throttle queuing. */
-int
-xfs_inodegc_register_shrinker(
-	struct xfs_perag	*pag)
-{
-	struct shrinker		*shrink = &pag->pag_inodegc_shrink;
+	/*
+	 * We always use background reclaim here because even if the inode is
+	 * clean, it still may be under IO and hence we have wait for IO
+	 * completion to occur before we can reclaim the inode. The background
+	 * reclaim path handles this more efficiently than we can here, so
+	 * simply let background reclaim tear down all inodes.
+	 */
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
 
-	shrink->count_objects = xfs_inodegc_shrink_count;
-	shrink->scan_objects = xfs_inodegc_shrink_scan;
-	shrink->seeks = 0;
-	shrink->flags = SHRINKER_NONSLAB;
-	shrink->batch = XFS_INODEGC_SHRINK_BATCH;
+	trace_xfs_inode_set_reclaimable(ip);
+	ip->i_flags |= XFS_IRECLAIMABLE;
+	xfs_perag_set_inode_tag(pag, ip, XFS_ICI_RECLAIM_TAG);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
 
-	return register_shrinker(shrink);
+	xfs_perag_put(pag);
 }
 
 /* XFS Inode Cache Walking Code */
@@ -2336,8 +2142,6 @@ xfs_icwalk_igrab(
 		return xfs_blockgc_igrab(ip);
 	case XFS_ICWALK_RECLAIM:
 		return xfs_reclaim_igrab(ip, icw);
-	case XFS_ICWALK_INODEGC:
-		return xfs_inodegc_igrab(ip);
 	default:
 		return false;
 	}
@@ -2363,9 +2167,6 @@ xfs_icwalk_process_inode(
 	case XFS_ICWALK_RECLAIM:
 		xfs_reclaim_inode(ip, pag);
 		break;
-	case XFS_ICWALK_INODEGC:
-		xfs_inodegc_inactivate(ip, pag, icw);
-		break;
 	}
 	return error;
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index fa5be0d071ad..4ef0667689f3 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -27,6 +27,7 @@ typedef struct xfs_inode {
 	struct xfs_dquot	*i_udquot;	/* user dquot */
 	struct xfs_dquot	*i_gdquot;	/* group dquot */
 	struct xfs_dquot	*i_pdquot;	/* project dquot */
+	struct llist_node	i_gclist;
 
 	/* Inode location stuff */
 	xfs_ino_t		i_ino;		/* inode number (agno/agino)*/
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 83bd288d55b8..99d447aac153 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -56,6 +56,12 @@ struct xfs_error_cfg {
 	long		retry_timeout;	/* in jiffies, -1 = infinite */
 };
 
+struct xfs_inodegc {
+	struct llist_head	list;
+	struct work_struct	work;
+	int			items;
+};
+
 /*
  * The struct xfsmount layout is optimised to separate read-mostly variables
  * from variables that are frequently modified. We put the read-mostly variables
@@ -219,6 +225,13 @@ typedef struct xfs_mount {
 	uint32_t		m_generation;
 	struct mutex		m_growlock;	/* growfs mutex */
 
+	void __percpu		*m_inodegc;	/* percpu inodegc structures */
+
+	struct inodegc {
+		struct llist_head	list;
+		struct delayed_work	work;
+	}			inodegc_list;
+
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected.  Replaces xfs_etest; the
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 920ab6c3c983..0aa2af155072 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -503,8 +503,8 @@ xfs_init_mount_workqueues(
 		goto out_destroy_unwritten;
 
 	mp->m_gc_workqueue = alloc_workqueue("xfs-gc/%s",
-			WQ_SYSFS | WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM,
-			0, mp->m_super->s_id);
+			WQ_SYSFS | WQ_FREEZABLE | WQ_MEM_RECLAIM,
+			1, mp->m_super->s_id);
 	if (!mp->m_gc_workqueue)
 		goto out_destroy_reclaim;
 
@@ -1009,6 +1009,35 @@ xfs_destroy_percpu_counters(
 	percpu_counter_destroy(&mp->m_delalloc_blks);
 }
 
+static int
+xfs_inodegc_init_percpu(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	mp->m_inodegc = alloc_percpu(struct xfs_inodegc);
+	if (!mp->m_inodegc)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		init_llist_head(&gc->list);
+		gc->items = 0;
+                INIT_WORK(&gc->work, xfs_inodegc_worker);
+	}
+	return 0;
+}
+
+static void
+xfs_inodegc_free_percpu(
+	struct xfs_mount	*mp)
+{
+	if (!mp->m_inodegc)
+		return;
+	free_percpu(mp->m_inodegc);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
@@ -1025,6 +1054,7 @@ xfs_fs_put_super(
 
 	xfs_freesb(mp);
 	free_percpu(mp->m_stats.xs_stats);
+	xfs_inodegc_free_percpu(mp);
 	xfs_destroy_percpu_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
@@ -1396,11 +1426,15 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_destroy_workqueues;
 
+	error = xfs_inodegc_init_percpu(mp);
+	if (error)
+		goto out_destroy_counters;
+
 	/* Allocate stats memory before we do operations that might use it */
 	mp->m_stats.xs_stats = alloc_percpu(struct xfsstats);
 	if (!mp->m_stats.xs_stats) {
 		error = -ENOMEM;
-		goto out_destroy_counters;
+		goto out_destroy_inodegc;
 	}
 
 	error = xfs_readsb(mp, flags);
@@ -1605,6 +1639,8 @@ xfs_fs_fill_super(
 	free_percpu(mp->m_stats.xs_stats);
  out_destroy_counters:
 	xfs_destroy_percpu_counters(mp);
+ out_destroy_inodegc:
+	xfs_inodegc_free_percpu(mp);
  out_destroy_workqueues:
 	xfs_destroy_mount_workqueues(mp);
  out_close_devices:

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 14/20] xfs: parallelize inode inactivation
  2021-08-02  0:55   ` Dave Chinner
@ 2021-08-02 21:33     ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-08-02 21:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Mon, Aug 02, 2021 at 10:55:32AM +1000, Dave Chinner wrote:
> On Thu, Jul 29, 2021 at 11:45:10AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Split the inode inactivation work into per-AG work items so that we can
> > take advantage of parallelization.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_ag.c |   12 ++++++-
> >  fs/xfs/libxfs/xfs_ag.h |   10 +++++
> >  fs/xfs/xfs_icache.c    |   88 ++++++++++++++++++++++++++++--------------------
> >  fs/xfs/xfs_icache.h    |    2 +
> >  fs/xfs/xfs_mount.c     |    9 +----
> >  fs/xfs/xfs_mount.h     |    8 ----
> >  fs/xfs/xfs_super.c     |    2 -
> >  fs/xfs/xfs_trace.h     |   82 ++++++++++++++++++++++++++++++++-------------
> >  8 files changed, 134 insertions(+), 79 deletions(-)
> 
> ....
> 
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -420,9 +420,11 @@ xfs_blockgc_queue(
> >   */
> >  static void
> >  xfs_inodegc_queue(
> > -	struct xfs_mount        *mp,
> > +	struct xfs_perag	*pag,
> >  	struct xfs_inode	*ip)
> >  {
> > +	struct xfs_mount        *mp = pag->pag_mount;
> > +
> >  	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
> >  		return;
> >  
> > @@ -431,8 +433,8 @@ xfs_inodegc_queue(
> >  		unsigned int	delay;
> >  
> >  		delay = xfs_gc_delay_ms(mp, ip, XFS_ICI_INODEGC_TAG);
> > -		trace_xfs_inodegc_queue(mp, delay);
> > -		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
> > +		trace_xfs_inodegc_queue(pag, delay);
> > +		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work,
> >  				msecs_to_jiffies(delay));
> >  	}
> >  	rcu_read_unlock();
> 
> I think you missed this change in xfs_inodegc_queue():
> 
> @@ -492,7 +492,7 @@ xfs_inodegc_queue(
>  		return;
>  
>  	rcu_read_lock();
> -	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG)) {
> +	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INODEGC_TAG)) {

Yep, I've rebased this series so many times that merge conflict
resolution mutations have crept in.  Fixed; thank you. :(

(And FWIW for v9 I moved this patch to be immediately after the patch
that changes xfs to use the radix tree tags; this reduces the churn in
struct xfs_mount somewhat.)

--D

>  		unsigned int    delay;
>  
>  		delay = xfs_gc_delay_ms(pag, ip, XFS_ICI_INODEGC_TAG);
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH, alternative] xfs: per-cpu deferred inode inactivation queues
  2021-07-29 18:44 ` [PATCH 03/20] xfs: defer inode inactivation to a workqueue Darrick J. Wong
  2021-07-30  4:24   ` Dave Chinner
@ 2021-08-03  8:34   ` Dave Chinner
  2021-08-03 20:20     ` Darrick J. Wong
  2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
  1 sibling, 2 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-03  8:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch


From: Dave Chinner <dchinner@redhat.com>

Move inode inactivation to background work contexts so that it no
longer runs in the context that releases the final reference to an
inode. This will allow process work that ends up blocking on
inactivation to continue doing work while the filesystem processes
the inactivation in the background.

A typical demonstration of this is unlinking an inode with lots of
extents. The extents are removed during inactivation, so this blocks
the process that unlinked the inode from the directory structure. By
moving the inactivation to the background process, the userspace
application can keep working (e.g. unlinking the next inode in the
directory) while the inactivation work on the previous inode is
done by a different CPU.

The implementation of the queue is relatively simple. We use a
per-cpu lockless linked list (llist) to queue inodes for
inactivation without requiring serialisation mechanisms, and a work
item to allow the queue to be processed by a CPU bound worker
thread. We also keep a count of the queue depth so that we can
trigger work after a number of deferred inactivations have been
queued.

The use of a bound workqueue with a single work depth allows the
workqueue to run one work item per CPU. We queue the work item on
the CPU we are currently running on, and so this essentially gives
us affine per-cpu worker threads for the per-cpu queues. This
maintains the effective CPU affinity that occurs within XFS at the
AG level due to all objects in a directory being local to an AG.
Hence inactivation work tends to run on the same CPU that last
accessed all the objects that inactivation accesses and this
maintains hot CPU caches for unlink workloads.

A depth of 32 inodes was chosen to match the number of inodes in an
inode cluster buffer. This hopefully allows sequential
allocation/unlink behaviours to defer inactivation of all the
inodes in a single cluster buffer at a time, further helping
maintain hot CPU and buffer cache accesses while running
inactivations.

A hard per-cpu queue throttle of 256 inodes has been set to avoid
runaway queuing when inodes that take a long time to inactivate are
being processed. For example, when unlinking inodes with large
numbers of extents that can take a lot of processing to free.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---

Hi Darrick,

This is the current version of the per-cpu deferred queues updated
to replace patch 3 in this series. There are no performance
regressions that I've measured with this, and most of fstests is
passing. There are some failures that I haven't looked at yet -
g/055, g/102, g/219, g/226, g/233, and so on. These tests did not
fail with my original "hack the queue onto the end of the series"
patch - there were zero regressions from that patch so clearly some
of the fixes later in this patch series are still necessary. Or I
screwed up/missed a flush location that those tests would have
otherwise triggered. I suspect patch 19(?) that triggers an inodegc
flush from the blockgc flush at ENOSPC might be one of the missing
pieces...

Hence I don't think these failures have to do with the relative lack
of throttling, low space management or memory pressure detection.
More tests are failing on my 16GB test VM than the 512MB test VM,
and in general I haven't seen memory pressure have any impact on
this queuing mechanism at all.

I suspect that means most of the rest of the patchset is not
necessary for inodegc management. I haven't yet gone through the
remaining patches to see which ones address the failures I'm
seeing, so that's the next step here.

It would be good if you can run this through your test setups for
this patchset to see if it behaves well in those situations. If it
reproduces the same failures as I'm seeing, then maybe by the time
I'm awake again you've worked out which remaining bits of the
patchset are still required....

Cheers,

Dave.

 fs/xfs/scrub/common.c    |   7 +
 fs/xfs/xfs_icache.c      | 338 +++++++++++++++++++++++++++++++++++------------
 fs/xfs/xfs_icache.h      |   5 +
 fs/xfs/xfs_inode.h       |  20 ++-
 fs/xfs/xfs_log_recover.c |   7 +
 fs/xfs/xfs_mount.c       |  26 +++-
 fs/xfs/xfs_mount.h       |  34 ++++-
 fs/xfs/xfs_super.c       | 111 +++++++++++++++-
 fs/xfs/xfs_trace.h       |  50 ++++++-
 9 files changed, 505 insertions(+), 93 deletions(-)

diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 8558ca05e11d..06b697f72f23 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -884,6 +884,7 @@ xchk_stop_reaping(
 {
 	sc->flags |= XCHK_REAPING_DISABLED;
 	xfs_blockgc_stop(sc->mp);
+	xfs_inodegc_stop(sc->mp);
 }
 
 /* Restart background reaping of resources. */
@@ -891,6 +892,12 @@ void
 xchk_start_reaping(
 	struct xfs_scrub	*sc)
 {
+	/*
+	 * Readonly filesystems do not perform inactivation, so there's no
+	 * need to restart the worker.
+	 */
+	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY))
+		xfs_inodegc_start(sc->mp);
 	xfs_blockgc_start(sc->mp);
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 709507cc83ae..b1c2cab3c690 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -213,7 +213,7 @@ xfs_blockgc_queue(
 {
 	rcu_read_lock();
 	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG))
-		queue_delayed_work(pag->pag_mount->m_gc_workqueue,
+		queue_delayed_work(pag->pag_mount->m_blockgc_wq,
 				   &pag->pag_blockgc_work,
 				   msecs_to_jiffies(xfs_blockgc_secs * 1000));
 	rcu_read_unlock();
@@ -292,86 +292,6 @@ xfs_perag_clear_inode_tag(
 	trace_xfs_perag_clear_inode_tag(mp, pag->pag_agno, tag, _RET_IP_);
 }
 
-#ifdef DEBUG
-static void
-xfs_check_delalloc(
-	struct xfs_inode	*ip,
-	int			whichfork)
-{
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
-	struct xfs_bmbt_irec	got;
-	struct xfs_iext_cursor	icur;
-
-	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
-		return;
-	do {
-		if (isnullstartblock(got.br_startblock)) {
-			xfs_warn(ip->i_mount,
-	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
-				ip->i_ino,
-				whichfork == XFS_DATA_FORK ? "data" : "cow",
-				got.br_startoff, got.br_blockcount);
-		}
-	} while (xfs_iext_next_extent(ifp, &icur, &got));
-}
-#else
-#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
-#endif
-
-/*
- * We set the inode flag atomically with the radix tree tag.
- * Once we get tag lookups on the radix tree, this inode flag
- * can go away.
- */
-void
-xfs_inode_mark_reclaimable(
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	bool			need_inactive = xfs_inode_needs_inactive(ip);
-
-	if (!need_inactive) {
-		/* Going straight to reclaim, so drop the dquots. */
-		xfs_qm_dqdetach(ip);
-	} else {
-		xfs_inactive(ip);
-	}
-
-	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
-		xfs_check_delalloc(ip, XFS_DATA_FORK);
-		xfs_check_delalloc(ip, XFS_COW_FORK);
-		ASSERT(0);
-	}
-
-	XFS_STATS_INC(mp, vn_reclaim);
-
-	/*
-	 * We should never get here with one of the reclaim flags already set.
-	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
-
-	/*
-	 * We always use background reclaim here because even if the inode is
-	 * clean, it still may be under IO and hence we have wait for IO
-	 * completion to occur before we can reclaim the inode. The background
-	 * reclaim path handles this more efficiently than we can here, so
-	 * simply let background reclaim tear down all inodes.
-	 */
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	spin_lock(&pag->pag_ici_lock);
-	spin_lock(&ip->i_flags_lock);
-
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
-
-	spin_unlock(&ip->i_flags_lock);
-	spin_unlock(&pag->pag_ici_lock);
-	xfs_perag_put(pag);
-}
-
 static inline void
 xfs_inew_wait(
 	struct xfs_inode	*ip)
@@ -569,6 +489,15 @@ xfs_iget_cache_hit(
 	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM))
 		goto out_skip;
 
+	if (ip->i_flags & XFS_NEED_INACTIVE) {
+		/* Unlinked inodes cannot be re-grabbed. */
+		if (VFS_I(ip)->i_nlink == 0) {
+			error = -ENOENT;
+			goto out_error;
+		}
+		goto out_inodegc_flush;
+	}
+
 	/*
 	 * Check the inode free state is valid. This also detects lookup
 	 * racing with unlinks.
@@ -616,6 +545,12 @@ xfs_iget_cache_hit(
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 	return error;
+
+out_inodegc_flush:
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
+	xfs_inodegc_flush(mp);
+	return -EAGAIN;
 }
 
 static int
@@ -943,6 +878,7 @@ xfs_reclaim_inode(
 
 	xfs_iflags_clear(ip, XFS_IFLUSHING);
 reclaim:
+	trace_xfs_inode_reclaiming(ip);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1420,6 +1356,8 @@ xfs_blockgc_start(
 
 /* Don't try to run block gc on an inode that's in any of these states. */
 #define XFS_BLOCKGC_NOGRAB_IFLAGS	(XFS_INEW | \
+					 XFS_NEED_INACTIVE | \
+					 XFS_INACTIVATING | \
 					 XFS_IRECLAIMABLE | \
 					 XFS_IRECLAIM)
 /*
@@ -1794,3 +1732,241 @@ xfs_icwalk(
 	return last_error;
 	BUILD_BUG_ON(XFS_ICWALK_PRIVATE_FLAGS & XFS_ICWALK_FLAGS_VALID);
 }
+
+#ifdef DEBUG
+static void
+xfs_check_delalloc(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	struct xfs_bmbt_irec	got;
+	struct xfs_iext_cursor	icur;
+
+	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
+		return;
+	do {
+		if (isnullstartblock(got.br_startblock)) {
+			xfs_warn(ip->i_mount,
+	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
+				ip->i_ino,
+				whichfork == XFS_DATA_FORK ? "data" : "cow",
+				got.br_startoff, got.br_blockcount);
+		}
+	} while (xfs_iext_next_extent(ifp, &icur, &got));
+}
+#else
+#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
+#endif
+
+/* Schedule the inode for reclaim. */
+static void
+xfs_inodegc_set_reclaimable(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount        *mp = ip->i_mount;
+	struct xfs_perag	*pag;
+
+	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
+		xfs_check_delalloc(ip, XFS_DATA_FORK);
+		xfs_check_delalloc(ip, XFS_COW_FORK);
+		ASSERT(0);
+	}
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	trace_xfs_inode_set_reclaimable(ip);
+	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
+	ip->i_flags |= XFS_IRECLAIMABLE;
+	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
+				XFS_ICI_RECLAIM_TAG);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
+}
+
+/*
+ * Free all speculative preallocations and possibly even the inode itself.
+ * This is the last chance to make changes to an otherwise unreferenced file
+ * before incore reclamation happens.
+ */
+static void
+xfs_inodegc_inactivate(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount        *mp = ip->i_mount;
+
+	/*
+	 * Inactivation isn't supposed to run when the fs is frozen because
+	 * we don't want kernel threads to block on transaction allocation.
+	 */
+	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
+
+	trace_xfs_inode_inactivating(ip);
+	xfs_inactive(ip);
+	xfs_inodegc_set_reclaimable(ip);
+}
+
+void
+xfs_inodegc_worker(
+	struct work_struct	*work)
+{
+	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
+							work);
+	struct llist_node	*node = llist_del_all(&gc->list);
+	struct xfs_inode	*ip, *n;
+
+	trace_xfs_inodegc_worker(NULL, __return_address);
+
+	WRITE_ONCE(gc->items, 0);
+	llist_for_each_entry_safe(ip, n, node, i_gclist) {
+		xfs_iflags_set(ip, XFS_INACTIVATING);
+		xfs_inodegc_inactivate(ip);
+	}
+}
+
+/*
+ * Force all currently queued inode inactivation work to run immediately, and
+ * wait for the work to finish. Two passes: queue all the work in a first
+ * pass, then wait for it in a second pass.
+ */
+void
+xfs_inodegc_flush(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	trace_xfs_inodegc_flush(mp, __return_address);
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+	}
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		flush_work(&gc->work);
+	}
+}
+
+/*
+ * Flush all the pending work and then disable the inode inactivation background
+ * workers and wait for them to stop.
+ */
+void
+xfs_inodegc_stop(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
+	xfs_inodegc_flush(mp);
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		cancel_work_sync(&gc->work);
+	}
+	trace_xfs_inodegc_stop(mp, __return_address);
+}
+
+/*
+ * Enable the inode inactivation background workers and schedule deferred inode
+ * inactivation work if there is any.
+ */
+void
+xfs_inodegc_start(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	if (test_and_set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
+	trace_xfs_inodegc_start(mp, __return_address);
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		if (!llist_empty(&gc->list))
+			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+	}
+}
+
+/*
+ * Queue a background inactivation worker if there are inodes that need to be
+ * inactivated and higher level xfs code hasn't disabled the background
+ * workers.
+ */
+static void
+xfs_inodegc_queue(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_inodegc	*gc;
+	int			items;
+
+	trace_xfs_inode_set_need_inactive(ip);
+	spin_lock(&ip->i_flags_lock);
+	ip->i_flags |= XFS_NEED_INACTIVE;
+	spin_unlock(&ip->i_flags_lock);
+
+	gc = get_cpu_ptr(mp->m_inodegc);
+	llist_add(&ip->i_gclist, &gc->list);
+	items = READ_ONCE(gc->items);
+	WRITE_ONCE(gc->items, items + 1);
+	put_cpu_ptr(gc);
+
+	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+	if (items > 32) {
+		trace_xfs_inodegc_queue(mp, __return_address);
+		queue_work(mp->m_inodegc_wq, &gc->work);
+	}
+	/* throttle */
+	if (items > 256) {
+		trace_xfs_inodegc_throttle(mp, __return_address);
+		flush_work(&gc->work);
+	}
+}
+
+/*
+ * We set the inode flag atomically with the radix tree tag.  Once we get tag
+ * lookups on the radix tree, this inode flag can go away.
+ *
+ * We always use background reclaim here because even if the inode is clean, it
+ * still may be under IO and hence we have to wait for IO completion to occur
+ * before we can reclaim the inode. The background reclaim path handles this
+ * more efficiently than we can here, so simply let background reclaim tear down
+ * all inodes.
+ */
+void
+xfs_inode_mark_reclaimable(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	bool			need_inactive;
+
+	XFS_STATS_INC(mp, vn_reclaim);
+
+	/*
+	 * We should never get here with any of the reclaim flags already set.
+	 */
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
+
+	need_inactive = xfs_inode_needs_inactive(ip);
+	if (need_inactive) {
+		xfs_inodegc_queue(ip);
+		return;
+	}
+
+	/* Going straight to reclaim, so drop the dquots. */
+	xfs_qm_dqdetach(ip);
+	xfs_inodegc_set_reclaimable(ip);
+}
+
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index d0062ebb3f7a..c1dfc909a5b0 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -74,4 +74,9 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
+void xfs_inodegc_worker(struct work_struct *work);
+void xfs_inodegc_flush(struct xfs_mount *mp);
+void xfs_inodegc_stop(struct xfs_mount *mp);
+void xfs_inodegc_start(struct xfs_mount *mp);
+
 #endif
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e3137bbc7b14..1f62b481d8c5 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -42,6 +42,7 @@ typedef struct xfs_inode {
 	mrlock_t		i_lock;		/* inode lock */
 	mrlock_t		i_mmaplock;	/* inode mmap IO lock */
 	atomic_t		i_pincount;	/* inode pin count */
+	struct llist_node	i_gclist;	/* deferred inactivation list */
 
 	/*
 	 * Bitsets of inode metadata that have been checked and/or are sick.
@@ -240,6 +241,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
+#define XFS_NEED_INACTIVE	(1 << 10) /* see XFS_INACTIVATING below */
 /*
  * If this unlinked inode is in the middle of recovery, don't let drop_inode
  * truncate and free the inode.  This can happen if we iget the inode during
@@ -248,6 +250,21 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define XFS_IRECOVERY		(1 << 11)
 #define XFS_ICOWBLOCKS		(1 << 12)/* has the cowblocks tag set */
 
+/*
+ * If we need to update on-disk metadata before this IRECLAIMABLE inode can be
+ * freed, then NEED_INACTIVE will be set.  Once we start the updates, the
+ * INACTIVATING bit will be set to keep iget away from this inode.  After the
+ * inactivation completes, both flags will be cleared and the inode is a
+ * plain old IRECLAIMABLE inode.
+ */
+#define XFS_INACTIVATING	(1 << 13)
+
+/* All inode state flags related to inode reclaim. */
+#define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
+				 XFS_IRECLAIM | \
+				 XFS_NEED_INACTIVE | \
+				 XFS_INACTIVATING)
+
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
  * inode lookup. This prevents unintended behaviour on the new inode from
@@ -255,7 +272,8 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
-	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
+	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
+	 XFS_INACTIVATING)
 
 /*
  * Flags for inode locking.
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1721fce2ec94..a98d2429d795 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2786,6 +2786,13 @@ xlog_recover_process_iunlinks(
 		}
 		xfs_buf_rele(agibp);
 	}
+
+	/*
+	 * Flush the pending unlinked inodes to ensure that the inactivations
+	 * are fully completed on disk and the incore inodes can be reclaimed
+	 * before we signal that recovery is complete.
+	 */
+	xfs_inodegc_flush(mp);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index baf7b323cb15..1f7e9a608f38 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -514,7 +514,8 @@ xfs_check_summary_counts(
  * Flush and reclaim dirty inodes in preparation for unmount. Inodes and
  * internal inode structures can be sitting in the CIL and AIL at this point,
  * so we need to unpin them, write them back and/or reclaim them before unmount
- * can proceed.
+ * can proceed.  In other words, callers are required to have inactivated all
+ * inodes.
  *
  * An inode cluster that has been freed can have its buffer still pinned in
  * memory because the transaction is still sitting in a iclog. The stale inodes
@@ -546,6 +547,7 @@ xfs_unmount_flush_inodes(
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
 
 	xfs_ail_push_all_sync(mp->m_ail);
+	xfs_inodegc_stop(mp);
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
@@ -782,6 +784,9 @@ xfs_mountfs(
 	if (error)
 		goto out_log_dealloc;
 
+	/* Enable background inode inactivation workers. */
+	xfs_inodegc_start(mp);
+
 	/*
 	 * Get and sanity-check the root inode.
 	 * Save the pointer to it in the mount structure.
@@ -942,6 +947,15 @@ xfs_mountfs(
 	xfs_irele(rip);
 	/* Clean out dquots that might be in memory after quotacheck. */
 	xfs_qm_unmount(mp);
+
+	/*
+	 * Inactivate all inodes that might still be in memory after a log
+	 * intent recovery failure so that reclaim can free them.  Metadata
+	 * inodes and the root directory shouldn't need inactivation, but the
+	 * mount failed for some reason, so pull down all the state and flee.
+	 */
+	xfs_inodegc_flush(mp);
+
 	/*
 	 * Flush all inode reclamation work and flush the log.
 	 * We have to do this /after/ rtunmount and qm_unmount because those
@@ -989,6 +1003,16 @@ xfs_unmountfs(
 	uint64_t		resblks;
 	int			error;
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes
+	 * that the VFS evicted earlier in the unmount process.  Freeing inodes
+	 * and discarding CoW fork preallocations can cause shape changes to
+	 * the free inode and refcount btrees, respectively, so we must finish
+	 * this before we discard the metadata space reservations.  Metadata
+	 * inodes and the root directory do not require inactivation.
+	 */
+	xfs_inodegc_flush(mp);
+
 	xfs_blockgc_stop(mp);
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c78b63fe779a..470013a48c17 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -56,6 +56,15 @@ struct xfs_error_cfg {
 	long		retry_timeout;	/* in jiffies, -1 = infinite */
 };
 
+/*
+ * Per-cpu deferred inode inactivation GC lists.
+ */
+struct xfs_inodegc {
+	struct llist_head	list;
+	struct work_struct	work;
+	int			items;
+};
+
 /*
  * The struct xfsmount layout is optimised to separate read-mostly variables
  * from variables that are frequently modified. We put the read-mostly variables
@@ -82,6 +91,8 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+	void __percpu		*m_inodegc;	/* percpu inodegc structures */
+
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
@@ -94,8 +105,9 @@ typedef struct xfs_mount {
 	struct workqueue_struct	*m_unwritten_workqueue;
 	struct workqueue_struct	*m_cil_workqueue;
 	struct workqueue_struct	*m_reclaim_workqueue;
-	struct workqueue_struct *m_gc_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
+	struct workqueue_struct *m_blockgc_wq;
+	struct workqueue_struct *m_inodegc_wq;
 
 	int			m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
@@ -154,6 +166,13 @@ typedef struct xfs_mount {
 	uint8_t			m_rt_checked;
 	uint8_t			m_rt_sick;
 
+	/*
+	 * This atomic bitset controls flags that alter the behavior of the
+	 * filesystem.  Use only the atomic bit helper functions here; see
+	 * XFS_OPFLAG_* for information about the actual flags.
+	 */
+	unsigned long		m_opflags;
+
 	/*
 	 * End of read-mostly variables. Frequently written variables and locks
 	 * should be placed below this comment from now on. The first variable
@@ -258,6 +277,19 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_DAX_ALWAYS	(1ULL << 26)
 #define XFS_MOUNT_DAX_NEVER	(1ULL << 27)
 
+/*
+ * Operation flags -- each entry here is a bit index into m_opflags and is
+ * not itself a flag value.  Use the atomic bit functions to access.
+ */
+enum xfs_opflag_bits {
+	/*
+	 * If set, background inactivation worker threads will be scheduled to
+	 * process queued inodegc work.  If not, queued inodes remain in memory
+	 * waiting to be processed.
+	 */
+	XFS_OPFLAG_INODEGC_RUNNING_BIT	= 0,
+};
+
 /*
  * Max and min values for mount-option defined I/O
  * preallocation sizes.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ef89a9a3ba9e..913d54eb4929 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -508,21 +508,29 @@ xfs_init_mount_workqueues(
 	if (!mp->m_reclaim_workqueue)
 		goto out_destroy_cil;
 
-	mp->m_gc_workqueue = alloc_workqueue("xfs-gc/%s",
+	mp->m_blockgc_wq = alloc_workqueue("xfs-blockgc/%s",
 			WQ_SYSFS | WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM,
 			0, mp->m_super->s_id);
-	if (!mp->m_gc_workqueue)
+	if (!mp->m_blockgc_wq)
 		goto out_destroy_reclaim;
 
+	mp->m_inodegc_wq = alloc_workqueue("xfs-inodegc/%s",
+			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			1, mp->m_super->s_id);
+	if (!mp->m_inodegc_wq)
+		goto out_destroy_blockgc;
+
 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE), 0, mp->m_super->s_id);
 	if (!mp->m_sync_workqueue)
-		goto out_destroy_eofb;
+		goto out_destroy_inodegc;
 
 	return 0;
 
-out_destroy_eofb:
-	destroy_workqueue(mp->m_gc_workqueue);
+out_destroy_inodegc:
+	destroy_workqueue(mp->m_inodegc_wq);
+out_destroy_blockgc:
+	destroy_workqueue(mp->m_blockgc_wq);
 out_destroy_reclaim:
 	destroy_workqueue(mp->m_reclaim_workqueue);
 out_destroy_cil:
@@ -540,7 +548,8 @@ xfs_destroy_mount_workqueues(
 	struct xfs_mount	*mp)
 {
 	destroy_workqueue(mp->m_sync_workqueue);
-	destroy_workqueue(mp->m_gc_workqueue);
+	destroy_workqueue(mp->m_blockgc_wq);
+	destroy_workqueue(mp->m_inodegc_wq);
 	destroy_workqueue(mp->m_reclaim_workqueue);
 	destroy_workqueue(mp->m_cil_workqueue);
 	destroy_workqueue(mp->m_unwritten_workqueue);
@@ -702,6 +711,8 @@ xfs_fs_sync_fs(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	trace_xfs_fs_sync_fs(mp, __return_address);
+
 	/*
 	 * Doing anything during the async pass would be counterproductive.
 	 */
@@ -718,6 +729,25 @@ xfs_fs_sync_fs(
 		flush_delayed_work(&mp->m_log->l_work);
 	}
 
+	/*
+	 * Flush all deferred inode inactivation work so that the free space
+	 * counters will reflect recent deletions.  Do not force the log again
+	 * because log recovery can restart the inactivation from the info that
+	 * we just wrote into the ondisk log.
+	 *
+	 * For regular operation this isn't strictly necessary since we aren't
+	 * required to guarantee that unlinking frees space immediately, but
+	 * that is how XFS historically behaved.
+	 *
+	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
+	 * last chance to complete the inactivation work before the filesystem
+	 * freezes and the log is quiesced.  The background worker will not
+	 * activate again until the fs is thawed because the VFS won't evict
+	 * any more inodes until freeze_super drops s_umount and we disable the
+	 * worker in xfs_fs_freeze.
+	 */
+	xfs_inodegc_flush(mp);
+
 	return 0;
 }
 
@@ -832,6 +862,17 @@ xfs_fs_freeze(
 	 */
 	flags = memalloc_nofs_save();
 	xfs_blockgc_stop(mp);
+
+	/*
+	 * Stop the inodegc background worker.  freeze_super already flushed
+	 * all pending inodegc work when it sync'd the filesystem after setting
+	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
+	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
+	 * If the filesystem is read-write, inactivated inodes will queue but
+	 * the worker will not run until the filesystem thaws or unmounts.
+	 */
+	xfs_inodegc_stop(mp);
+
 	xfs_save_resvblks(mp);
 	ret = xfs_log_quiesce(mp);
 	memalloc_nofs_restore(flags);
@@ -847,6 +888,14 @@ xfs_fs_unfreeze(
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
 	xfs_blockgc_start(mp);
+
+	/*
+	 * Don't reactivate the inodegc worker on a readonly filesystem because
+	 * inodes are sent directly to reclaim.
+	 */
+	if (!(mp->m_flags & XFS_MOUNT_RDONLY))
+		xfs_inodegc_start(mp);
+
 	return 0;
 }
 
@@ -972,6 +1021,35 @@ xfs_destroy_percpu_counters(
 	percpu_counter_destroy(&mp->m_delalloc_blks);
 }
 
+static int
+xfs_inodegc_init_percpu(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	mp->m_inodegc = alloc_percpu(struct xfs_inodegc);
+	if (!mp->m_inodegc)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		init_llist_head(&gc->list);
+		gc->items = 0;
+		INIT_WORK(&gc->work, xfs_inodegc_worker);
+	}
+	return 0;
+}
+
+static void
+xfs_inodegc_free_percpu(
+	struct xfs_mount	*mp)
+{
+	if (!mp->m_inodegc)
+		return;
+	free_percpu(mp->m_inodegc);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
@@ -988,6 +1066,7 @@ xfs_fs_put_super(
 
 	xfs_freesb(mp);
 	free_percpu(mp->m_stats.xs_stats);
+	xfs_inodegc_free_percpu(mp);
 	xfs_destroy_percpu_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
@@ -1359,11 +1438,15 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_destroy_workqueues;
 
+	error = xfs_inodegc_init_percpu(mp);
+	if (error)
+		goto out_destroy_counters;
+
 	/* Allocate stats memory before we do operations that might use it */
 	mp->m_stats.xs_stats = alloc_percpu(struct xfsstats);
 	if (!mp->m_stats.xs_stats) {
 		error = -ENOMEM;
-		goto out_destroy_counters;
+		goto out_destroy_inodegc;
 	}
 
 	error = xfs_readsb(mp, flags);
@@ -1566,6 +1649,8 @@ xfs_fs_fill_super(
 	xfs_freesb(mp);
  out_free_stats:
 	free_percpu(mp->m_stats.xs_stats);
+ out_destroy_inodegc:
+	xfs_inodegc_free_percpu(mp);
  out_destroy_counters:
 	xfs_destroy_percpu_counters(mp);
  out_destroy_workqueues:
@@ -1649,6 +1734,9 @@ xfs_remount_rw(
 	if (error && error != -ENOSPC)
 		return error;
 
+	/* Re-enable the background inode inactivation worker. */
+	xfs_inodegc_start(mp);
+
 	return 0;
 }
 
@@ -1671,6 +1759,15 @@ xfs_remount_ro(
 		return error;
 	}
 
+	/*
+	 * Stop the inodegc background worker.  xfs_fs_reconfigure already
+	 * flushed all pending inodegc work when it sync'd the filesystem.
+	 * The VFS holds s_umount, so we know that inodes cannot enter
+	 * xfs_fs_destroy_inode during a remount operation.  In readonly mode
+	 * we send inodes straight to reclaim, so no inodes will be queued.
+	 */
+	xfs_inodegc_stop(mp);
+
 	/* Free the per-AG metadata reservation pool. */
 	error = xfs_fs_unreserve_ag_blocks(mp);
 	if (error) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 19260291ff8b..c2fac46a029b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -157,6 +157,45 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
 
+DECLARE_EVENT_CLASS(xfs_fs_class,
+	TP_PROTO(struct xfs_mount *mp, void *caller_ip),
+	TP_ARGS(mp, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, mflags)
+		__field(unsigned long, opflags)
+		__field(unsigned long, sbflags)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		if (mp) {
+			__entry->dev = mp->m_super->s_dev;
+			__entry->mflags = mp->m_flags;
+			__entry->opflags = mp->m_opflags;
+			__entry->sbflags = mp->m_super->s_flags;
+		}
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d m_flags 0x%llx m_opflags 0x%lx s_flags 0x%lx caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->mflags,
+		  __entry->opflags,
+		  __entry->sbflags,
+		  __entry->caller_ip)
+);
+
+#define DEFINE_FS_EVENT(name)	\
+DEFINE_EVENT(xfs_fs_class, name,					\
+	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
+	TP_ARGS(mp, caller_ip))
+DEFINE_FS_EVENT(xfs_inodegc_flush);
+DEFINE_FS_EVENT(xfs_inodegc_start);
+DEFINE_FS_EVENT(xfs_inodegc_stop);
+DEFINE_FS_EVENT(xfs_inodegc_worker);
+DEFINE_FS_EVENT(xfs_inodegc_queue);
+DEFINE_FS_EVENT(xfs_inodegc_throttle);
+DEFINE_FS_EVENT(xfs_fs_sync_fs);
+
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
 	TP_ARGS(mp, agno),
@@ -616,14 +655,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
+		__field(unsigned long, iflags)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
+		__entry->iflags = ip->i_flags;
 	),
-	TP_printk("dev %d:%d ino 0x%llx",
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino)
+		  __entry->ino,
+		  __entry->iflags)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -667,6 +709,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
+DEFINE_INODE_EVENT(xfs_inode_reclaiming);
+DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
+DEFINE_INODE_EVENT(xfs_inode_inactivating);
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH, alternative] xfs: per-cpu deferred inode inactivation queues
  2021-08-03  8:34   ` [PATCH, alternative] xfs: per-cpu deferred inode inactivation queues Dave Chinner
@ 2021-08-03 20:20     ` Darrick J. Wong
  2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-08-03 20:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Tue, Aug 03, 2021 at 06:34:03PM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Move inode inactivation to background work contexts so that it no
> longer runs in the context that releases the final reference to an
> inode. This allows process work that would otherwise block on
> inactivation to continue while the filesystem processes the
> inactivation in the background.
> 
> A typical demonstration of this is unlinking an inode with lots of
> extents. The extents are removed during inactivation, so this blocks
> the process that unlinked the inode from the directory structure. By
> moving the inactivation to the background process, the userspace
> application can keep working (e.g. unlinking the next inode in the
> directory) while the inactivation work on the previous inode is
> done by a different CPU.
> 
> The implementation of the queue is relatively simple. We use a
> per-cpu lockless linked list (llist) to queue inodes for
> inactivation without requiring serialisation mechanisms, and a work
> item to allow the queue to be processed by a CPU bound worker
> thread. We also keep a count of the queue depth so that we can
> trigger work after a number of deferred inactivations have been
> queued.
> 
> The use of a bound workqueue with a single work depth allows the
> workqueue to run one work item per CPU. We queue the work item on
> the CPU we are currently running on, and so this essentially gives
> us affine per-cpu worker threads for the per-cpu queues. This
> maintains the effective CPU affinity that occurs within XFS at the
> AG level due to all objects in a directory being local to an AG.
> Hence inactivation work tends to run on the same CPU that last
> accessed all the objects that inactivation accesses and this
> maintains hot CPU caches for unlink workloads.
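
(Side note, restating the patch for context: the affinity comes from
m_inodegc_wq being allocated as a per-cpu bound workqueue with
max_active == 1, so a plain queue_work() from the queueing path runs the
worker on the submitting CPU:)

	/* From the patch: bound workqueue, one in-flight work item per CPU. */
	mp->m_inodegc_wq = alloc_workqueue("xfs-inodegc/%s",
			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
			1, mp->m_super->s_id);

	/*
	 * No WQ_UNBOUND, so queue_work() runs the work on the local CPU and
	 * each per-cpu llist is drained by a worker on the CPU that filled it.
	 */
	queue_work(mp->m_inodegc_wq, &gc->work);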
> 
> A depth of 32 inodes was chosen to match the number of inodes in an
> inode cluster buffer. This hopefully allows sequential
> allocation/unlink behaviours to defer inactivation of all the
> inodes in a single cluster buffer at a time, further helping
> maintain hot CPU and buffer cache accesses while running
> inactivations.
> 
> A hard per-cpu queue throttle of 256 inodes has been set to avoid
> runaway queuing when inodes that take a long time to inactivate are
> being processed. For example, when unlinking inodes with large
> numbers of extents that can take a lot of processing to free.
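
(To summarise the lifecycle the two thresholds above implement, condensed
from the patch below with locking and error paths elided:)

	/* producer side, xfs_inodegc_queue(): */
	ip->i_flags |= XFS_NEED_INACTIVE;
	llist_add(&ip->i_gclist, &gc->list);	/* per-cpu, lockless */
	items = READ_ONCE(gc->items);
	WRITE_ONCE(gc->items, items + 1);
	if (items > 32)				/* roughly one inode cluster */
		queue_work(mp->m_inodegc_wq, &gc->work);
	if (items > 256)			/* hard throttle */
		flush_work(&gc->work);

	/* consumer side, xfs_inodegc_worker(): */
	node = llist_del_all(&gc->list);
	llist_for_each_entry_safe(ip, n, node, i_gclist) {
		xfs_iflags_set(ip, XFS_INACTIVATING);
		xfs_inactive(ip);		/* the deferred metadata updates */
		/* ...then mark IRECLAIMABLE for background reclaim */
	}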
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> 
> Hi Darrick,
> 
> This is the current version of the per-cpu deferred queues updated
> to replace patch 3 in this series. There are no performance
> regressions that I've measured with this, and most of fstests is
> passing. There are some failures that I haven't looked at yet -
> g/055, g/102, g/219, g/226, g/233, and so on. These tests did not

Yeah, I saw all of those emitting various problems that all trace back
to ENOSPC or EDQUOT.  The various "bang on inodegc sooner than later"
patches in this series fix those problems; I think this rework will simplify
the code changes a lot, since all we need to do now is force the
queue_work and the flush_work when space/quota/memory are low.
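
Roughly something like this untested sketch (the helper and its name are
made up, not in the series as posted):

/*
 * Hypothetical: when we notice ENOSPC/EDQUOT pressure, push (and
 * optionally wait for) the per-cpu inodegc workers instead of waiting
 * for the normal batch threshold.
 */
static void
xfs_inodegc_push_if_low(	/* hypothetical name */
	struct xfs_mount	*mp,
	bool			wait)
{
	struct xfs_inodegc	*gc;
	int			cpu;

	for_each_online_cpu(cpu) {
		gc = per_cpu_ptr(mp->m_inodegc, cpu);
		if (llist_empty(&gc->list))
			continue;
		queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
		if (wait)
			flush_work(&gc->work);
	}
}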

generic/219     - output mismatch (see /var/tmp/fstests/generic/219.out.bad)
    --- tests/generic/219.out   2021-05-13 11:47:55.683860312 -0700
    +++ /var/tmp/fstests/generic/219.out.bad    2021-08-03 10:49:26.554855113 -0700
    @@ -34,4 +34,4 @@
       Size: 49152        Filetype: Regular File
       File: "SCRATCH_MNT/mmap"
       Size: 49152        Filetype: Regular File
    -Usage OK (type=g)
    +Too many blocks used (type=g)
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/219.out /var/tmp/fstests/generic/219.out.bad'  to see the entire diff)
generic/371     - output mismatch (see /var/tmp/fstests/generic/371.out.bad)
    --- tests/generic/371.out   2021-05-13 11:47:55.712860228 -0700
    +++ /var/tmp/fstests/generic/371.out.bad    2021-08-03 10:58:10.161397704 -0700
    @@ -1,2 +1,198 @@
     QA output created by 371
     Silence is golden
    +fallocate: No space left on device
    +fallocate: No space left on device
    +fallocate: No space left on device
    +fallocate: No space left on device
    +fallocate: No space left on device
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/371.out /var/tmp/fstests/generic/371.out.bad'  to see the entire diff)
generic/427     [failed, exit status 1]- output mismatch (see /var/tmp/fstests/generic/427.out.bad)
    --- tests/generic/427.out   2021-05-13 11:47:55.723860196 -0700
    +++ /var/tmp/fstests/generic/427.out.bad    2021-08-03 11:00:27.076995641 -0700
    @@ -1,2 +1,2 @@
     QA output created by 427
    -Success, all done.
    +open: No space left on device
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/427.out /var/tmp/fstests/generic/427.out.bad'  to see the entire diff)
generic/511     - output mismatch (see /var/tmp/fstests/generic/511.out.bad)
    --- tests/generic/511.out   2021-05-13 11:47:55.738860153 -0700
    +++ /var/tmp/fstests/generic/511.out.bad    2021-08-03 11:03:29.524335265 -0700
    @@ -1,2 +1,5 @@
     QA output created by 511
    +touch: cannot touch '/opt/a': No space left on device
    +Seed set to 1
    +/opt/a: No space left on device
     Silence is golden
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/511.out /var/tmp/fstests/generic/511.out.bad'  to see the entire diff)
generic/531     _check_dmesg: something found in dmesg (see /var/tmp/fstests/generic/531.dmesg)
- output mismatch (see /var/tmp/fstests/generic/531.out.bad)
    --- tests/generic/531.out   2021-05-13 11:47:55.741860145 -0700
    +++ /var/tmp/fstests/generic/531.out.bad    2021-08-03 11:04:51.329004913 -0700
    @@ -1,2 +1,9 @@
     QA output created by 531
    +open?: Input/output error
    +open?: Input/output error
    +open?: Input/output error
    +open?: No such file or directory
    +open?: Input/output error
    +open?: Input/output error
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/531.out /var/tmp/fstests/generic/531.out.bad'  to see the entire diff)
generic/536     - output mismatch (see /var/tmp/fstests/generic/536.out.bad)
    --- tests/generic/536.out   2021-05-13 11:47:55.742860142 -0700
    +++ /var/tmp/fstests/generic/536.out.bad    2021-08-03 11:04:58.832711203 -0700
    @@ -1,3 +1,5 @@
     QA output created by 536
     file.1
    +hexdump: /opt/file.1: No such file or directory
     file.2
    +hexdump: /opt/file.2: No such file or directory
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/536.out /var/tmp/fstests/generic/536.out.bad'  to see the entire diff)
generic/603     - output mismatch (see /var/tmp/fstests/generic/603.out.bad)
    --- tests/generic/603.out   2021-05-13 11:47:55.755860104 -0700
    +++ /var/tmp/fstests/generic/603.out.bad    2021-08-03 11:07:23.155347585 -0700
    @@ -18,11 +18,15 @@
     ### Initialize files, and their mode and ownership
     --- Test block quota ---
     Write 225 blocks...
    +pwrite: Disk quota exceeded
     Rewrite 250 blocks plus 1 byte, over the block softlimit...
    +pwrite: Disk quota exceeded
     Try to write 1 one more block after grace...
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/603.out /var/tmp/fstests/generic/603.out.bad'  to see the entire diff)

I noticed that xfs/264 seems to get hung up on xfs_buftarg_wait when the
dmsetup suspend freezes the fs and we try to quiesce the log.
Unfortunately, that hung my -g quick test. :/

> fail with my original "hack the queue onto the end of the series"
> patch - there were zero regressions from that patch so clearly some
> of the fixes later in this patch series are still necessary. Or I
> screwed up/missed a flush location that those tests would have
> otherwise triggered. I suspect patch 19(?) that triggers an inodegc
> flush from the blockgc flush at ENOSPC might be one of the missing
> pieces...
> 
> Hence I don't think these failures have to do with the relative lack
> of throttling, low space management or memory pressure detection.
> More tests are failing on my 16GB test VM than the 512MB test VM,
> and in general I haven't seen memory pressure have any impact on
> this queuing mechanism at all.

No surprises there, more RAM enables more laziness.

> I suspect that means most of the rest of the patchset is not
> necessary for inodegc management. I haven't yet gone through them
> to see which ones address the failures I'm seeing, so that's the
> next step here.
> 
> It would be good if you can run this through your test setups for
> this patchset to see if it behaves well in those situations. If it
> reproduces the same failures as I'm seeing, then maybe by the time
> I'm awake again you've worked out which remaining bits of the
> patchset are still required....

Will do.

--D

> Cheers,
> 
> Dave.
> 
>  fs/xfs/scrub/common.c    |   7 +
>  fs/xfs/xfs_icache.c      | 338 +++++++++++++++++++++++++++++++++++------------
>  fs/xfs/xfs_icache.h      |   5 +
>  fs/xfs/xfs_inode.h       |  20 ++-
>  fs/xfs/xfs_log_recover.c |   7 +
>  fs/xfs/xfs_mount.c       |  26 +++-
>  fs/xfs/xfs_mount.h       |  34 ++++-
>  fs/xfs/xfs_super.c       | 111 +++++++++++++++-
>  fs/xfs/xfs_trace.h       |  50 ++++++-
>  9 files changed, 505 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index 8558ca05e11d..06b697f72f23 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -884,6 +884,7 @@ xchk_stop_reaping(
>  {
>  	sc->flags |= XCHK_REAPING_DISABLED;
>  	xfs_blockgc_stop(sc->mp);
> +	xfs_inodegc_stop(sc->mp);
>  }
>  
>  /* Restart background reaping of resources. */
> @@ -891,6 +892,12 @@ void
>  xchk_start_reaping(
>  	struct xfs_scrub	*sc)
>  {
> +	/*
> +	 * Readonly filesystems do not perform inactivation, so there's no
> +	 * need to restart the worker.
> +	 */
> +	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY))
> +		xfs_inodegc_start(sc->mp);
>  	xfs_blockgc_start(sc->mp);
>  	sc->flags &= ~XCHK_REAPING_DISABLED;
>  }
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 709507cc83ae..b1c2cab3c690 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -213,7 +213,7 @@ xfs_blockgc_queue(
>  {
>  	rcu_read_lock();
>  	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG))
> -		queue_delayed_work(pag->pag_mount->m_gc_workqueue,
> +		queue_delayed_work(pag->pag_mount->m_blockgc_wq,
>  				   &pag->pag_blockgc_work,
>  				   msecs_to_jiffies(xfs_blockgc_secs * 1000));
>  	rcu_read_unlock();
> @@ -292,86 +292,6 @@ xfs_perag_clear_inode_tag(
>  	trace_xfs_perag_clear_inode_tag(mp, pag->pag_agno, tag, _RET_IP_);
>  }
>  
> -#ifdef DEBUG
> -static void
> -xfs_check_delalloc(
> -	struct xfs_inode	*ip,
> -	int			whichfork)
> -{
> -	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> -	struct xfs_bmbt_irec	got;
> -	struct xfs_iext_cursor	icur;
> -
> -	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
> -		return;
> -	do {
> -		if (isnullstartblock(got.br_startblock)) {
> -			xfs_warn(ip->i_mount,
> -	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
> -				ip->i_ino,
> -				whichfork == XFS_DATA_FORK ? "data" : "cow",
> -				got.br_startoff, got.br_blockcount);
> -		}
> -	} while (xfs_iext_next_extent(ifp, &icur, &got));
> -}
> -#else
> -#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
> -#endif
> -
> -/*
> - * We set the inode flag atomically with the radix tree tag.
> - * Once we get tag lookups on the radix tree, this inode flag
> - * can go away.
> - */
> -void
> -xfs_inode_mark_reclaimable(
> -	struct xfs_inode	*ip)
> -{
> -	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_perag	*pag;
> -	bool			need_inactive = xfs_inode_needs_inactive(ip);
> -
> -	if (!need_inactive) {
> -		/* Going straight to reclaim, so drop the dquots. */
> -		xfs_qm_dqdetach(ip);
> -	} else {
> -		xfs_inactive(ip);
> -	}
> -
> -	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
> -		xfs_check_delalloc(ip, XFS_DATA_FORK);
> -		xfs_check_delalloc(ip, XFS_COW_FORK);
> -		ASSERT(0);
> -	}
> -
> -	XFS_STATS_INC(mp, vn_reclaim);
> -
> -	/*
> -	 * We should never get here with one of the reclaim flags already set.
> -	 */
> -	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
> -	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
> -
> -	/*
> -	 * We always use background reclaim here because even if the inode is
> -	 * clean, it still may be under IO and hence we have wait for IO
> -	 * completion to occur before we can reclaim the inode. The background
> -	 * reclaim path handles this more efficiently than we can here, so
> -	 * simply let background reclaim tear down all inodes.
> -	 */
> -	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> -	spin_lock(&pag->pag_ici_lock);
> -	spin_lock(&ip->i_flags_lock);
> -
> -	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			XFS_ICI_RECLAIM_TAG);
> -	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> -
> -	spin_unlock(&ip->i_flags_lock);
> -	spin_unlock(&pag->pag_ici_lock);
> -	xfs_perag_put(pag);
> -}
> -
>  static inline void
>  xfs_inew_wait(
>  	struct xfs_inode	*ip)
> @@ -569,6 +489,15 @@ xfs_iget_cache_hit(
>  	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM))
>  		goto out_skip;
>  
> +	if (ip->i_flags & XFS_NEED_INACTIVE) {
> +		/* Unlinked inodes cannot be re-grabbed. */
> +		if (VFS_I(ip)->i_nlink == 0) {
> +			error = -ENOENT;
> +			goto out_error;
> +		}
> +		goto out_inodegc_flush;
> +	}
> +
>  	/*
>  	 * Check the inode free state is valid. This also detects lookup
>  	 * racing with unlinks.
> @@ -616,6 +545,12 @@ xfs_iget_cache_hit(
>  	spin_unlock(&ip->i_flags_lock);
>  	rcu_read_unlock();
>  	return error;
> +
> +out_inodegc_flush:
> +	spin_unlock(&ip->i_flags_lock);
> +	rcu_read_unlock();
> +	xfs_inodegc_flush(mp);
> +	return -EAGAIN;
>  }
>  
>  static int
> @@ -943,6 +878,7 @@ xfs_reclaim_inode(
>  
>  	xfs_iflags_clear(ip, XFS_IFLUSHING);
>  reclaim:
> +	trace_xfs_inode_reclaiming(ip);
>  
>  	/*
>  	 * Because we use RCU freeing we need to ensure the inode always appears
> @@ -1420,6 +1356,8 @@ xfs_blockgc_start(
>  
>  /* Don't try to run block gc on an inode that's in any of these states. */
>  #define XFS_BLOCKGC_NOGRAB_IFLAGS	(XFS_INEW | \
> +					 XFS_NEED_INACTIVE | \
> +					 XFS_INACTIVATING | \
>  					 XFS_IRECLAIMABLE | \
>  					 XFS_IRECLAIM)
>  /*
> @@ -1794,3 +1732,241 @@ xfs_icwalk(
>  	return last_error;
>  	BUILD_BUG_ON(XFS_ICWALK_PRIVATE_FLAGS & XFS_ICWALK_FLAGS_VALID);
>  }
> +
> +#ifdef DEBUG
> +static void
> +xfs_check_delalloc(
> +	struct xfs_inode	*ip,
> +	int			whichfork)
> +{
> +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> +	struct xfs_bmbt_irec	got;
> +	struct xfs_iext_cursor	icur;
> +
> +	if (!ifp || !xfs_iext_lookup_extent(ip, ifp, 0, &icur, &got))
> +		return;
> +	do {
> +		if (isnullstartblock(got.br_startblock)) {
> +			xfs_warn(ip->i_mount,
> +	"ino %llx %s fork has delalloc extent at [0x%llx:0x%llx]",
> +				ip->i_ino,
> +				whichfork == XFS_DATA_FORK ? "data" : "cow",
> +				got.br_startoff, got.br_blockcount);
> +		}
> +	} while (xfs_iext_next_extent(ifp, &icur, &got));
> +}
> +#else
> +#define xfs_check_delalloc(ip, whichfork)	do { } while (0)
> +#endif
> +
> +/* Schedule the inode for reclaim. */
> +static void
> +xfs_inodegc_set_reclaimable(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount        *mp = ip->i_mount;
> +	struct xfs_perag	*pag;
> +
> +	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
> +		xfs_check_delalloc(ip, XFS_DATA_FORK);
> +		xfs_check_delalloc(ip, XFS_COW_FORK);
> +		ASSERT(0);
> +	}
> +
> +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> +	spin_lock(&pag->pag_ici_lock);
> +	spin_lock(&ip->i_flags_lock);
> +
> +	trace_xfs_inode_set_reclaimable(ip);
> +	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> +	ip->i_flags |= XFS_IRECLAIMABLE;
> +	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> +				XFS_ICI_RECLAIM_TAG);
> +
> +	spin_unlock(&ip->i_flags_lock);
> +	spin_unlock(&pag->pag_ici_lock);
> +	xfs_perag_put(pag);
> +}
> +
> +/*
> + * Free all speculative preallocations and possibly even the inode itself.
> + * This is the last chance to make changes to an otherwise unreferenced file
> + * before incore reclamation happens.
> + */
> +static void
> +xfs_inodegc_inactivate(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount        *mp = ip->i_mount;
> +
> +	/*
> +	 * Inactivation isn't supposed to run when the fs is frozen because
> +	 * we don't want kernel threads to block on transaction allocation.
> +	 */
> +	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
> +
> +	trace_xfs_inode_inactivating(ip);
> +	xfs_inactive(ip);
> +	xfs_inodegc_set_reclaimable(ip);
> +}
> +
> +void
> +xfs_inodegc_worker(
> +	struct work_struct	*work)
> +{
> +	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
> +							work);
> +	struct llist_node	*node = llist_del_all(&gc->list);
> +	struct xfs_inode	*ip, *n;
> +
> +	trace_xfs_inodegc_worker(NULL, __return_address);
> +
> +	WRITE_ONCE(gc->items, 0);
> +	llist_for_each_entry_safe(ip, n, node, i_gclist) {
> +		xfs_iflags_set(ip, XFS_INACTIVATING);
> +		xfs_inodegc_inactivate(ip);
> +	}
> +}
> +
> +/*
> + * Force all currently queued inode inactivation work to run immediately, and
> + * wait for the work to finish. Two passes: queue all the work in a first
> + * pass, then wait for it in a second pass.
> + */
> +void
> +xfs_inodegc_flush(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_inodegc	*gc;
> +	int			cpu;
> +
> +	trace_xfs_inodegc_flush(mp, __return_address);
> +
> +	for_each_online_cpu(cpu) {
> +		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> +		queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> +	}
> +
> +	for_each_online_cpu(cpu) {
> +		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> +		flush_work(&gc->work);
> +	}
> +}
> +
> +/*
> + * Flush all the pending work and then disable the inode inactivation background
> + * workers and wait for them to stop.
> + */
> +void
> +xfs_inodegc_stop(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_inodegc	*gc;
> +	int			cpu;
> +
> +	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
> +		return;
> +
> +	xfs_inodegc_flush(mp);
> +
> +	for_each_online_cpu(cpu) {
> +		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> +		cancel_work_sync(&gc->work);
> +	}
> +	trace_xfs_inodegc_stop(mp, __return_address);
> +}
> +
> +/*
> + * Enable the inode inactivation background workers and schedule deferred inode
> + * inactivation work if there is any.
> + */
> +void
> +xfs_inodegc_start(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_inodegc	*gc;
> +	int			cpu;
> +
> +	if (test_and_set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
> +		return;
> +
> +	trace_xfs_inodegc_start(mp, __return_address);
> +	for_each_online_cpu(cpu) {
> +		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> +		if (!llist_empty(&gc->list))
> +			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> +	}
> +}
> +
> +/*
> + * Queue a background inactivation worker if there are inodes that need to be
> + * inactivated and higher level xfs code hasn't disabled the background
> + * workers.
> + */
> +static void
> +xfs_inodegc_queue(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_inodegc	*gc;
> +	int			items;
> +
> +	trace_xfs_inode_set_need_inactive(ip);
> +	spin_lock(&ip->i_flags_lock);
> +	ip->i_flags |= XFS_NEED_INACTIVE;
> +	spin_unlock(&ip->i_flags_lock);
> +
> +	gc = get_cpu_ptr(mp->m_inodegc);
> +	llist_add(&ip->i_gclist, &gc->list);
> +	items = READ_ONCE(gc->items);
> +	WRITE_ONCE(gc->items, items + 1);
> +	put_cpu_ptr(gc);
> +
> +	if (!test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
> +		return;
> +	if (items > 32) {
> +		trace_xfs_inodegc_queue(mp, __return_address);
> +		queue_work(mp->m_inodegc_wq, &gc->work);
> +	}
> +	/* throttle */
> +	if (items > 256) {
> +		trace_xfs_inodegc_throttle(mp, __return_address);
> +		flush_work(&gc->work);
> +	}
> +}
> +
> +/*
> + * We set the inode flag atomically with the radix tree tag.  Once we get tag
> + * lookups on the radix tree, this inode flag can go away.
> + *
> + * We always use background reclaim here because even if the inode is clean, it
> + * still may be under IO and hence we have to wait for IO completion to occur
> + * before we can reclaim the inode. The background reclaim path handles this
> + * more efficiently than we can here, so simply let background reclaim tear down
> + * all inodes.
> + */
> +void
> +xfs_inode_mark_reclaimable(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	bool			need_inactive;
> +
> +	XFS_STATS_INC(mp, vn_reclaim);
> +
> +	/*
> +	 * We should never get here with any of the reclaim flags already set.
> +	 */
> +	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
> +
> +	need_inactive = xfs_inode_needs_inactive(ip);
> +	if (need_inactive) {
> +		xfs_inodegc_queue(ip);
> +		return;
> +	}
> +
> +	/* Going straight to reclaim, so drop the dquots. */
> +	xfs_qm_dqdetach(ip);
> +	xfs_inodegc_set_reclaimable(ip);
> +}
> +
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index d0062ebb3f7a..c1dfc909a5b0 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -74,4 +74,9 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
>  void xfs_blockgc_stop(struct xfs_mount *mp);
>  void xfs_blockgc_start(struct xfs_mount *mp);
>  
> +void xfs_inodegc_worker(struct work_struct *work);
> +void xfs_inodegc_flush(struct xfs_mount *mp);
> +void xfs_inodegc_stop(struct xfs_mount *mp);
> +void xfs_inodegc_start(struct xfs_mount *mp);
> +
>  #endif
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index e3137bbc7b14..1f62b481d8c5 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -42,6 +42,7 @@ typedef struct xfs_inode {
>  	mrlock_t		i_lock;		/* inode lock */
>  	mrlock_t		i_mmaplock;	/* inode mmap IO lock */
>  	atomic_t		i_pincount;	/* inode pin count */
> +	struct llist_node	i_gclist;	/* deferred inactivation list */
>  
>  	/*
>  	 * Bitsets of inode metadata that have been checked and/or are sick.
> @@ -240,6 +241,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
>  #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
>  #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
>  #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
> +#define XFS_NEED_INACTIVE	(1 << 10) /* see XFS_INACTIVATING below */
>  /*
>   * If this unlinked inode is in the middle of recovery, don't let drop_inode
>   * truncate and free the inode.  This can happen if we iget the inode during
> @@ -248,6 +250,21 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
>  #define XFS_IRECOVERY		(1 << 11)
>  #define XFS_ICOWBLOCKS		(1 << 12)/* has the cowblocks tag set */
>  
> +/*
> + * If we need to update on-disk metadata before this IRECLAIMABLE inode can be
> + * freed, then NEED_INACTIVE will be set.  Once we start the updates, the
> + * INACTIVATING bit will be set to keep iget away from this inode.  After the
> + * inactivation completes, both flags will be cleared and the inode is a
> + * plain old IRECLAIMABLE inode.
> + */
> +#define XFS_INACTIVATING	(1 << 13)
> +
> +/* All inode state flags related to inode reclaim. */
> +#define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
> +				 XFS_IRECLAIM | \
> +				 XFS_NEED_INACTIVE | \
> +				 XFS_INACTIVATING)
> +
>  /*
>   * Per-lifetime flags need to be reset when re-using a reclaimable inode during
>   * inode lookup. This prevents unintended behaviour on the new inode from
> @@ -255,7 +272,8 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
>   */
>  #define XFS_IRECLAIM_RESET_FLAGS	\
>  	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
> -	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
> +	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
> +	 XFS_INACTIVATING)
>  
>  /*
>   * Flags for inode locking.
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 1721fce2ec94..a98d2429d795 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2786,6 +2786,13 @@ xlog_recover_process_iunlinks(
>  		}
>  		xfs_buf_rele(agibp);
>  	}
> +
> +	/*
> +	 * Flush the pending unlinked inodes to ensure that the inactivations
> +	 * are fully completed on disk and the incore inodes can be reclaimed
> +	 * before we signal that recovery is complete.
> +	 */
> +	xfs_inodegc_flush(mp);
>  }
>  
>  STATIC void
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index baf7b323cb15..1f7e9a608f38 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -514,7 +514,8 @@ xfs_check_summary_counts(
>   * Flush and reclaim dirty inodes in preparation for unmount. Inodes and
>   * internal inode structures can be sitting in the CIL and AIL at this point,
>   * so we need to unpin them, write them back and/or reclaim them before unmount
> - * can proceed.
> + * can proceed.  In other words, callers are required to have inactivated all
> + * inodes.
>   *
>   * An inode cluster that has been freed can have its buffer still pinned in
>   * memory because the transaction is still sitting in a iclog. The stale inodes
> @@ -546,6 +547,7 @@ xfs_unmount_flush_inodes(
>  	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
>  
>  	xfs_ail_push_all_sync(mp->m_ail);
> +	xfs_inodegc_stop(mp);
>  	cancel_delayed_work_sync(&mp->m_reclaim_work);
>  	xfs_reclaim_inodes(mp);
>  	xfs_health_unmount(mp);
> @@ -782,6 +784,9 @@ xfs_mountfs(
>  	if (error)
>  		goto out_log_dealloc;
>  
> +	/* Enable background inode inactivation workers. */
> +	xfs_inodegc_start(mp);
> +
>  	/*
>  	 * Get and sanity-check the root inode.
>  	 * Save the pointer to it in the mount structure.
> @@ -942,6 +947,15 @@ xfs_mountfs(
>  	xfs_irele(rip);
>  	/* Clean out dquots that might be in memory after quotacheck. */
>  	xfs_qm_unmount(mp);
> +
> +	/*
> +	 * Inactivate all inodes that might still be in memory after a log
> +	 * intent recovery failure so that reclaim can free them.  Metadata
> +	 * inodes and the root directory shouldn't need inactivation, but the
> +	 * mount failed for some reason, so pull down all the state and flee.
> +	 */
> +	xfs_inodegc_flush(mp);
> +
>  	/*
>  	 * Flush all inode reclamation work and flush the log.
>  	 * We have to do this /after/ rtunmount and qm_unmount because those
> @@ -989,6 +1003,16 @@ xfs_unmountfs(
>  	uint64_t		resblks;
>  	int			error;
>  
> +	/*
> +	 * Perform all on-disk metadata updates required to inactivate inodes
> +	 * that the VFS evicted earlier in the unmount process.  Freeing inodes
> +	 * and discarding CoW fork preallocations can cause shape changes to
> +	 * the free inode and refcount btrees, respectively, so we must finish
> +	 * this before we discard the metadata space reservations.  Metadata
> +	 * inodes and the root directory do not require inactivation.
> +	 */
> +	xfs_inodegc_flush(mp);
> +
>  	xfs_blockgc_stop(mp);
>  	xfs_fs_unreserve_ag_blocks(mp);
>  	xfs_qm_unmount_quotas(mp);
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index c78b63fe779a..470013a48c17 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -56,6 +56,15 @@ struct xfs_error_cfg {
>  	long		retry_timeout;	/* in jiffies, -1 = infinite */
>  };
>  
> +/*
> + * Per-cpu deferred inode inactivation GC lists.
> + */
> +struct xfs_inodegc {
> +	struct llist_head	list;
> +	struct work_struct	work;
> +	int			items;
> +};
> +
>  /*
>   * The struct xfsmount layout is optimised to separate read-mostly variables
>   * from variables that are frequently modified. We put the read-mostly variables
> @@ -82,6 +91,8 @@ typedef struct xfs_mount {
>  	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
>  	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
>  	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
> +	void __percpu		*m_inodegc;	/* percpu inodegc structures */
> +
>  	/*
>  	 * Optional cache of rt summary level per bitmap block with the
>  	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
> @@ -94,8 +105,9 @@ typedef struct xfs_mount {
>  	struct workqueue_struct	*m_unwritten_workqueue;
>  	struct workqueue_struct	*m_cil_workqueue;
>  	struct workqueue_struct	*m_reclaim_workqueue;
> -	struct workqueue_struct *m_gc_workqueue;
>  	struct workqueue_struct	*m_sync_workqueue;
> +	struct workqueue_struct *m_blockgc_wq;
> +	struct workqueue_struct *m_inodegc_wq;
>  
>  	int			m_bsize;	/* fs logical block size */
>  	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
> @@ -154,6 +166,13 @@ typedef struct xfs_mount {
>  	uint8_t			m_rt_checked;
>  	uint8_t			m_rt_sick;
>  
> +	/*
> +	 * This atomic bitset controls flags that alter the behavior of the
> +	 * filesystem.  Use only the atomic bit helper functions here; see
> +	 * XFS_OPFLAG_* for information about the actual flags.
> +	 */
> +	unsigned long		m_opflags;
> +
>  	/*
>  	 * End of read-mostly variables. Frequently written variables and locks
>  	 * should be placed below this comment from now on. The first variable
> @@ -258,6 +277,19 @@ typedef struct xfs_mount {
>  #define XFS_MOUNT_DAX_ALWAYS	(1ULL << 26)
>  #define XFS_MOUNT_DAX_NEVER	(1ULL << 27)
>  
> +/*
> + * Operation flags -- each entry here is a bit index into m_opflags and is
> + * not itself a flag value.  Use the atomic bit functions to access.
> + */
> +enum xfs_opflag_bits {
> +	/*
> +	 * If set, background inactivation worker threads will be scheduled to
> +	 * process queued inodegc work.  If not, queued inodes remain in memory
> +	 * waiting to be processed.
> +	 */
> +	XFS_OPFLAG_INODEGC_RUNNING_BIT	= 0,
> +};
> +
>  /*
>   * Max and min values for mount-option defined I/O
>   * preallocation sizes.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ef89a9a3ba9e..913d54eb4929 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -508,21 +508,29 @@ xfs_init_mount_workqueues(
>  	if (!mp->m_reclaim_workqueue)
>  		goto out_destroy_cil;
>  
> -	mp->m_gc_workqueue = alloc_workqueue("xfs-gc/%s",
> +	mp->m_blockgc_wq = alloc_workqueue("xfs-blockgc/%s",
>  			WQ_SYSFS | WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM,
>  			0, mp->m_super->s_id);
> -	if (!mp->m_gc_workqueue)
> +	if (!mp->m_blockgc_wq)
>  		goto out_destroy_reclaim;
>  
> +	mp->m_inodegc_wq = alloc_workqueue("xfs-inodegc/%s",
> +			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
> +			1, mp->m_super->s_id);
> +	if (!mp->m_inodegc_wq)
> +		goto out_destroy_blockgc;
> +
>  	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
>  			XFS_WQFLAGS(WQ_FREEZABLE), 0, mp->m_super->s_id);
>  	if (!mp->m_sync_workqueue)
> -		goto out_destroy_eofb;
> +		goto out_destroy_inodegc;
>  
>  	return 0;
>  
> -out_destroy_eofb:
> -	destroy_workqueue(mp->m_gc_workqueue);
> +out_destroy_inodegc:
> +	destroy_workqueue(mp->m_inodegc_wq);
> +out_destroy_blockgc:
> +	destroy_workqueue(mp->m_blockgc_wq);
>  out_destroy_reclaim:
>  	destroy_workqueue(mp->m_reclaim_workqueue);
>  out_destroy_cil:
> @@ -540,7 +548,8 @@ xfs_destroy_mount_workqueues(
>  	struct xfs_mount	*mp)
>  {
>  	destroy_workqueue(mp->m_sync_workqueue);
> -	destroy_workqueue(mp->m_gc_workqueue);
> +	destroy_workqueue(mp->m_blockgc_wq);
> +	destroy_workqueue(mp->m_inodegc_wq);
>  	destroy_workqueue(mp->m_reclaim_workqueue);
>  	destroy_workqueue(mp->m_cil_workqueue);
>  	destroy_workqueue(mp->m_unwritten_workqueue);
> @@ -702,6 +711,8 @@ xfs_fs_sync_fs(
>  {
>  	struct xfs_mount	*mp = XFS_M(sb);
>  
> +	trace_xfs_fs_sync_fs(mp, __return_address);
> +
>  	/*
>  	 * Doing anything during the async pass would be counterproductive.
>  	 */
> @@ -718,6 +729,25 @@ xfs_fs_sync_fs(
>  		flush_delayed_work(&mp->m_log->l_work);
>  	}
>  
> +	/*
> +	 * Flush all deferred inode inactivation work so that the free space
> +	 * counters will reflect recent deletions.  Do not force the log again
> +	 * because log recovery can restart the inactivation from the info that
> +	 * we just wrote into the ondisk log.
> +	 *
> +	 * For regular operation this isn't strictly necessary since we aren't
> +	 * required to guarantee that unlinking frees space immediately, but
> +	 * that is how XFS historically behaved.
> +	 *
> +	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
> +	 * last chance to complete the inactivation work before the filesystem
> +	 * freezes and the log is quiesced.  The background worker will not
> +	 * activate again until the fs is thawed because the VFS won't evict
> +	 * any more inodes until freeze_super drops s_umount and we disable the
> +	 * worker in xfs_fs_freeze.
> +	 */
> +	xfs_inodegc_flush(mp);
> +
>  	return 0;
>  }
>  
> @@ -832,6 +862,17 @@ xfs_fs_freeze(
>  	 */
>  	flags = memalloc_nofs_save();
>  	xfs_blockgc_stop(mp);
> +
> +	/*
> +	 * Stop the inodegc background worker.  freeze_super already flushed
> +	 * all pending inodegc work when it sync'd the filesystem after setting
> +	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
> +	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
> +	 * If the filesystem is read-write, inactivated inodes will queue but
> +	 * the worker will not run until the filesystem thaws or unmounts.
> +	 */
> +	xfs_inodegc_stop(mp);
> +
>  	xfs_save_resvblks(mp);
>  	ret = xfs_log_quiesce(mp);
>  	memalloc_nofs_restore(flags);
> @@ -847,6 +888,14 @@ xfs_fs_unfreeze(
>  	xfs_restore_resvblks(mp);
>  	xfs_log_work_queue(mp);
>  	xfs_blockgc_start(mp);
> +
> +	/*
> +	 * Don't reactivate the inodegc worker on a readonly filesystem because
> +	 * inodes are sent directly to reclaim.
> +	 */
> +	if (!(mp->m_flags & XFS_MOUNT_RDONLY))
> +		xfs_inodegc_start(mp);
> +
>  	return 0;
>  }
>  
> @@ -972,6 +1021,35 @@ xfs_destroy_percpu_counters(
>  	percpu_counter_destroy(&mp->m_delalloc_blks);
>  }
>  
> +static int
> +xfs_inodegc_init_percpu(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_inodegc	*gc;
> +	int			cpu;
> +
> +	mp->m_inodegc = alloc_percpu(struct xfs_inodegc);
> +	if (!mp->m_inodegc)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> +		init_llist_head(&gc->list);
> +		gc->items = 0;
> +                INIT_WORK(&gc->work, xfs_inodegc_worker);
> +	}
> +	return 0;
> +}
> +
> +static void
> +xfs_inodegc_free_percpu(
> +	struct xfs_mount	*mp)
> +{
> +	if (!mp->m_inodegc)
> +		return;
> +	free_percpu(mp->m_inodegc);
> +}
> +
>  static void
>  xfs_fs_put_super(
>  	struct super_block	*sb)
> @@ -988,6 +1066,7 @@ xfs_fs_put_super(
>  
>  	xfs_freesb(mp);
>  	free_percpu(mp->m_stats.xs_stats);
> +	xfs_inodegc_free_percpu(mp);
>  	xfs_destroy_percpu_counters(mp);
>  	xfs_destroy_mount_workqueues(mp);
>  	xfs_close_devices(mp);
> @@ -1359,11 +1438,15 @@ xfs_fs_fill_super(
>  	if (error)
>  		goto out_destroy_workqueues;
>  
> +	error = xfs_inodegc_init_percpu(mp);
> +	if (error)
> +		goto out_destroy_counters;
> +
>  	/* Allocate stats memory before we do operations that might use it */
>  	mp->m_stats.xs_stats = alloc_percpu(struct xfsstats);
>  	if (!mp->m_stats.xs_stats) {
>  		error = -ENOMEM;
> -		goto out_destroy_counters;
> +		goto out_destroy_inodegc;
>  	}
>  
>  	error = xfs_readsb(mp, flags);
> @@ -1566,6 +1649,8 @@ xfs_fs_fill_super(
>  	xfs_freesb(mp);
>   out_free_stats:
>  	free_percpu(mp->m_stats.xs_stats);
> + out_destroy_inodegc:
> +	xfs_inodegc_free_percpu(mp);
>   out_destroy_counters:
>  	xfs_destroy_percpu_counters(mp);
>   out_destroy_workqueues:
> @@ -1649,6 +1734,9 @@ xfs_remount_rw(
>  	if (error && error != -ENOSPC)
>  		return error;
>  
> +	/* Re-enable the background inode inactivation worker. */
> +	xfs_inodegc_start(mp);
> +
>  	return 0;
>  }
>  
> @@ -1671,6 +1759,15 @@ xfs_remount_ro(
>  		return error;
>  	}
>  
> +	/*
> +	 * Stop the inodegc background worker.  xfs_fs_reconfigure already
> +	 * flushed all pending inodegc work when it sync'd the filesystem.
> +	 * The VFS holds s_umount, so we know that inodes cannot enter
> +	 * xfs_fs_destroy_inode during a remount operation.  In readonly mode
> +	 * we send inodes straight to reclaim, so no inodes will be queued.
> +	 */
> +	xfs_inodegc_stop(mp);
> +
>  	/* Free the per-AG metadata reservation pool. */
>  	error = xfs_fs_unreserve_ag_blocks(mp);
>  	if (error) {
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 19260291ff8b..c2fac46a029b 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -157,6 +157,45 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
>  DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
>  DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
>  
> +DECLARE_EVENT_CLASS(xfs_fs_class,
> +	TP_PROTO(struct xfs_mount *mp, void *caller_ip),
> +	TP_ARGS(mp, caller_ip),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned long long, mflags)
> +		__field(unsigned long, opflags)
> +		__field(unsigned long, sbflags)
> +		__field(void *, caller_ip)
> +	),
> +	TP_fast_assign(
> +		if (mp) {
> +			__entry->dev = mp->m_super->s_dev;
> +			__entry->mflags = mp->m_flags;
> +			__entry->opflags = mp->m_opflags;
> +			__entry->sbflags = mp->m_super->s_flags;
> +		}
> +		__entry->caller_ip = caller_ip;
> +	),
> +	TP_printk("dev %d:%d m_flags 0x%llx m_opflags 0x%lx s_flags 0x%lx caller %pS",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->mflags,
> +		  __entry->opflags,
> +		  __entry->sbflags,
> +		  __entry->caller_ip)
> +);
> +
> +#define DEFINE_FS_EVENT(name)	\
> +DEFINE_EVENT(xfs_fs_class, name,					\
> +	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
> +	TP_ARGS(mp, caller_ip))
> +DEFINE_FS_EVENT(xfs_inodegc_flush);
> +DEFINE_FS_EVENT(xfs_inodegc_start);
> +DEFINE_FS_EVENT(xfs_inodegc_stop);
> +DEFINE_FS_EVENT(xfs_inodegc_worker);
> +DEFINE_FS_EVENT(xfs_inodegc_queue);
> +DEFINE_FS_EVENT(xfs_inodegc_throttle);
> +DEFINE_FS_EVENT(xfs_fs_sync_fs);
> +
>  DECLARE_EVENT_CLASS(xfs_ag_class,
>  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
>  	TP_ARGS(mp, agno),
> @@ -616,14 +655,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
>  	TP_STRUCT__entry(
>  		__field(dev_t, dev)
>  		__field(xfs_ino_t, ino)
> +		__field(unsigned long, iflags)
>  	),
>  	TP_fast_assign(
>  		__entry->dev = VFS_I(ip)->i_sb->s_dev;
>  		__entry->ino = ip->i_ino;
> +		__entry->iflags = ip->i_flags;
>  	),
> -	TP_printk("dev %d:%d ino 0x%llx",
> +	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  __entry->ino)
> +		  __entry->ino,
> +		  __entry->iflags)
>  )
>  
>  #define DEFINE_INODE_EVENT(name) \
> @@ -667,6 +709,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
>  DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
>  DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
>  DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
> +DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
> +DEFINE_INODE_EVENT(xfs_inode_reclaiming);
> +DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
> +DEFINE_INODE_EVENT(xfs_inode_inactivating);
>  
>  /*
>   * ftrace's __print_symbolic requires that all enum values be wrapped in the


* [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues
  2021-08-03  8:34   ` [PATCH, alternative] xfs: per-cpu deferred inode inactivation queues Dave Chinner
  2021-08-03 20:20     ` Darrick J. Wong
@ 2021-08-04  3:20     ` Darrick J. Wong
  2021-08-04 10:03       ` [PATCH] xfs: inodegc needs to stop before freeze Dave Chinner
                         ` (5 more replies)
  1 sibling, 6 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-08-04  3:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

For everyone else following along at home, I've posted the current draft
version of this whole thing in:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15

Here's Dave's patch reworked slightly to fix a deadlock between icreate
and inactivation; conversion to use m_opstate and related macro stamping
goodness; and some code reorganization to make it easier to add the
throttling bits in the back two thirds of the series.

IOWs, I like this patch.  The runtime for my crazy deltree benchmark
dropped from ~27 minutes to ~17 when the VM has 560M of RAM, and there's
no observable drop in performance when the VM has 16G of RAM.  I also
finally got it to run with 512M of RAM, whereas current TOT OOMs.

(Note: My crazy deltree benchmark is: I have an mdrestored sparse image
with 10m files that I use with dm-snapshot so that I can repeatedly
write to it without needing to restore the image.  Then I mount the dm
snapshot and try to delete every file in the fs.)

--D

From: Dave Chinner <dchinner@redhat.com>

xfs: per-cpu deferred inode inactivation queues

Move inode inactivation to background work contexts so that it no
longer runs in the context that releases the final reference to an
inode. This allows processes that would otherwise block on
inactivation to continue doing work while the filesystem processes
the inactivation in the background.

A typical demonstration of this is unlinking an inode with lots of
extents. The extents are removed during inactivation, so this blocks
the process that unlinked the inode from the directory structure. By
moving the inactivation to a background process, the userspace
application can keep working (e.g. unlinking the next inode in the
directory) while the inactivation work on the previous inode is
done by a different CPU.

The implementation of the queue is relatively simple. We use a
per-cpu lockless linked list (llist) to queue inodes for
inactivation without requiring serialisation mechanisms, and a work
item to allow the queue to be processed by a CPU bound worker
thread. We also keep a count of the queue depth so that we can
trigger work after a number of deferred inactivations have been
queued.
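
Stripped down, the mechanism looks something like the sketch below.
This is a simplified rendering of the code added by this patch: the
struct and field names match the patch, the helper names are
shortened, and the inode state flags, queue/flush thresholds and
tracepoints are omitted.  The real functions are in the diff that
follows.

	struct xfs_inodegc {
		struct llist_head	list;	/* lockless inode list */
		struct work_struct	work;	/* per-cpu drain worker */
		unsigned int		items;	/* approximate depth */
	};

	/* Producer side: runs when the inode is evicted. */
	static void inodegc_queue(struct xfs_mount *mp, struct xfs_inode *ip)
	{
		struct xfs_inodegc	*gc = get_cpu_ptr(mp->m_inodegc);

		llist_add(&ip->i_gclist, &gc->list);
		WRITE_ONCE(gc->items, READ_ONCE(gc->items) + 1);
		put_cpu_ptr(gc);

		/* CPU-bound workqueue, so the worker runs on this CPU */
		queue_work(mp->m_inodegc_wq, &gc->work);
	}

	/* Consumer side: drains whatever was queued on this CPU. */
	static void inodegc_worker(struct work_struct *work)
	{
		struct xfs_inodegc	*gc = container_of(work,
						struct xfs_inodegc, work);
		struct llist_node	*node = llist_del_all(&gc->list);
		struct xfs_inode	*ip, *n;

		WRITE_ONCE(gc->items, 0);
		llist_for_each_entry_safe(ip, n, node, i_gclist)
			xfs_inactive(ip);	/* then mark reclaimable */
	}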

The use of a bound workqueue with a single work depth allows the
workqueue to run one work item per CPU. We queue the work item on
the CPU we are currently running on, and so this essentially gives
us affine per-cpu worker threads for the per-cpu queues. This
maintains the effective CPU affinity that occurs within XFS at the
AG level due to all objects in a directory being local to an AG.
Hence inactivation work tends to run on the same CPU that last
accessed all the objects that inactivation accesses and this
maintains hot CPU caches for unlink workloads.

A depth of 32 inodes was chosen to match the number of inodes in an
inode cluster buffer. This hopefully allows sequential
allocation/unlink behaviours to defer inactivation of all the
inodes in a single cluster buffer at a time, further helping
maintain hot CPU and buffer cache accesses while running
inactivations.

A hard per-cpu queue throttle of 256 inodes has been set to avoid
runaway queuing when inodes that take a long time to inactivate are
being processed, for example when unlinking inodes with large
numbers of extents that take a lot of processing to free.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: tweak comments and tracepoints, convert opflags to state bits]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c    |    7 +
 fs/xfs/xfs_icache.c      |  301 ++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_icache.h      |    5 +
 fs/xfs/xfs_inode.h       |   20 +++
 fs/xfs/xfs_log_recover.c |    7 +
 fs/xfs/xfs_mount.c       |   26 ++++
 fs/xfs/xfs_mount.h       |   38 ++++++
 fs/xfs/xfs_super.c       |  113 ++++++++++++++++-
 fs/xfs/xfs_trace.h       |   53 ++++++++
 9 files changed, 517 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 8558ca05e11d..06b697f72f23 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -884,6 +884,7 @@ xchk_stop_reaping(
 {
 	sc->flags |= XCHK_REAPING_DISABLED;
 	xfs_blockgc_stop(sc->mp);
+	xfs_inodegc_stop(sc->mp);
 }
 
 /* Restart background reaping of resources. */
@@ -891,6 +892,12 @@ void
 xchk_start_reaping(
 	struct xfs_scrub	*sc)
 {
+	/*
+	 * Readonly filesystems do not perform inactivation, so there's no
+	 * need to restart the worker.
+	 */
+	if (!(sc->mp->m_flags & XFS_MOUNT_RDONLY))
+		xfs_inodegc_start(sc->mp);
 	xfs_blockgc_start(sc->mp);
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index b9214733d0c3..625dea381d0f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -213,7 +213,7 @@ xfs_blockgc_queue(
 {
 	rcu_read_lock();
 	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG))
-		queue_delayed_work(pag->pag_mount->m_gc_workqueue,
+		queue_delayed_work(pag->pag_mount->m_blockgc_wq,
 				   &pag->pag_blockgc_work,
 				   msecs_to_jiffies(xfs_blockgc_secs * 1000));
 	rcu_read_unlock();
@@ -478,17 +478,26 @@ xfs_iget_cache_hit(
 
 	/*
 	 * If we are racing with another cache hit that is currently
-	 * instantiating this inode or currently recycling it out of
-	 * reclaimable state, wait for the initialisation to complete
-	 * before continuing.
+	 * instantiating this inode, actively inactivating it, or currently
+	 * recycling it out of reclaimable state, wait for the initialisation
+	 * to complete before continuing.
 	 *
 	 * XXX(hch): eventually we should do something equivalent to
 	 *	     wait_on_inode to wait for these flags to be cleared
 	 *	     instead of polling for it.
 	 */
-	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM))
+	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING))
 		goto out_skip;
 
+	if (ip->i_flags & XFS_NEED_INACTIVE) {
+		/* Unlinked inodes cannot be re-grabbed. */
+		if (VFS_I(ip)->i_nlink == 0) {
+			error = -ENOENT;
+			goto out_error;
+		}
+		goto out_inodegc_flush;
+	}
+
 	/*
 	 * Check the inode free state is valid. This also detects lookup
 	 * racing with unlinks.
@@ -536,6 +545,12 @@ xfs_iget_cache_hit(
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 	return error;
+
+out_inodegc_flush:
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
+	xfs_inodegc_flush(mp);
+	return -EAGAIN;
 }
 
 static int
@@ -863,6 +878,7 @@ xfs_reclaim_inode(
 
 	xfs_iflags_clear(ip, XFS_IFLUSHING);
 reclaim:
+	trace_xfs_inode_reclaiming(ip);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1340,6 +1356,8 @@ xfs_blockgc_start(
 
 /* Don't try to run block gc on an inode that's in any of these states. */
 #define XFS_BLOCKGC_NOGRAB_IFLAGS	(XFS_INEW | \
+					 XFS_NEED_INACTIVE | \
+					 XFS_INACTIVATING | \
 					 XFS_IRECLAIMABLE | \
 					 XFS_IRECLAIM)
 /*
@@ -1741,25 +1759,13 @@ xfs_check_delalloc(
 #define xfs_check_delalloc(ip, whichfork)	do { } while (0)
 #endif
 
-/*
- * We set the inode flag atomically with the radix tree tag.
- * Once we get tag lookups on the radix tree, this inode flag
- * can go away.
- */
-void
-xfs_inode_mark_reclaimable(
+/* Schedule the inode for reclaim. */
+static void
+xfs_inodegc_set_reclaimable(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
-	bool			need_inactive = xfs_inode_needs_inactive(ip);
-
-	if (!need_inactive) {
-		/* Going straight to reclaim, so drop the dquots. */
-		xfs_qm_dqdetach(ip);
-	} else {
-		xfs_inactive(ip);
-	}
 
 	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
@@ -1767,30 +1773,245 @@ xfs_inode_mark_reclaimable(
 		ASSERT(0);
 	}
 
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	trace_xfs_inode_set_reclaimable(ip);
+	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
+	ip->i_flags |= XFS_IRECLAIMABLE;
+	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
+			XFS_ICI_RECLAIM_TAG);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
+}
+
+/*
+ * Free all speculative preallocations and possibly even the inode itself.
+ * This is the last chance to make changes to an otherwise unreferenced file
+ * before incore reclamation happens.
+ */
+static void
+xfs_inodegc_inactivate(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount        *mp = ip->i_mount;
+
+	/*
+	 * Inactivation isn't supposed to run when the fs is frozen because
+	 * we don't want kernel threads to block on transaction allocation.
+	 */
+	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
+
+	trace_xfs_inode_inactivating(ip);
+	xfs_inactive(ip);
+	xfs_inodegc_set_reclaimable(ip);
+}
+
+void
+xfs_inodegc_worker(
+	struct work_struct	*work)
+{
+	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
+							work);
+	struct llist_node	*node = llist_del_all(&gc->list);
+	struct xfs_inode	*ip, *n;
+
+	WRITE_ONCE(gc->items, 0);
+
+	if (!node)
+		return;
+
+	ip = llist_entry(node, struct xfs_inode, i_gclist);
+	trace_xfs_inodegc_worker(ip->i_mount, __return_address);
+
+	llist_for_each_entry_safe(ip, n, node, i_gclist) {
+		xfs_iflags_set(ip, XFS_INACTIVATING);
+		xfs_inodegc_inactivate(ip);
+	}
+}
+
+/*
+ * Force all currently queued inode inactivation work to run immediately, and
+ * wait for the work to finish. Two pass - queue all the work first pass, wait
+ * for it in a second pass.
+ */
+void
+xfs_inodegc_flush(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	trace_xfs_inodegc_flush(mp, __return_address);
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+	}
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		flush_work(&gc->work);
+	}
+}
+
+/*
+ * Flush all the pending work and then disable the inode inactivation background
+ * workers and wait for them to stop.
+ */
+void
+xfs_inodegc_stop(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	if (!xfs_clear_inodegc_enabled(mp))
+		return;
+
+	xfs_inodegc_flush(mp);
+
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		cancel_work_sync(&gc->work);
+	}
+	trace_xfs_inodegc_stop(mp, __return_address);
+}
+
+/*
+ * Enable the inode inactivation background workers and schedule deferred inode
+ * inactivation work if there is any.
+ */
+void
+xfs_inodegc_start(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	if (xfs_set_inodegc_enabled(mp))
+		return;
+
+	trace_xfs_inodegc_start(mp, __return_address);
+	for_each_online_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		if (!llist_empty(&gc->list))
+			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+	}
+}
+
+/*
+ * Schedule the inactivation worker when:
+ *
+ *  - We've accumulated more than one inode cluster buffer's worth of inodes.
+ */
+static inline bool
+xfs_inodegc_want_queue_work(
+	struct xfs_inode	*ip,
+	unsigned int		items)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (items > mp->m_ino_geo.inodes_per_cluster)
+		return true;
+
+	return false;
+}
+
+/*
+ * Upper bound on the number of inodes in each AG that can be queued for
+ * inactivation at any given time, to avoid monopolizing the workqueue.
+ */
+#define XFS_INODEGC_MAX_BACKLOG		(4 * XFS_INODES_PER_CHUNK)
+
+/*
+ * Make the frontend wait for inactivations when:
+ *
+ *  - The queue depth exceeds the maximum allowable percpu backlog.
+ */
+static inline bool
+xfs_inodegc_want_flush_work(
+	struct xfs_inode	*ip,
+	unsigned int		items)
+{
+	if (items > XFS_INODEGC_MAX_BACKLOG)
+		return true;
+
+	return false;
+}
+
+/*
+ * Queue a background inactivation worker if there are inodes that need to be
+ * inactivated and higher level xfs code hasn't disabled the background
+ * workers.
+ */
+static void
+xfs_inodegc_queue(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_inodegc	*gc;
+	int			items;
+
+	trace_xfs_inode_set_need_inactive(ip);
+	spin_lock(&ip->i_flags_lock);
+	ip->i_flags |= XFS_NEED_INACTIVE;
+	spin_unlock(&ip->i_flags_lock);
+
+	gc = get_cpu_ptr(mp->m_inodegc);
+	llist_add(&ip->i_gclist, &gc->list);
+	items = READ_ONCE(gc->items);
+	WRITE_ONCE(gc->items, items + 1);
+	put_cpu_ptr(gc);
+
+	if (!xfs_is_inodegc_enabled(mp))
+		return;
+
+	if (xfs_inodegc_want_queue_work(ip, items)) {
+		trace_xfs_inodegc_queue(mp, __return_address);
+		queue_work(mp->m_inodegc_wq, &gc->work);
+	}
+
+	if (xfs_inodegc_want_flush_work(ip, items)) {
+		trace_xfs_inodegc_throttle(mp, __return_address);
+		flush_work(&gc->work);
+	}
+}
+
+/*
+ * We set the inode flag atomically with the radix tree tag.  Once we get tag
+ * lookups on the radix tree, this inode flag can go away.
+ *
+ * We always use background reclaim here because even if the inode is clean, it
+ * still may be under IO and hence we have wait for IO completion to occur
+ * before we can reclaim the inode. The background reclaim path handles this
+ * more efficiently than we can here, so simply let background reclaim tear down
+ * all inodes.
+ */
+void
+xfs_inode_mark_reclaimable(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	bool			need_inactive;
+
 	XFS_STATS_INC(mp, vn_reclaim);
 
 	/*
-	 * We should never get here with one of the reclaim flags already set.
+	 * We should never get here with any of the reclaim flags already set.
 	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
 
-	/*
-	 * We always use background reclaim here because even if the inode is
-	 * clean, it still may be under IO and hence we have wait for IO
-	 * completion to occur before we can reclaim the inode. The background
-	 * reclaim path handles this more efficiently than we can here, so
-	 * simply let background reclaim tear down all inodes.
-	 */
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	spin_lock(&pag->pag_ici_lock);
-	spin_lock(&ip->i_flags_lock);
-
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+	need_inactive = xfs_inode_needs_inactive(ip);
+	if (need_inactive) {
+		xfs_inodegc_queue(ip);
+		return;
+	}
 
-	spin_unlock(&ip->i_flags_lock);
-	spin_unlock(&pag->pag_ici_lock);
-	xfs_perag_put(pag);
+	/* Going straight to reclaim, so drop the dquots. */
+	xfs_qm_dqdetach(ip);
+	xfs_inodegc_set_reclaimable(ip);
 }
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index d0062ebb3f7a..c1dfc909a5b0 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -74,4 +74,9 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
+void xfs_inodegc_worker(struct work_struct *work);
+void xfs_inodegc_flush(struct xfs_mount *mp);
+void xfs_inodegc_stop(struct xfs_mount *mp);
+void xfs_inodegc_start(struct xfs_mount *mp);
+
 #endif
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e3137bbc7b14..1f62b481d8c5 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -42,6 +42,7 @@ typedef struct xfs_inode {
 	mrlock_t		i_lock;		/* inode lock */
 	mrlock_t		i_mmaplock;	/* inode mmap IO lock */
 	atomic_t		i_pincount;	/* inode pin count */
+	struct llist_node	i_gclist;	/* deferred inactivation list */
 
 	/*
 	 * Bitsets of inode metadata that have been checked and/or are sick.
@@ -240,6 +241,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
+#define XFS_NEED_INACTIVE	(1 << 10) /* see XFS_INACTIVATING below */
 /*
  * If this unlinked inode is in the middle of recovery, don't let drop_inode
  * truncate and free the inode.  This can happen if we iget the inode during
@@ -248,6 +250,21 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define XFS_IRECOVERY		(1 << 11)
 #define XFS_ICOWBLOCKS		(1 << 12)/* has the cowblocks tag set */
 
+/*
+ * If we need to update on-disk metadata before this IRECLAIMABLE inode can be
+ * freed, then NEED_INACTIVE will be set.  Once we start the updates, the
+ * INACTIVATING bit will be set to keep iget away from this inode.  After the
+ * inactivation completes, both flags will be cleared and the inode is a
+ * plain old IRECLAIMABLE inode.
+ */
+#define XFS_INACTIVATING	(1 << 13)
+
+/* All inode state flags related to inode reclaim. */
+#define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
+				 XFS_IRECLAIM | \
+				 XFS_NEED_INACTIVE | \
+				 XFS_INACTIVATING)
+
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
  * inode lookup. This prevents unintended behaviour on the new inode from
@@ -255,7 +272,8 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
-	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
+	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
+	 XFS_INACTIVATING)
 
 /*
  * Flags for inode locking.
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1721fce2ec94..a98d2429d795 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2786,6 +2786,13 @@ xlog_recover_process_iunlinks(
 		}
 		xfs_buf_rele(agibp);
 	}
+
+	/*
+	 * Flush the pending unlinked inodes to ensure that the inactivations
+	 * are fully completed on disk and the incore inodes can be reclaimed
+	 * before we signal that recovery is complete.
+	 */
+	xfs_inodegc_flush(mp);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index baf7b323cb15..1f7e9a608f38 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -514,7 +514,8 @@ xfs_check_summary_counts(
  * Flush and reclaim dirty inodes in preparation for unmount. Inodes and
  * internal inode structures can be sitting in the CIL and AIL at this point,
  * so we need to unpin them, write them back and/or reclaim them before unmount
- * can proceed.
+ * can proceed.  In other words, callers are required to have inactivated all
+ * inodes.
  *
  * An inode cluster that has been freed can have its buffer still pinned in
  * memory because the transaction is still sitting in a iclog. The stale inodes
@@ -546,6 +547,7 @@ xfs_unmount_flush_inodes(
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
 
 	xfs_ail_push_all_sync(mp->m_ail);
+	xfs_inodegc_stop(mp);
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
@@ -782,6 +784,9 @@ xfs_mountfs(
 	if (error)
 		goto out_log_dealloc;
 
+	/* Enable background inode inactivation workers. */
+	xfs_inodegc_start(mp);
+
 	/*
 	 * Get and sanity-check the root inode.
 	 * Save the pointer to it in the mount structure.
@@ -942,6 +947,15 @@ xfs_mountfs(
 	xfs_irele(rip);
 	/* Clean out dquots that might be in memory after quotacheck. */
 	xfs_qm_unmount(mp);
+
+	/*
+	 * Inactivate all inodes that might still be in memory after a log
+	 * intent recovery failure so that reclaim can free them.  Metadata
+	 * inodes and the root directory shouldn't need inactivation, but the
+	 * mount failed for some reason, so pull down all the state and flee.
+	 */
+	xfs_inodegc_flush(mp);
+
 	/*
 	 * Flush all inode reclamation work and flush the log.
 	 * We have to do this /after/ rtunmount and qm_unmount because those
@@ -989,6 +1003,16 @@ xfs_unmountfs(
 	uint64_t		resblks;
 	int			error;
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes
+	 * that the VFS evicted earlier in the unmount process.  Freeing inodes
+	 * and discarding CoW fork preallocations can cause shape changes to
+	 * the free inode and refcount btrees, respectively, so we must finish
+	 * this before we discard the metadata space reservations.  Metadata
+	 * inodes and the root directory do not require inactivation.
+	 */
+	xfs_inodegc_flush(mp);
+
 	xfs_blockgc_stop(mp);
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c78b63fe779a..949d33cb270f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -56,6 +56,15 @@ struct xfs_error_cfg {
 	long		retry_timeout;	/* in jiffies, -1 = infinite */
 };
 
+/*
+ * Per-cpu deferred inode inactivation GC lists.
+ */
+struct xfs_inodegc {
+	struct llist_head	list;
+	struct work_struct	work;
+	unsigned int		items;
+};
+
 /*
  * The struct xfsmount layout is optimised to separate read-mostly variables
  * from variables that are frequently modified. We put the read-mostly variables
@@ -82,6 +91,8 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+	void __percpu		*m_inodegc;	/* percpu inodegc structures */
+
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
@@ -94,8 +105,9 @@ typedef struct xfs_mount {
 	struct workqueue_struct	*m_unwritten_workqueue;
 	struct workqueue_struct	*m_cil_workqueue;
 	struct workqueue_struct	*m_reclaim_workqueue;
-	struct workqueue_struct *m_gc_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
+	struct workqueue_struct *m_blockgc_wq;
+	struct workqueue_struct *m_inodegc_wq;
 
 	int			m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
@@ -136,6 +148,7 @@ typedef struct xfs_mount {
 	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
 						/* low free space thresholds */
+	unsigned long		m_opstate;	/* dynamic state flags */
 	bool			m_always_cow;
 	bool			m_fail_unmount;
 	bool			m_finobt_nores; /* no per-AG finobt resv. */
@@ -258,6 +271,29 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_DAX_ALWAYS	(1ULL << 26)
 #define XFS_MOUNT_DAX_NEVER	(1ULL << 27)
 
+/*
+ * If set, inactivation worker threads will be scheduled to process queued
+ * inodegc work.  If not, queued inodes remain in memory waiting to be
+ * processed.
+ */
+#define XFS_STATE_INODEGC_ENABLED	0
+
+#define __XFS_IS_STATE(name, NAME) \
+static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
+{ \
+	return test_bit(XFS_STATE_ ## NAME, &mp->m_opstate); \
+} \
+static inline bool xfs_clear_ ## name (struct xfs_mount *mp) \
+{ \
+	return test_and_clear_bit(XFS_STATE_ ## NAME, &mp->m_opstate); \
+} \
+static inline bool xfs_set_ ## name (struct xfs_mount *mp) \
+{ \
+	return test_and_set_bit(XFS_STATE_ ## NAME, &mp->m_opstate); \
+}
+
+__XFS_IS_STATE(inodegc_enabled, INODEGC_ENABLED)
+
 /*
  * Max and min values for mount-option defined I/O
  * preallocation sizes.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ef89a9a3ba9e..cc3af4f6efa8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -508,21 +508,29 @@ xfs_init_mount_workqueues(
 	if (!mp->m_reclaim_workqueue)
 		goto out_destroy_cil;
 
-	mp->m_gc_workqueue = alloc_workqueue("xfs-gc/%s",
-			WQ_SYSFS | WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM,
+	mp->m_blockgc_wq = alloc_workqueue("xfs-blockgc/%s",
+			XFS_WQFLAGS(WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM),
 			0, mp->m_super->s_id);
-	if (!mp->m_gc_workqueue)
+	if (!mp->m_blockgc_wq)
 		goto out_destroy_reclaim;
 
+	mp->m_inodegc_wq = alloc_workqueue("xfs-inodegc/%s",
+			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			1, mp->m_super->s_id);
+	if (!mp->m_inodegc_wq)
+		goto out_destroy_blockgc;
+
 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE), 0, mp->m_super->s_id);
 	if (!mp->m_sync_workqueue)
-		goto out_destroy_eofb;
+		goto out_destroy_inodegc;
 
 	return 0;
 
-out_destroy_eofb:
-	destroy_workqueue(mp->m_gc_workqueue);
+out_destroy_inodegc:
+	destroy_workqueue(mp->m_inodegc_wq);
+out_destroy_blockgc:
+	destroy_workqueue(mp->m_blockgc_wq);
 out_destroy_reclaim:
 	destroy_workqueue(mp->m_reclaim_workqueue);
 out_destroy_cil:
@@ -540,7 +548,8 @@ xfs_destroy_mount_workqueues(
 	struct xfs_mount	*mp)
 {
 	destroy_workqueue(mp->m_sync_workqueue);
-	destroy_workqueue(mp->m_gc_workqueue);
+	destroy_workqueue(mp->m_blockgc_wq);
+	destroy_workqueue(mp->m_inodegc_wq);
 	destroy_workqueue(mp->m_reclaim_workqueue);
 	destroy_workqueue(mp->m_cil_workqueue);
 	destroy_workqueue(mp->m_unwritten_workqueue);
@@ -702,6 +711,8 @@ xfs_fs_sync_fs(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	trace_xfs_fs_sync_fs(mp, __return_address);
+
 	/*
 	 * Doing anything during the async pass would be counterproductive.
 	 */
@@ -718,6 +729,25 @@ xfs_fs_sync_fs(
 		flush_delayed_work(&mp->m_log->l_work);
 	}
 
+	/*
+	 * Flush all deferred inode inactivation work so that the free space
+	 * counters will reflect recent deletions.  Do not force the log again
+	 * because log recovery can restart the inactivation from the info that
+	 * we just wrote into the ondisk log.
+	 *
+	 * For regular operation this isn't strictly necessary since we aren't
+	 * required to guarantee that unlinking frees space immediately, but
+	 * that is how XFS historically behaved.
+	 *
+	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
+	 * last chance to complete the inactivation work before the filesystem
+	 * freezes and the log is quiesced.  The background worker will not
+	 * activate again until the fs is thawed because the VFS won't evict
+	 * any more inodes until freeze_super drops s_umount and we disable the
+	 * worker in xfs_fs_freeze.
+	 */
+	xfs_inodegc_flush(mp);
+
 	return 0;
 }
 
@@ -832,6 +862,17 @@ xfs_fs_freeze(
 	 */
 	flags = memalloc_nofs_save();
 	xfs_blockgc_stop(mp);
+
+	/*
+	 * Stop the inodegc background worker.  freeze_super already flushed
+	 * all pending inodegc work when it sync'd the filesystem after setting
+	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
+	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
+	 * If the filesystem is read-write, inactivated inodes will queue but
+	 * the worker will not run until the filesystem thaws or unmounts.
+	 */
+	xfs_inodegc_stop(mp);
+
 	xfs_save_resvblks(mp);
 	ret = xfs_log_quiesce(mp);
 	memalloc_nofs_restore(flags);
@@ -847,6 +888,14 @@ xfs_fs_unfreeze(
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
 	xfs_blockgc_start(mp);
+
+	/*
+	 * Don't reactivate the inodegc worker on a readonly filesystem because
+	 * inodes are sent directly to reclaim.
+	 */
+	if (!(mp->m_flags & XFS_MOUNT_RDONLY))
+		xfs_inodegc_start(mp);
+
 	return 0;
 }
 
@@ -972,6 +1021,35 @@ xfs_destroy_percpu_counters(
 	percpu_counter_destroy(&mp->m_delalloc_blks);
 }
 
+static int
+xfs_inodegc_init_percpu(
+	struct xfs_mount	*mp)
+{
+	struct xfs_inodegc	*gc;
+	int			cpu;
+
+	mp->m_inodegc = alloc_percpu(struct xfs_inodegc);
+	if (!mp->m_inodegc)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		gc = per_cpu_ptr(mp->m_inodegc, cpu);
+		init_llist_head(&gc->list);
+		gc->items = 0;
+		INIT_WORK(&gc->work, xfs_inodegc_worker);
+	}
+	return 0;
+}
+
+static void
+xfs_inodegc_free_percpu(
+	struct xfs_mount	*mp)
+{
+	if (!mp->m_inodegc)
+		return;
+	free_percpu(mp->m_inodegc);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
@@ -988,6 +1066,7 @@ xfs_fs_put_super(
 
 	xfs_freesb(mp);
 	free_percpu(mp->m_stats.xs_stats);
+	xfs_inodegc_free_percpu(mp);
 	xfs_destroy_percpu_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
@@ -1359,11 +1438,15 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_destroy_workqueues;
 
+	error = xfs_inodegc_init_percpu(mp);
+	if (error)
+		goto out_destroy_counters;
+
 	/* Allocate stats memory before we do operations that might use it */
 	mp->m_stats.xs_stats = alloc_percpu(struct xfsstats);
 	if (!mp->m_stats.xs_stats) {
 		error = -ENOMEM;
-		goto out_destroy_counters;
+		goto out_destroy_inodegc;
 	}
 
 	error = xfs_readsb(mp, flags);
@@ -1566,6 +1649,8 @@ xfs_fs_fill_super(
 	xfs_freesb(mp);
  out_free_stats:
 	free_percpu(mp->m_stats.xs_stats);
+ out_destroy_inodegc:
+	xfs_inodegc_free_percpu(mp);
  out_destroy_counters:
 	xfs_destroy_percpu_counters(mp);
  out_destroy_workqueues:
@@ -1649,6 +1734,9 @@ xfs_remount_rw(
 	if (error && error != -ENOSPC)
 		return error;
 
+	/* Re-enable the background inode inactivation worker. */
+	xfs_inodegc_start(mp);
+
 	return 0;
 }
 
@@ -1671,6 +1759,15 @@ xfs_remount_ro(
 		return error;
 	}
 
+	/*
+	 * Stop the inodegc background worker.  xfs_fs_reconfigure already
+	 * flushed all pending inodegc work when it sync'd the filesystem.
+	 * The VFS holds s_umount, so we know that inodes cannot enter
+	 * xfs_fs_destroy_inode during a remount operation.  In readonly mode
+	 * we send inodes straight to reclaim, so no inodes will be queued.
+	 */
+	xfs_inodegc_stop(mp);
+
 	/* Free the per-AG metadata reservation pool. */
 	error = xfs_fs_unreserve_ag_blocks(mp);
 	if (error) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 19260291ff8b..bd8abb50b33a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -157,6 +157,48 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_put);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
 
+#define XFS_STATE_FLAGS \
+	{ (1UL << XFS_STATE_INODEGC_ENABLED),		"inodegc" }
+
+DECLARE_EVENT_CLASS(xfs_fs_class,
+	TP_PROTO(struct xfs_mount *mp, void *caller_ip),
+	TP_ARGS(mp, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, mflags)
+		__field(unsigned long, opstate)
+		__field(unsigned long, sbflags)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		if (mp) {
+			__entry->dev = mp->m_super->s_dev;
+			__entry->mflags = mp->m_flags;
+			__entry->opstate = mp->m_opstate;
+			__entry->sbflags = mp->m_super->s_flags;
+		}
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d m_flags 0x%llx opstate (%s) s_flags 0x%lx caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->mflags,
+		  __print_flags(__entry->opstate, "|", XFS_STATE_FLAGS),
+		  __entry->sbflags,
+		  __entry->caller_ip)
+);
+
+#define DEFINE_FS_EVENT(name)	\
+DEFINE_EVENT(xfs_fs_class, name,					\
+	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
+	TP_ARGS(mp, caller_ip))
+DEFINE_FS_EVENT(xfs_inodegc_flush);
+DEFINE_FS_EVENT(xfs_inodegc_start);
+DEFINE_FS_EVENT(xfs_inodegc_stop);
+DEFINE_FS_EVENT(xfs_inodegc_worker);
+DEFINE_FS_EVENT(xfs_inodegc_queue);
+DEFINE_FS_EVENT(xfs_inodegc_throttle);
+DEFINE_FS_EVENT(xfs_fs_sync_fs);
+
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
 	TP_ARGS(mp, agno),
@@ -616,14 +658,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
+		__field(unsigned long, iflags)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
+		__entry->iflags = ip->i_flags;
 	),
-	TP_printk("dev %d:%d ino 0x%llx",
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino)
+		  __entry->ino,
+		  __entry->iflags)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -667,6 +712,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
+DEFINE_INODE_EVENT(xfs_inode_reclaiming);
+DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
+DEFINE_INODE_EVENT(xfs_inode_inactivating);
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the


* [PATCH] xfs: inodegc needs to stop before freeze
  2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
@ 2021-08-04 10:03       ` Dave Chinner
  2021-08-04 12:37         ` Dave Chinner
  2021-08-04 10:46       ` [PATCH] xfs: don't run inodegc flushes when inodegc is not active Dave Chinner
                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 10:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Tue, Aug 03, 2021 at 08:20:30PM -0700, Darrick J. Wong wrote:
> For everyone else following along at home, I've posted the current draft
> version of this whole thing in:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15
> 
> Here's Dave's patch reworked slightly to fix a deadlock between icreate
> and inactivation; conversion to use m_opstate and related macro stamping
> goodness; and some code reorganization to make it easier to add the
> throttling bits in the back two thirds of the series.
> 
> IOWs, I like this patch.  The runtime for my crazy deltree benchmark
> dropped from ~27 minutes to ~17 when the VM has 560M of RAM, and there's
> no observable drop in performance when the VM has 16G of RAM.  I also
> finally got it to run with 512M of RAM, whereas current TOT OOMs.
> 
> (Note: My crazy deltree benchmark is: I have a mdrestored sparse image
> with 10m files that I use dm-snapshot so that I can repeatedly write to
> it without needing to restore the image.  Then I mount the dm snapshot,
> and try to delete every file in the fs.)
....

Ok, so xfs/517 fails with a freeze assert:

XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1861

> @@ -718,6 +729,25 @@ xfs_fs_sync_fs(
>  		flush_delayed_work(&mp->m_log->l_work);
>  	}
>  
> +	/*
> +	 * Flush all deferred inode inactivation work so that the free space
> +	 * counters will reflect recent deletions.  Do not force the log again
> +	 * because log recovery can restart the inactivation from the info that
> +	 * we just wrote into the ondisk log.
> +	 *
> +	 * For regular operation this isn't strictly necessary since we aren't
> +	 * required to guarantee that unlinking frees space immediately, but
> +	 * that is how XFS historically behaved.
> +	 *
> +	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
> +	 * last chance to complete the inactivation work before the filesystem
> +	 * freezes and the log is quiesced.  The background worker will not
> +	 * activate again until the fs is thawed because the VFS won't evict
> +	 * any more inodes until freeze_super drops s_umount and we disable the
> +	 * worker in xfs_fs_freeze.
> +	 */
> +	xfs_inodegc_flush(mp);

How does s_umount prevent __fput() from dropping the last reference
to an unlinked inode and putting it through evict() and hence adding
it to the deferred list that way?

Remember that the flush does not guarantee the per-cpu queues are
empty when it returns, just that whatever is in each percpu queue at
the time the per-cpu work is run has been completed.  We haven't yet
gone to SB_FREEZE_FS, so the transaction subsystem won't be frozen
at this point. Hence I can't see anything that would prevent unlinks
racing with this flush and queueing work after the flush work drains
the queues and starts processing the inodes it drained.
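
To make that window concrete, here's a hypothetical interleaving (an
illustration, not a trace from the failing test):

    unlink/evict task                        per-cpu inodegc worker
    -----------------                        ----------------------
                                             node = llist_del_all(&gc->list)
                                             start inactivating 'node'
    llist_add(&ip->i_gclist, &gc->list)
    queue_work(mp->m_inodegc_wq, &gc->work)
                                             finish 'node', return
            <-- xfs_inodegc_flush() called from ->sync_fs returns here
    freeze proceeds to SB_FREEZE_FS
                                             worker runs again and calls
                                             xfs_inactive() on a frozen fs

The inode added after llist_del_all() survives the flush, and nothing
stops its worker from running after the transaction subsystem is
frozen, which is exactly the assert that fired above.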

> +
>  	return 0;
>  }
>  
> @@ -832,6 +862,17 @@ xfs_fs_freeze(
>  	 */
>  	flags = memalloc_nofs_save();
>  	xfs_blockgc_stop(mp);
> +
> +	/*
> +	 * Stop the inodegc background worker.  freeze_super already flushed
> +	 * all pending inodegc work when it sync'd the filesystem after setting
> +	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
> +	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
> +	 * If the filesystem is read-write, inactivated inodes will queue but
> +	 * the worker will not run until the filesystem thaws or unmounts.
> +	 */
> +	xfs_inodegc_stop(mp);

.... and so we end up with this flush blocked on the background work
that hit the assert failure and BUG()d:

[  219.511172] task:xfs_io          state:D stack:14208 pid:10238 ppid:  9089 flags:0x00004004
[  219.513126] Call Trace:
[  219.513768]  __schedule+0x310/0x9f0
[  219.514628]  schedule+0x67/0xe0
[  219.515405]  schedule_timeout+0x114/0x160
[  219.516404]  ? _raw_spin_unlock_irqrestore+0x12/0x40
[  219.517622]  ? do_raw_spin_unlock+0x57/0xb0
[  219.518655]  __wait_for_common+0xc0/0x160
[  219.519638]  ? usleep_range+0xa0/0xa0
[  219.520545]  wait_for_completion+0x24/0x30
[  219.521544]  flush_work+0x58/0x70
[  219.522357]  ? flush_workqueue_prep_pwqs+0x140/0x140
[  219.523553]  xfs_inodegc_flush+0x88/0x100
[  219.524524]  xfs_inodegc_stop+0x28/0xb0
[  219.525514]  xfs_fs_freeze+0x40/0x70
[  219.526401]  freeze_super+0xd8/0x140
[  219.527277]  do_vfs_ioctl+0x784/0x890
[  219.528146]  __x64_sys_ioctl+0x6f/0xc0
[  219.529062]  do_syscall_64+0x35/0x80
[  219.529974]  entry_SYSCALL_64_after_hwframe+0x44/0xae

At this point, we are at SB_FREEZE_FS and the transaction system is
shut down, so this is a hard fail.

ISTR a discussion about this in the past - I think we need to hook
->freeze_super() and run xfs_inodegc_stop() before we run
freeze_super(). That way we end up just queuing pending
inactivations while the freeze runs and completes.

The patch below does this (applied on top of your entire stack) and
it seems to fix the 517 failure (0 failures in 50 runs vs a 100%
failure rate without the patch).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: inodegc needs to stop before freeze

From: Dave Chinner <dchinner@redhat.com>

There is no way to safely stop background inactivation
during a freeze without racing. We can't rely on flushes during
->sync_fs to empty the queues because the freeze hasn't yet frozen
the transaction subsystem. Hence unlinks can race with freeze
resulting in inactivations being processed after the transaction
system has been frozen. That leads to assert failures like:

XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1861

So, shut down inodegc before the freeze machinery starts by hooking
->freeze_super(). This ensures that no inactivation takes place
while we are freezing and hence we avoid the freeze vs inactivation
races altogether.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_super.c | 47 +++++++++++++++++------------------------------
 1 file changed, 17 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ec0165513c60..c251679e8514 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -750,26 +750,6 @@ xfs_fs_sync_fs(
 		 */
 		flush_delayed_work(&mp->m_log->l_work);
 	}
-
-	/*
-	 * Flush all deferred inode inactivation work so that the free space
-	 * counters will reflect recent deletions.  Do not force the log again
-	 * because log recovery can restart the inactivation from the info that
-	 * we just wrote into the ondisk log.
-	 *
-	 * For regular operation this isn't strictly necessary since we aren't
-	 * required to guarantee that unlinking frees space immediately, but
-	 * that is how XFS historically behaved.
-	 *
-	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
-	 * last chance to complete the inactivation work before the filesystem
-	 * freezes and the log is quiesced.  The background worker will not
-	 * activate again until the fs is thawed because the VFS won't evict
-	 * any more inodes until freeze_super drops s_umount and we disable the
-	 * worker in xfs_fs_freeze.
-	 */
-	xfs_inodegc_flush(mp);
-
 	return 0;
 }
 
@@ -888,16 +868,6 @@ xfs_fs_freeze(
 	flags = memalloc_nofs_save();
 	xfs_blockgc_stop(mp);
 
-	/*
-	 * Stop the inodegc background worker.  freeze_super already flushed
-	 * all pending inodegc work when it sync'd the filesystem after setting
-	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
-	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
-	 * If the filesystem is read-write, inactivated inodes will queue but
-	 * the worker will not run until the filesystem thaws or unmounts.
-	 */
-	xfs_inodegc_stop(mp);
-
 	xfs_save_resvblks(mp);
 	ret = xfs_log_quiesce(mp);
 	memalloc_nofs_restore(flags);
@@ -927,6 +897,22 @@ xfs_fs_unfreeze(
 	return 0;
 }
 
+static int
+xfs_fs_freeze_super(
+	struct super_block	*sb)
+{
+	struct xfs_mount	*mp = XFS_M(sb);
+	int			error;
+
+	xfs_inodegc_stop(mp);
+	error = freeze_super(sb);
+	if (error) {
+		if (!(mp->m_flags & XFS_MOUNT_RDONLY))
+			xfs_inodegc_start(mp);
+	}
+	return error;
+}
+
 /*
  * This function fills in xfs_mount_t fields based on mount args.
  * Note: the superblock _has_ now been read in.
@@ -1130,6 +1116,7 @@ static const struct super_operations xfs_super_operations = {
 	.drop_inode		= xfs_fs_drop_inode,
 	.put_super		= xfs_fs_put_super,
 	.sync_fs		= xfs_fs_sync_fs,
+	.freeze_super		= xfs_fs_freeze_super,
 	.freeze_fs		= xfs_fs_freeze,
 	.unfreeze_fs		= xfs_fs_unfreeze,
 	.statfs			= xfs_fs_statfs,


* [PATCH] xfs: don't run inodegc flushes when inodegc is not active
  2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
  2021-08-04 10:03       ` [PATCH] xfs: inodegc needs to stop before freeze Dave Chinner
@ 2021-08-04 10:46       ` Dave Chinner
  2021-08-04 16:20         ` Darrick J. Wong
  2021-08-04 11:09       ` [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues Dave Chinner
                         ` (3 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 10:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch


From: Dave Chinner <dchinner@redhat.com>

A flush trigger on a frozen filesystem (e.g. from statfs)
will run queued inactivations and trip an assert failure like this:

XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1861

Bug exposed by xfs/011.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 92006260fe90..f772f2a67a8b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1893,8 +1893,8 @@ xfs_inodegc_worker(
  * wait for the work to finish. Two pass - queue all the work first pass, wait
  * for it in a second pass.
  */
-void
-xfs_inodegc_flush(
+static void
+__xfs_inodegc_flush(
 	struct xfs_mount	*mp)
 {
 	struct xfs_inodegc	*gc;
@@ -1913,6 +1913,14 @@ xfs_inodegc_flush(
 	}
 }
 
+void
+xfs_inodegc_flush(
+	struct xfs_mount	*mp)
+{
+	if (xfs_is_inodegc_enabled(mp))
+		__xfs_inodegc_flush(mp);
+}
+
 /*
  * Flush all the pending work and then disable the inode inactivation background
  * workers and wait for them to stop.
@@ -1927,7 +1935,7 @@ xfs_inodegc_stop(
 	if (!xfs_clear_inodegc_enabled(mp))
 		return;
 
-	xfs_inodegc_flush(mp);
+	__xfs_inodegc_flush(mp);
 
 	for_each_online_cpu(cpu) {
 		gc = per_cpu_ptr(mp->m_inodegc, cpu);


* Re: [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues
  2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
  2021-08-04 10:03       ` [PATCH] xfs: inodegc needs to stop before freeze Dave Chinner
  2021-08-04 10:46       ` [PATCH] xfs: don't run inodegc flushes when inodegc is not active Dave Chinner
@ 2021-08-04 11:09       ` Dave Chinner
  2021-08-04 15:59         ` Darrick J. Wong
  2021-08-04 11:49       ` [PATCH, pre-03/20 #1] xfs: introduce CPU hotplug infrastructure Dave Chinner
                         ` (2 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 11:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Tue, Aug 03, 2021 at 08:20:30PM -0700, Darrick J. Wong wrote:
> For everyone else following along at home, I've posted the current draft
> version of this whole thing in:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15

Overall looks good - fixes to freeze problems I hit are found
in other replies to this.

I omitted the commits:

xfs: queue inodegc worker immediately when memory is tight
xfs: throttle inode inactivation queuing on memory reclaim

in my test kernel because I think they are unnecessary.

I think the first is unnecessary because reclaim of inodes from the
VFS is usually in large batches and so early triggers aren't
desirable when we're getting thousands of inodes being evicted by
the superblock shrinker at a time. If we've only got a handful of
inodes queued, then inactivating them early isn't going to make much
of an impact on free memory. I could be wrong, but so far I have no
evidence that expediting inactivation is necessary.

The second patch is the custom shrinker. Again, I just don't think
this is necessary because if there is any amount of inactivation of
evicted inodes needed due to reclaim, we'll already be triggering it
to run via the deferred queue flush thresholds. Hence we don't
really need any mechanism to tell us that there is memory pressure;
the deferred work reacts to eviction from reclaim in exactly the
same way it reacts to eviction from unlink....

I've been running the patchset without these two patches on my 512MB
test VM, and the only OOM kill I get from fstests is g/531. This is
the "many open-but-unlinked" test, which creates 50,000 open
unlinked files per CPU. So for this test VM which has 4 CPUs, that's
200,000 open, dirty iunlinked inodes and a lot of pinned inode
cluster buffers. At ~2kB of memory per unlinked inode (ignoring the
cluster buffers) this would consume about 400MB of the 512MB of RAM
the VM has. It OOM kills the test programs that hold the open files
long before it gets to 200,000 files, so this test never passed
before this patchset on this machine...

I have a couple of extra patches to set up per-cpu hotplug
infrastructure before the deferred inode inactivation patch - I'll
post them after I finish this email. I'm going to leave it running
tests overnight.

Darrick, I'm pretty happy with the way the patchset is behaving now.
If you want to fold in the bug fixes I've posted and add in
the hotplug patches, then I think it's ready to be posted in full
again (if it all passes your testing) for review.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* [PATCH, pre-03/20 #1] xfs: introduce CPU hotplug infrastructure
  2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
                         ` (2 preceding siblings ...)
  2021-08-04 11:09       ` [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues Dave Chinner
@ 2021-08-04 11:49       ` Dave Chinner
  2021-08-04 11:50       ` [PATCH, pre-03/20 #2] xfs: introduce all-mounts list for cpu hotplug notifications Dave Chinner
  2021-08-04 11:52       ` [PATCH, post-03/20 1/1] xfs: hook up inodegc to CPU dead notification Dave Chinner
  5 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 11:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

From: Dave Chinner <dchinner@redhat.com>

We need to move to per-cpu state for both deferred inode
inactivation and CIL tracking, but to do that we
need to handle CPUs being removed from the system by the hot-plug
code. Introduce generic XFS infrastructure to handle CPU hotplug
events that is set up at module init time and torn down at module
exit time.

Initially, we only need CPU dead notifications, so we only set
up a callback for those. The infrastructure can easily be updated
in the future for other CPU hotplug state machine notifications if
ever needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_super.c         | 38 +++++++++++++++++++++++++++++++++++++-
 include/linux/cpuhotplug.h |  1 +
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8b7a9895b4a2..5232c808287b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2111,6 +2111,35 @@ xfs_destroy_workqueues(void)
 	destroy_workqueue(xfs_alloc_wq);
 }
 
+static int
+xfs_cpu_dead(
+	unsigned int		cpu)
+{
+	return 0;
+}
+
+static int __init
+xfs_cpu_hotplug_init(void)
+{
+	int	error;
+
+	error = cpuhp_setup_state_nocalls(CPUHP_XFS_DEAD,
+					"xfs:dead", NULL,
+					xfs_cpu_dead);
+	if (error < 0) {
+		xfs_alert(NULL,
+"Failed to initialise CPU hotplug, error %d. XFS is non-functional.",
+			error);
+	}
+	return error;
+}
+
+static void
+xfs_cpu_hotplug_destroy(void)
+{
+	cpuhp_remove_state_nocalls(CPUHP_XFS_DEAD);
+}
+
 STATIC int __init
 init_xfs_fs(void)
 {
@@ -2123,10 +2152,14 @@ init_xfs_fs(void)
 
 	xfs_dir_startup();
 
-	error = xfs_init_zones();
+	error = xfs_cpu_hotplug_init();
 	if (error)
 		goto out;
 
+	error = xfs_init_zones();
+	if (error)
+		goto out_destroy_hp;
+
 	error = xfs_init_workqueues();
 	if (error)
 		goto out_destroy_zones;
@@ -2206,6 +2239,8 @@ init_xfs_fs(void)
 	xfs_destroy_workqueues();
  out_destroy_zones:
 	xfs_destroy_zones();
+ out_destroy_hp:
+	xfs_cpu_hotplug_destroy();
  out:
 	return error;
 }
@@ -2228,6 +2263,7 @@ exit_xfs_fs(void)
 	xfs_destroy_workqueues();
 	xfs_destroy_zones();
 	xfs_uuid_table_free();
+	xfs_cpu_hotplug_destroy();
 }
 
 module_init(init_xfs_fs);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index f39b34b13871..439adc05be4e 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -52,6 +52,7 @@ enum cpuhp_state {
 	CPUHP_FS_BUFF_DEAD,
 	CPUHP_PRINTK_DEAD,
 	CPUHP_MM_MEMCQ_DEAD,
+	CPUHP_XFS_DEAD,
 	CPUHP_PERCPU_CNT_DEAD,
 	CPUHP_RADIX_DEAD,
 	CPUHP_PAGE_ALLOC,

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH, pre-03/20 #2] xfs: introduce all-mounts list for cpu hotplug notifications
  2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
                         ` (3 preceding siblings ...)
  2021-08-04 11:49       ` [PATCH, pre-03/20 #1] xfs: introduce CPU hotplug infrastructure Dave Chinner
@ 2021-08-04 11:50       ` Dave Chinner
  2021-08-04 16:06         ` Darrick J. Wong
  2021-08-04 11:52       ` [PATCH, post-03/20 1/1] xfs: hook up inodegc to CPU dead notification Dave Chinner
  5 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 11:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch


From: Dave Chinner <dchinner@redhat.com>

The inode inactivation and CIL tracking percpu structures are
per-xfs_mount structures. That means when we get a CPU dead
notification, we need to then iterate all the per-cpu structure
instances to process them. Rather than keeping linked lists of
per-cpu structures in each subsystem, add a list of all xfs_mounts
that the generic xfs_cpu_dead() function will iterate and call into
each subsystem appropriately.

This allows us to handle both per-mount and global XFS percpu state
from xfs_cpu_dead(), and avoids the need to link subsystem
structures that can be easily found from the xfs_mount into their
own global lists.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h |  1 +
 fs/xfs/xfs_super.c | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c78b63fe779a..ed7064596f94 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -82,6 +82,7 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+	struct list_head	m_mount_list;	/* global mount list */
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ffe1ecd048db..c27df85212d4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -49,6 +49,28 @@ static struct kset *xfs_kset;		/* top-level xfs sysfs dir */
 static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
 #endif
 
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(xfs_mount_list);
+static DEFINE_SPINLOCK(xfs_mount_list_lock);
+
+static inline void xfs_mount_list_add(struct xfs_mount *mp)
+{
+	spin_lock(&xfs_mount_list_lock);
+	list_add(&mp->m_mount_list, &xfs_mount_list);
+	spin_unlock(&xfs_mount_list_lock);
+}
+
+static inline void xfs_mount_list_del(struct xfs_mount *mp)
+{
+	spin_lock(&xfs_mount_list_lock);
+	list_del(&mp->m_mount_list);
+	spin_unlock(&xfs_mount_list_lock);
+}
+#else /* !CONFIG_HOTPLUG_CPU */
+static inline void xfs_mount_list_add(struct xfs_mount *mp) {}
+static inline void xfs_mount_list_del(struct xfs_mount *mp) {}
+#endif
+
 enum xfs_dax_mode {
 	XFS_DAX_INODE = 0,
 	XFS_DAX_ALWAYS = 1,
@@ -988,6 +1010,7 @@ xfs_fs_put_super(
 
 	xfs_freesb(mp);
 	free_percpu(mp->m_stats.xs_stats);
+	xfs_mount_list_del(mp);
 	xfs_destroy_percpu_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
@@ -1359,6 +1382,8 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_destroy_workqueues;
 
+	xfs_mount_list_add(mp);
+
 	/* Allocate stats memory before we do operations that might use it */
 	mp->m_stats.xs_stats = alloc_percpu(struct xfsstats);
 	if (!mp->m_stats.xs_stats) {
@@ -1567,6 +1592,7 @@ xfs_fs_fill_super(
  out_free_stats:
 	free_percpu(mp->m_stats.xs_stats);
  out_destroy_counters:
+	xfs_mount_list_del(mp);
 	xfs_destroy_percpu_counters(mp);
  out_destroy_workqueues:
 	xfs_destroy_mount_workqueues(mp);
@@ -2061,10 +2087,20 @@ xfs_destroy_workqueues(void)
 	destroy_workqueue(xfs_alloc_wq);
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
 static int
 xfs_cpu_dead(
 	unsigned int		cpu)
 {
+	struct xfs_mount	*mp, *n;
+
+	spin_lock(&xfs_mount_list_lock);
+	list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
+		spin_unlock(&xfs_mount_list_lock);
+		/* xfs_subsys_dead(mp, cpu); */
+		spin_lock(&xfs_mount_list_lock);
+	}
+	spin_unlock(&xfs_mount_list_lock);
 	return 0;
 }
 
@@ -2090,6 +2126,11 @@ xfs_cpu_hotplug_destroy(void)
 	cpuhp_remove_state_nocalls(CPUHP_XFS_DEAD);
 }
 
+#else /* !CONFIG_HOTPLUG_CPU */
+static inline int xfs_cpu_hotplug_init(struct xfs_cil *cil) { return 0; }
+static inline void xfs_cpu_hotplug_destroy(struct xfs_cil *cil) {}
+#endif
+
 STATIC int __init
 init_xfs_fs(void)
 {

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH, post-03/20 1/1] xfs: hook up inodegc to CPU dead notification
  2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
                         ` (4 preceding siblings ...)
  2021-08-04 11:50       ` [PATCH, pre-03/20 #2] xfs: introduce all-mounts list for cpu hotplug notifications Dave Chinner
@ 2021-08-04 11:52       ` Dave Chinner
  2021-08-04 16:19         ` Darrick J. Wong
  5 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 11:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch


From: Dave Chinner <dchinner@redhat.com>

So we don't leave queued inodes on a CPU we won't ever flush.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.h |  1 +
 fs/xfs/xfs_super.c  |  2 +-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f772f2a67a8b..9e2c95903c68 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1966,6 +1966,42 @@ xfs_inodegc_start(
 	}
 }
 
+/*
+ * Fold the dead CPU inodegc queue into the current CPU's queue.
+ */
+void
+xfs_inodegc_cpu_dead(
+	struct xfs_mount	*mp,
+	int			dead_cpu)
+{
+	struct xfs_inodegc	*dead_gc, *gc;
+	struct llist_node	*first, *last;
+	int			count = 0;
+
+	dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
+	cancel_work_sync(&dead_gc->work);
+
+	if (llist_empty(&dead_gc->list))
+		return;
+
+	first = dead_gc->list.first;
+	last = first;
+	while (last->next) {
+		last = last->next;
+		count++;
+	}
+	dead_gc->list.first = NULL;
+	dead_gc->items = 0;
+
+	/* Add pending work to current CPU */
+	gc = get_cpu_ptr(mp->m_inodegc);
+	llist_add_batch(first, last, &gc->list);
+	count += READ_ONCE(gc->items);
+	WRITE_ONCE(gc->items, count);
+	put_cpu_ptr(gc);
+	queue_work(mp->m_inodegc_wq, &gc->work);
+}
+
 #ifdef CONFIG_XFS_RT
 static inline bool
 xfs_inodegc_want_queue_rt_file(
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index bdf2a8d3fdd5..853d5bfc0cfb 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -79,5 +79,6 @@ void xfs_inodegc_worker(struct work_struct *work);
 void xfs_inodegc_flush(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
+void xfs_inodegc_cpu_dead(struct xfs_mount *mp, int cpu);
 
 #endif
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c251679e8514..f579ec49eb7a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2187,7 +2187,7 @@ xfs_cpu_dead(
 	spin_lock(&xfs_mount_list_lock);
 	list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
 		spin_unlock(&xfs_mount_list_lock);
-		/* xfs_subsys_dead(mp, cpu); */
+		xfs_inodegc_cpu_dead(mp, cpu);
 		spin_lock(&xfs_mount_list_lock);
 	}
 	spin_unlock(&xfs_mount_list_lock);

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] xfs: inodegc needs to stop before freeze
  2021-08-04 10:03       ` [PATCH] xfs: inodegc needs to stop before freeze Dave Chinner
@ 2021-08-04 12:37         ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 12:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 08:03:28PM +1000, Dave Chinner wrote:
> On Tue, Aug 03, 2021 at 08:20:30PM -0700, Darrick J. Wong wrote:
> > For everyone else following along at home, I've posted the current draft
> > version of this whole thing in:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15
> > 
> > Here's Dave's patch reworked slightly to fix a deadlock between icreate
> > and inactivation; conversion to use m_opstate and related macro stamping
> > goodness; and some code reorganization to make it easier to add the
> > throttling bits in the back two thirds of the series.
> > 
> > IOWs, I like this patch.  The runtime for my crazy deltree benchmark
> > dropped from ~27 minutes to ~17 when the VM has 560M of RAM, and there's
> > no observable drop in performance when the VM has 16G of RAM.  I also
> > finally got it to run with 512M of RAM, whereas current TOT OOMs.
> > 
> > (Note: My crazy deltree benchmark is: I have an mdrestored sparse image
> > with 10m files on which I use dm-snapshot so that I can repeatedly write to
> > it without needing to restore the image.  Then I mount the dm snapshot,
> > and try to delete every file in the fs.)
> ....
> 
> Ok, so xfs/517 fails with a freeze assert:
> 
> XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1861
> 
> > @@ -718,6 +729,25 @@ xfs_fs_sync_fs(
> >  		flush_delayed_work(&mp->m_log->l_work);
> >  	}
> >  
> > +	/*
> > +	 * Flush all deferred inode inactivation work so that the free space
> > +	 * counters will reflect recent deletions.  Do not force the log again
> > +	 * because log recovery can restart the inactivation from the info that
> > +	 * we just wrote into the ondisk log.
> > +	 *
> > +	 * For regular operation this isn't strictly necessary since we aren't
> > +	 * required to guarantee that unlinking frees space immediately, but
> > +	 * that is how XFS historically behaved.
> > +	 *
> > +	 * If, however, the filesystem is at FREEZE_PAGEFAULTS, this is our
> > +	 * last chance to complete the inactivation work before the filesystem
> > +	 * freezes and the log is quiesced.  The background worker will not
> > +	 * activate again until the fs is thawed because the VFS won't evict
> > +	 * any more inodes until freeze_super drops s_umount and we disable the
> > +	 * worker in xfs_fs_freeze.
> > +	 */
> > +	xfs_inodegc_flush(mp);
> 
> How does s_umount prevent __fput() from dropping the last reference
> to an unlinked inode and putting it through evict() and hence adding
> it to the deferred list that way?
> 
> Remember that the flush does not guarantee the per-cpu queues are
> empty when it returns, just that whatever is in each percpu queue at
> the time the per-cpu work is run has been completed.  We haven't yet
> gone to SB_FREEZE_FS, so the transaction subsystem won't be frozen
> at this point. Hence I can't see anything that would prevent unlinks
> racing with this flush and queueing work after the flush work drains
> the queues and starts processing the inodes it drained.
> 
> > +
> >  	return 0;
> >  }
> >  
> > @@ -832,6 +862,17 @@ xfs_fs_freeze(
> >  	 */
> >  	flags = memalloc_nofs_save();
> >  	xfs_blockgc_stop(mp);
> > +
> > +	/*
> > +	 * Stop the inodegc background worker.  freeze_super already flushed
> > +	 * all pending inodegc work when it sync'd the filesystem after setting
> > +	 * SB_FREEZE_PAGEFAULTS, and it holds s_umount, so we know that inodes
> > +	 * cannot enter xfs_fs_destroy_inode until the freeze is complete.
> > +	 * If the filesystem is read-write, inactivated inodes will queue but
> > +	 * the worker will not run until the filesystem thaws or unmounts.
> > +	 */
> > +	xfs_inodegc_stop(mp);
> 
> .... and so we end up with this flush blocked on the background work
> that assert failed and BUG()d:
> 
> [  219.511172] task:xfs_io          state:D stack:14208 pid:10238 ppid:  9089 flags:0x00004004
> [  219.513126] Call Trace:
> [  219.513768]  __schedule+0x310/0x9f0
> [  219.514628]  schedule+0x67/0xe0
> [  219.515405]  schedule_timeout+0x114/0x160
> [  219.516404]  ? _raw_spin_unlock_irqrestore+0x12/0x40
> [  219.517622]  ? do_raw_spin_unlock+0x57/0xb0
> [  219.518655]  __wait_for_common+0xc0/0x160
> [  219.519638]  ? usleep_range+0xa0/0xa0
> [  219.520545]  wait_for_completion+0x24/0x30
> [  219.521544]  flush_work+0x58/0x70
> [  219.522357]  ? flush_workqueue_prep_pwqs+0x140/0x140
> [  219.523553]  xfs_inodegc_flush+0x88/0x100
> [  219.524524]  xfs_inodegc_stop+0x28/0xb0
> [  219.525514]  xfs_fs_freeze+0x40/0x70
> [  219.526401]  freeze_super+0xd8/0x140
> [  219.527277]  do_vfs_ioctl+0x784/0x890
> [  219.528146]  __x64_sys_ioctl+0x6f/0xc0
> [  219.529062]  do_syscall_64+0x35/0x80
> [  219.529974]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> At this point, we are at SB_FREEZE_FS and the transaction system is
> shut down, so this is a hard fail.
> 
> ISTR a discussion about this in the past - I think we need to hook
> ->freeze_super() and run xfs_inodegc_stop() before we run
> freeze_super(). That way we end up just queuing pending
> inactivations while the freeze runs and completes.
> 
> The patch below does this (applied on top of your entire stack) and
> it seems to fix the 517 failure (0 failures in 50 runs vs 100% fail
> rate without the patch).

This doesn't work. g/390 does nested, racing freeze/thaw and so we
can have a start from an unfreeze racing with a stop for a freeze
about to run. IOWs, we can't stop the inodegc work until s_umount is
held and we know that there isn't another freeze in progress....

Back to the drawing board for this one.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues
  2021-08-04 11:09       ` [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues Dave Chinner
@ 2021-08-04 15:59         ` Darrick J. Wong
  2021-08-04 21:35           ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2021-08-04 15:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 09:09:16PM +1000, Dave Chinner wrote:
> On Tue, Aug 03, 2021 at 08:20:30PM -0700, Darrick J. Wong wrote:
> > For everyone else following along at home, I've posted the current draft
> > version of this whole thing in:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15
> 
> Overall looks good - fixes to freeze problems I hit are found
> in other replies to this.
> 
> I omitted the commits:
> 
> xfs: queue inodegc worker immediately when memory is tight
> xfs: throttle inode inactivation queuing on memory reclaim
> 
> in my test kernel because I think they are unnecessary.
> 
> I think the first is unnecessary because reclaim of inodes from the
> VFS is usually in large batches and so early triggers aren't
> desirable when we're getting thousands of inodes being evicted by
> the superblock shrinker at a time. If we've only got a handful of
> inodes queued, then inactivating them early isn't going to make much
> of an impact on free memory. I could be wrong, but so far I have no
> evidence that expediting inactivation is necessary.

I think this was a lot more necessary under the old design because I let
the number of tagged inodes grow quite large before triggering gc work,
much less throttling anything.  256 is low enough that it should be
manageable.
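
To make the idea concrete, here is a minimal sketch of a per-cpu
"queue it, and kick the worker once a small backlog builds up" helper.
This is not the actual xfs_inodegc_queue() implementation: the
structure, the helper name and the 256-item threshold are illustrative
assumptions, with the field names borrowed from the patches earlier in
this thread.

#include <linux/compiler.h>
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

/* Assumed shape of the per-cpu queue; fields follow the patches above. */
struct example_inodegc {
	struct llist_head	list;
	struct work_struct	work;
	unsigned int		items;	/* rough count, not exact */
};

#define EXAMPLE_INODEGC_MAX_BACKLOG	256	/* illustrative threshold */

static void
example_inodegc_queue(
	struct example_inodegc __percpu	*pcpu_gc,
	struct llist_node		*node,
	struct workqueue_struct		*wq)
{
	struct example_inodegc	*gc;
	unsigned int		items;

	/* get_cpu_ptr() disables preemption for this critical section. */
	gc = get_cpu_ptr(pcpu_gc);
	llist_add(node, &gc->list);

	/* The counter is only a scheduling hint for the worker. */
	items = READ_ONCE(gc->items) + 1;
	WRITE_ONCE(gc->items, items);
	put_cpu_ptr(gc);

	/* Only wake the per-cpu worker once a modest backlog builds up. */
	if (items >= EXAMPLE_INODEGC_MAX_BACKLOG)
		queue_work(wq, &gc->work);
}

A low threshold like this keeps the per-cpu backlog bounded without
waking the worker for every single inode; the real series also drains
these queues from other triggers (e.g. xfs_inodegc_flush()).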

Does it matter that we no longer inactivate inodes in inode number
order?  I guess it could be nice to be able to dump inode cluster
buffers as soon as practicable, but OTOH I suspect that only matters for
the case of mass deletion, in which case we'll probably catch up soon
enough?

Anyway, I'll try turning both of these off with my silly deltree
exerciser and see what happens.

> The second patch is the custom shrinker. Again, I just don't think
> this is necessary because if there is any amount of inactivation of
> evicted inodes needed due to reclaim, we'll already be triggering it
> to run via the deferred queue flush thresholds. Hence we don't
> really need any mechanism to tell us that there is memory pressure;
> the deferred work reacts to eviction from reclaim in exactly the
> same way it reacts to eviction from unlink....

Yep.  I came to the same conclusion last night; it looks like my fast
fstests setup for that passed.

> I've been running the patchset without these two patches on my 512MB
> test VM, and the only OOM kill I get from fstests is g/531. This is
> the "many open-but-unlinked" test, which creates 50,000 open
> unlinked files per CPU. So for this test VM which has 4 CPUs, that's
> 200,000 open, dirty iunlinked inodes and a lot of pinned inode
> cluster buffers. At ~2kB of memory per unlinked inode (ignoring the
> cluster buffers) this would consume about 400MB of the 512MB of RAM
> the VM has. It OOM kills the test programs that hold the open files
> long before it gets to 200,000 files, so this test never passed
> before this patchset on this machine...

Yeah... I actually tried running fstests on a 512M VM and whooeee did I
see a lot of OOM kills.  Clearly we've all gotten spoiled by cheap DRAM.

> I have a couple of extra patches to set up per-cpu hotplug
> infrastructure before the deferred inode inactivation patch - I'll
> post them after I finish this email. I'm going to leave it running
> tests overnight.

Ok, I'll jam those on the front end of the series.

> Darrick, I'm pretty happy with the way the patchset is behaving now.
> If you want to fold in the bug fixes I've posted and add in
> the hotplug patches, then I think it's ready to be posted in full
> again (if it all passes your testing) for review.

It's probably about time for that.  Now that we do percpu thingies, I
think it might also be time for a test that runs fstests while plugging
and unplugging the non-bsp processors.

[narrator: ...and thus he unleashed another terrifying bug mountain]
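
As a rough sketch of what such an exerciser could look like -- not an
existing fstests helper -- something like the following could toggle
the non-boot CPUs through the standard sysfs hotplug interface while
the test workload runs; the timing and loop structure are arbitrary.

/*
 * Rough sketch of a CPU plug/unplug exerciser to run alongside fstests.
 * Uses the standard /sys/devices/system/cpu/cpuN/online interface.
 */
#include <stdio.h>
#include <unistd.h>

static void set_cpu_online(long cpu, int online)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%ld/online", cpu);
	f = fopen(path, "w");
	if (!f)
		return;		/* cpu0 typically has no 'online' file */
	fprintf(f, "%d\n", online);
	fclose(f);
}

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_CONF);

	/* Toggle the non-boot CPUs forever while the test workload runs. */
	for (;;) {
		for (long cpu = 1; cpu < ncpus; cpu++) {
			set_cpu_online(cpu, 0);
			sleep(1);
			set_cpu_online(cpu, 1);
			sleep(1);
		}
	}
	return 0;
}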

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH, pre-03/20 #2] xfs: introduce all-mounts list for cpu hotplug notifications
  2021-08-04 11:50       ` [PATCH, pre-03/20 #2] xfs: introduce all-mounts list for cpu hotplug notifications Dave Chinner
@ 2021-08-04 16:06         ` Darrick J. Wong
  2021-08-04 21:17           ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2021-08-04 16:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 09:50:51PM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode inactivation and CIL tracking percpu structures are
> per-xfs_mount structures. That means when we get a CPU dead
> notification, we need to then iterate all the per-cpu structure
> instances to process them. Rather than keeping linked lists of
> per-cpu structures in each subsystem, add a list of all xfs_mounts
> that the generic xfs_cpu_dead() function will iterate and call into
> each subsystem appropriately.
> 
> This allows us to handle both per-mount and global XFS percpu state
> from xfs_cpu_dead(), and avoids the need to link subsystem
> structures that can be easily found from the xfs_mount into their
> own global lists.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_mount.h |  1 +
>  fs/xfs/xfs_super.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index c78b63fe779a..ed7064596f94 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -82,6 +82,7 @@ typedef struct xfs_mount {
>  	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
>  	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
>  	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
> +	struct list_head	m_mount_list;	/* global mount list */
>  	/*
>  	 * Optional cache of rt summary level per bitmap block with the
>  	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ffe1ecd048db..c27df85212d4 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -49,6 +49,28 @@ static struct kset *xfs_kset;		/* top-level xfs sysfs dir */
>  static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
>  #endif
>  
> +#ifdef CONFIG_HOTPLUG_CPU
> +static LIST_HEAD(xfs_mount_list);
> +static DEFINE_SPINLOCK(xfs_mount_list_lock);
> +
> +static inline void xfs_mount_list_add(struct xfs_mount *mp)
> +{
> +	spin_lock(&xfs_mount_list_lock);
> +	list_add(&mp->m_mount_list, &xfs_mount_list);
> +	spin_unlock(&xfs_mount_list_lock);
> +}
> +
> +static inline void xfs_mount_list_del(struct xfs_mount *mp)
> +{
> +	spin_lock(&xfs_mount_list_lock);
> +	list_del(&mp->m_mount_list);
> +	spin_unlock(&xfs_mount_list_lock);
> +}
> +#else /* !CONFIG_HOTPLUG_CPU */
> +static inline void xfs_mount_list_add(struct xfs_mount *mp) {}
> +static inline void xfs_mount_list_del(struct xfs_mount *mp) {}
> +#endif
> +
>  enum xfs_dax_mode {
>  	XFS_DAX_INODE = 0,
>  	XFS_DAX_ALWAYS = 1,
> @@ -988,6 +1010,7 @@ xfs_fs_put_super(
>  
>  	xfs_freesb(mp);
>  	free_percpu(mp->m_stats.xs_stats);
> +	xfs_mount_list_del(mp);
>  	xfs_destroy_percpu_counters(mp);
>  	xfs_destroy_mount_workqueues(mp);
>  	xfs_close_devices(mp);
> @@ -1359,6 +1382,8 @@ xfs_fs_fill_super(
>  	if (error)
>  		goto out_destroy_workqueues;
>  
> +	xfs_mount_list_add(mp);
> +
>  	/* Allocate stats memory before we do operations that might use it */
>  	mp->m_stats.xs_stats = alloc_percpu(struct xfsstats);
>  	if (!mp->m_stats.xs_stats) {
> @@ -1567,6 +1592,7 @@ xfs_fs_fill_super(
>   out_free_stats:
>  	free_percpu(mp->m_stats.xs_stats);
>   out_destroy_counters:
> +	xfs_mount_list_del(mp);
>  	xfs_destroy_percpu_counters(mp);
>   out_destroy_workqueues:
>  	xfs_destroy_mount_workqueues(mp);
> @@ -2061,10 +2087,20 @@ xfs_destroy_workqueues(void)
>  	destroy_workqueue(xfs_alloc_wq);
>  }
>  
> +#ifdef CONFIG_HOTPLUG_CPU
>  static int
>  xfs_cpu_dead(
>  	unsigned int		cpu)
>  {
> +	struct xfs_mount	*mp, *n;
> +
> +	spin_lock(&xfs_mount_list_lock);
> +	list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
> +		spin_unlock(&xfs_mount_list_lock);
> +		/* xfs_subsys_dead(mp, cpu); */
> +		spin_lock(&xfs_mount_list_lock);
> +	}
> +	spin_unlock(&xfs_mount_list_lock);
>  	return 0;
>  }
>  
> @@ -2090,6 +2126,11 @@ xfs_cpu_hotplug_destroy(void)
>  	cpuhp_remove_state_nocalls(CPUHP_XFS_DEAD);
>  }
>  
> +#else /* !CONFIG_HOTPLUG_CPU */
> +static inline int xfs_cpu_hotplug_init(struct xfs_cil *cil) { return 0; }
> +static inline void xfs_cpu_hotplug_destroy(struct xfs_cil *cil) {}

void arguments here, right?

> +#endif

Nit: I think this ifdef stuff belongs in the previous patch.  Will fix
it when I drag this into my tree.

--D

> +
>  STATIC int __init
>  init_xfs_fs(void)
>  {

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH, post-03/20 1/1] xfs: hook up inodegc to CPU dead notification
  2021-08-04 11:52       ` [PATCH, post-03/20 1/1] xfs: hook up inodegc to CPU dead notification Dave Chinner
@ 2021-08-04 16:19         ` Darrick J. Wong
  2021-08-04 21:48           ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2021-08-04 16:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 09:52:25PM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> So we don't leave queued inodes on a CPU we won't ever flush.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_icache.h |  1 +
>  fs/xfs/xfs_super.c  |  2 +-
>  3 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index f772f2a67a8b..9e2c95903c68 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1966,6 +1966,42 @@ xfs_inodegc_start(
>  	}
>  }
>  
> +/*
> > + * Fold the dead CPU inodegc queue into the current CPU's queue.
> + */
> +void
> +xfs_inodegc_cpu_dead(
> +	struct xfs_mount	*mp,
> +	int			dead_cpu)

unsigned int, since that's the caller's type.

> +{
> +	struct xfs_inodegc	*dead_gc, *gc;
> +	struct llist_node	*first, *last;
> +	int			count = 0;
> +
> +	dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
> +	cancel_work_sync(&dead_gc->work);
> +
> +	if (llist_empty(&dead_gc->list))
> +		return;
> +
> +	first = dead_gc->list.first;
> +	last = first;
> +	while (last->next) {
> +		last = last->next;
> +		count++;
> +	}
> +	dead_gc->list.first = NULL;
> +	dead_gc->items = 0;
> +
> +	/* Add pending work to current CPU */
> +	gc = get_cpu_ptr(mp->m_inodegc);
> +	llist_add_batch(first, last, &gc->list);
> +	count += READ_ONCE(gc->items);
> +	WRITE_ONCE(gc->items, count);

I was wondering about the READ/WRITE_ONCE pattern for gc->items: it's
meant to be an accurate count of the list items, right?  But there's no
hard synchronization (e.g. spinlock) around them, which means that the
only CPU that can access that variable at all is the one that the percpu
structure belongs to, right?  And I think that's ok here, because the
only accessors are _queue() and _worker(), which both are supposed to
run on the same CPU since they're percpu lists, right?

In which case: why can't we just say count = dead_gc->items;?  @dead_cpu
is being offlined, which implies that nothing will get scheduled on it,
right?

> +	put_cpu_ptr(gc);
> +	queue_work(mp->m_inodegc_wq, &gc->work);

Should this be thresholded like we do for _inodegc_queue?

In the old days I would have imagined that cpu offlining should be rare
enough <cough> that it probably doesn't make any real difference.  OTOH
my cloudic colleague reminds me that they aggressively offline cpus to
reduce licensing cost(!).

--D

> +}
> +
>  #ifdef CONFIG_XFS_RT
>  static inline bool
>  xfs_inodegc_want_queue_rt_file(
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index bdf2a8d3fdd5..853d5bfc0cfb 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -79,5 +79,6 @@ void xfs_inodegc_worker(struct work_struct *work);
>  void xfs_inodegc_flush(struct xfs_mount *mp);
>  void xfs_inodegc_stop(struct xfs_mount *mp);
>  void xfs_inodegc_start(struct xfs_mount *mp);
> +void xfs_inodegc_cpu_dead(struct xfs_mount *mp, int cpu);
>  
>  #endif
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index c251679e8514..f579ec49eb7a 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -2187,7 +2187,7 @@ xfs_cpu_dead(
>  	spin_lock(&xfs_mount_list_lock);
>  	list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
>  		spin_unlock(&xfs_mount_list_lock);
> -		/* xfs_subsys_dead(mp, cpu); */
> +		xfs_inodegc_cpu_dead(mp, cpu);
>  		spin_lock(&xfs_mount_list_lock);
>  	}
>  	spin_unlock(&xfs_mount_list_lock);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] xfs: don't run inodegc flushes when inodegc is not active
  2021-08-04 10:46       ` [PATCH] xfs: don't run inodegc flushes when inodegc is not active Dave Chinner
@ 2021-08-04 16:20         ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-08-04 16:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 08:46:16PM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> A flush trigger on a frozen filesystem (e.g. from statfs)
> will run queued inactivations and assert fail like this:
> 
> XFS: Assertion failed: mp->m_super->s_writers.frozen < SB_FREEZE_FS, file: fs/xfs/xfs_icache.c, line: 1861
> 
> Bug exposed by xfs/011.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Odd that I didn't see any of these problems in my overnight tests,
but the reasoning looks solid.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_icache.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 92006260fe90..f772f2a67a8b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1893,8 +1893,8 @@ xfs_inodegc_worker(
>   * wait for the work to finish. Two pass - queue all the work first pass, wait
>   * for it in a second pass.
>   */
> -void
> -xfs_inodegc_flush(
> +static void
> +__xfs_inodegc_flush(
>  	struct xfs_mount	*mp)
>  {
>  	struct xfs_inodegc	*gc;
> @@ -1913,6 +1913,14 @@ xfs_inodegc_flush(
>  	}
>  }
>  
> +void
> +xfs_inodegc_flush(
> +	struct xfs_mount	*mp)
> +{
> +	if (xfs_is_inodegc_enabled(mp))
> +		__xfs_inodegc_flush(mp);
> +}
> +
>  /*
>   * Flush all the pending work and then disable the inode inactivation background
>   * workers and wait for them to stop.
> @@ -1927,7 +1935,7 @@ xfs_inodegc_stop(
>  	if (!xfs_clear_inodegc_enabled(mp))
>  		return;
>  
> -	xfs_inodegc_flush(mp);
> +	__xfs_inodegc_flush(mp);
>  
>  	for_each_online_cpu(cpu) {
>  		gc = per_cpu_ptr(mp->m_inodegc, cpu);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH, pre-03/20 #2] xfs: introduce all-mounts list for cpu hotplug notifications
  2021-08-04 16:06         ` Darrick J. Wong
@ 2021-08-04 21:17           ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 21:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 09:06:01AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 04, 2021 at 09:50:51PM +1000, Dave Chinner wrote:
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The inode inactivation and CIL tracking percpu structures are
> > per-xfs_mount structures. That means when we get a CPU dead
> > notification, we need to then iterate all the per-cpu structure
> > instances to process them. Rather than keeping linked lists of
> > per-cpu structures in each subsystem, add a list of all xfs_mounts
> > that the generic xfs_cpu_dead() function will iterate and call into
> > each subsystem appropriately.
> > 
> > This allows us to handle both per-mount and global XFS percpu state
> > from xfs_cpu_dead(), and avoids the need to link subsystem
> > structures that can be easily found from the xfs_mount into their
> > own global lists.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
....
> > @@ -2090,6 +2126,11 @@ xfs_cpu_hotplug_destroy(void)
> >  	cpuhp_remove_state_nocalls(CPUHP_XFS_DEAD);
> >  }
> >  
> > +#else /* !CONFIG_HOTPLUG_CPU */
> > +static inline int xfs_cpu_hotplug_init(struct xfs_cil *cil) { return 0; }
> > +static inline void xfs_cpu_hotplug_destroy(struct xfs_cil *cil) {}
> 
> void arguments here, right?

Ah, yeah, most likely.

> > +#endif
> 
> Nit: I think this ifdef stuff belongs in the previous patch.  Will fix
> it when I drag this into my tree.

I didn't have them in the previous patch because when
CONFIG_HOTPLUG_CPU=n the cpuhotplug functions are stubbed out and
the compiler elides it all as they collapse down to functions that are
just "return 0". It's not until the mount list appears that there is
something we need to elide from the source ourselves...

<shrug>

Doesn't worry me either way.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues
  2021-08-04 15:59         ` Darrick J. Wong
@ 2021-08-04 21:35           ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 21:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 08:59:52AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 04, 2021 at 09:09:16PM +1000, Dave Chinner wrote:
> > On Tue, Aug 03, 2021 at 08:20:30PM -0700, Darrick J. Wong wrote:
> > > For everyone else following along at home, I've posted the current draft
> > > version of this whole thing in:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15
> > 
> > Overall looks good - fixes to freeze problems I hit are found
> > in other replies to this.
> > 
> > I omitted the commits:
> > 
> > xfs: queue inodegc worker immediately when memory is tight
> > xfs: throttle inode inactivation queuing on memory reclaim
> > 
> > in my test kernel because I think they are unnecessary.
> > 
> > I think the first is unnecessary because reclaim of inodes from the
> > VFS is usually in large batches and so early triggers aren't
> > desirable when we're getting thousands of inodes being evicted by
> > the superblock shrinker at a time. If we've only got a handful of
> > inodes queued, then inactivating them early isn't going to make much
> > of an impact on free memory. I could be wrong, but so far I have no
> > evidence that expediting inactivation is necessary.
> 
> I think this was a lot more necessary under the old design because I let
> the number of tagged inodes grow quite large before triggering gc work,
> much less throttling anything.  256 is low enough that it should be
> manageable.

But the next patch in the series prevents the shrinkers from
blocking on the hard throttle, yes? So the hard throttle on queueing
isn't something memory reclaim relies on, either. What will have an
impact is the cond_resched() we place in shrinker execution.
Whenever the VFS inode reclaim eviction list processing hits one of
those and there is queued deferred work, it will switch away from
the shrinker to run inactivation on that CPU.

So, in reality, we are still throttling and blocking direct reclaim
with the deferred processing. It's just that we are doing it
implicitly at defined reschedule points in the shrinker rather than
doing it directly inline by blocking during a modification
transaction. This also means that if inactivation does block, the
reclaim process can keep running and queuing/reclaiming more ex-VFS
inodes. IOWs, running inactivation like this should help improve
reclaim behaviour and reduce reclaim scan latencies without having
reclaim run out of control...

> Does it matter that we no longer inactivate inodes in inode number
> order?  I guess it could be nice to be able to dump inode cluster
> buffers as soon as practicable, but OTOH I suspect that only matters for
> the case of mass deletion, in which case we'll probably catch up soon
> enough?
> 
> Anyway, I'll try turning both of these off with my silly deltree
> exerciser and see what happens.

I haven't seen anything that makes it necessary so in the absence of
simplifying this as much as possible, I want to remove this stuff.
We can always add it back in (easily) if something turns up and we
find this is the cause.

> > The second patch is the custom shrinker. Again, I just don't think
> > this is necessary because if there is any amount of inactivation of
> > evicted inodes needed due to reclaim, we'll already be triggering it
> > to run via the deferred queue flush thresholds. Hence we don't
> > really need any mechanism to tell us that there is memory pressure;
> > the deferred work reacts to eviction from reclaim in exactly the
> > same way it reacts to eviction from unlink....
> 
> Yep.  I came to the same conclusion last night; it looks like my fast
> fstests setup for that passed.
> 
> > I've been running the patchset without these two patches on my 512MB
> > test VM, and the only OOM kill I get from fstests is g/531. This is
> > the "many open-but-unlinked" test, which creates 50,000 open
> > unlinked files per CPU. So for this test VM which has 4 CPUs, that's
> > 200,000 open, dirty iunlinked inodes and a lot of pinned inode
> > cluster buffers. At ~2kB of memory per unlinked inode (ignoring the
> > cluster buffers) this would consume about 400MB of the 512MB of RAM
> > the VM has. It OOM kills the test programs that hold the open files
> > long before it gets to 200,000 files, so this test never passed
> > before this patchset on this machine...
> 
> Yeah... I actually tried running fstests on a 512M VM and whooeee did I
> see a lot of OOM kills.  Clearly we've all gotten spoiled by cheap DRAM.

fstests does a lot of stuff that requires memory to complete. The
filesystem itself will run in much less RAM, but it's things like
needing to cache hundreds of MB of inodes when you don't have
hundreds of MB of RAM that cause the problems.

I will note that g/531 does try to limit the number of open files
based on /proc/sys/fs/file-max, but as we found out last night on
#xfs, systemd now unconditionally sets that to 2^63 - 1, which breaks
any attempt to size the fileset based on the kernel's RAM-size-based
default file-max setting...

> > I have a couple of extra patches to set up per-cpu hotplug
> > infrastructure before the deferred inode inactivation patch - I'll
> > post them after I finish this email. I'm going to leave it running
> > tests overnight.
> 
> Ok, I'll jam those on the front end of the series.
> 
> > Darrick, I'm pretty happy with the way the patchset is behaving now.
> > If you want to fold in the bug fixes I've posted and add in
> > the hotplug patches, then I think it's ready to be posted in full
> > again (if it all passes your testing) for review.
> 
> It's probably about time for that.  Now that we do percpu thingies, I
> think it might also be time for a test that runs fstests while plugging
> and unplugging the non-bsp processors.

Yeah, I haven't tested the CPU dead notification much at all. It
should work, but...

> [narrator: ...and thus he unleashed another terrifying bug mountain]

... yeah, this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH, post-03/20 1/1] xfs: hook up inodegc to CPU dead notification
  2021-08-04 16:19         ` Darrick J. Wong
@ 2021-08-04 21:48           ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2021-08-04 21:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Aug 04, 2021 at 09:19:18AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 04, 2021 at 09:52:25PM +1000, Dave Chinner wrote:
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > So we don't leave queued inodes on a CPU we won't ever flush.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_icache.h |  1 +
> >  fs/xfs/xfs_super.c  |  2 +-
> >  3 files changed, 38 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index f772f2a67a8b..9e2c95903c68 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1966,6 +1966,42 @@ xfs_inodegc_start(
> >  	}
> >  }
> >  
> > +/*
> > + * Fold the dead CPU inodegc queue into the current CPU's queue.
> > + */
> > +void
> > +xfs_inodegc_cpu_dead(
> > +	struct xfs_mount	*mp,
> > +	int			dead_cpu)
> 
> unsigned int, since that's the caller's type.

*nod*

> > +{
> > +	struct xfs_inodegc	*dead_gc, *gc;
> > +	struct llist_node	*first, *last;
> > +	int			count = 0;
> > +
> > +	dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
> > +	cancel_work_sync(&dead_gc->work);
> > +
> > +	if (llist_empty(&dead_gc->list))
> > +		return;
> > +
> > +	first = dead_gc->list.first;
> > +	last = first;
> > +	while (last->next) {
> > +		last = last->next;
> > +		count++;
> > +	}
> > +	dead_gc->list.first = NULL;
> > +	dead_gc->items = 0;
> > +
> > +	/* Add pending work to current CPU */
> > +	gc = get_cpu_ptr(mp->m_inodegc);
> > +	llist_add_batch(first, last, &gc->list);
> > +	count += READ_ONCE(gc->items);
> > +	WRITE_ONCE(gc->items, count);
> 
> I was wondering about the READ/WRITE_ONCE pattern for gc->items: it's
> meant to be an accurate count of the list items, right?  But there's no
> hard synchronization (e.g. spinlock) around them, which means that the
> only CPU that can access that variable at all is the one that the percpu
> structure belongs to, right?  And I think that's ok here, because the
> only accessors are _queue() and _worker(), which both are supposed to
> run on the same CPU since they're percpu lists, right?

For items that are per-cpu, we only need to guarantee that the
normal case is access by that CPU only and that dependent accesses
within an algorithm occur within a preempt-disabled region. The use
of get_cpu_ptr()/put_cpu_ptr() creates a critical region where
preemption is disabled on that CPU. Hence we can read, modify and
write a per-cpu variable without locking, knowing that nothing else
will be attempting to run the same modification at the same time on
a different CPU, because they will be accessing percpu stuff local
to that CPU, not this one.

The reason for using READ_ONCE/WRITE_ONCE is largely to ensure that
we fetch and store the variable appropriately, as the work that
zeros the count can sometimes run on a different CPU. It also keeps
the kernel's data race detector (KCSAN) quiet...

As it is, the count of items is rough, and doesn't need to be
accurate. If we race with a zeroing of the count, we'll set the
count to be higher (as if the zeroing didn't occur) and that just
causes the work to be rescheduled sooner than it otherwise would. A
race with zeroing is not the end of the world...
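
To make those access rules concrete, here is a minimal sketch of the
consumer side of the pattern. It is not the real xfs_inodegc_worker()
code: the structure layout is assumed from the patches earlier in this
thread, and the actual inode inactivation step is elided.

/*
 * Illustrative consumer-side sketch of the per-cpu queue pattern; not
 * the real xfs_inodegc_worker().
 */
#include <linux/compiler.h>
#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/workqueue.h>

/* Same assumed layout as the earlier producer-side sketch. */
struct example_inodegc {
	struct llist_head	list;
	struct work_struct	work;
	unsigned int		items;
};

static void example_inodegc_worker(struct work_struct *work)
{
	struct example_inodegc	*gc =
		container_of(work, struct example_inodegc, work);
	struct llist_node	*node;

	/*
	 * Zeroing the count can race with a producer bumping it (e.g.
	 * the CPU-dead merge above). The count is only a scheduling
	 * hint, so losing that race just means the work gets kicked
	 * again sooner than strictly necessary.
	 */
	WRITE_ONCE(gc->items, 0);
	node = llist_del_all(&gc->list);

	while (node) {
		struct llist_node *next = node->next;

		/* ... inactivate the inode this node is embedded in ... */
		node = next;
	}
}

The only synchronization is per-cpu ownership plus READ_ONCE/WRITE_ONCE
on the rough counter; the race with zeroing described above is
tolerated rather than locked away.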

> In which case: why can't we just say count = dead_gc->items;?  @dead_cpu
> is being offlined, which implies that nothing will get scheduled on it,
> right?

The local CPU might already have items queued, so the count should
include them, too.

> > +	put_cpu_ptr(gc);
> > +	queue_work(mp->m_inodegc_wq, &gc->work);
> 
> Should this be thresholded like we do for _inodegc_queue?

I thought about that, then thought "this is slow path stuff, we just
want to clear out the backlog so we don't care about batching.."

> In the old days I would have imagined that cpu offlining should be rare
> enough <cough> that it probably doesn't make any real difference.  OTOH
> my cloudic colleague reminds me that they aggressively offline cpus to
> reduce licensing cost(!).

Yeah, CPU hotplug is very rare, except in those rare environments
where it is very common....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2021-08-04 21:49 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-29 18:43 [PATCHSET v8 00/20] xfs: deferred inode inactivation Darrick J. Wong
2021-07-29 18:43 ` [PATCH 01/20] xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Darrick J. Wong
2021-07-29 18:44 ` [PATCH 02/20] xfs: detach dquots from inode if we don't need to inactivate it Darrick J. Wong
2021-07-29 18:44 ` [PATCH 03/20] xfs: defer inode inactivation to a workqueue Darrick J. Wong
2021-07-30  4:24   ` Dave Chinner
2021-07-31  4:21     ` Darrick J. Wong
2021-08-01 21:49       ` Dave Chinner
2021-08-01 23:47         ` Dave Chinner
2021-08-03  8:34   ` [PATCH, alternative] xfs: per-cpu deferred inode inactivation queues Dave Chinner
2021-08-03 20:20     ` Darrick J. Wong
2021-08-04  3:20     ` [PATCH, alternative v2] " Darrick J. Wong
2021-08-04 10:03       ` [PATCH] xfs: inodegc needs to stop before freeze Dave Chinner
2021-08-04 12:37         ` Dave Chinner
2021-08-04 10:46       ` [PATCH] xfs: don't run inodegc flushes when inodegc is not active Dave Chinner
2021-08-04 16:20         ` Darrick J. Wong
2021-08-04 11:09       ` [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues Dave Chinner
2021-08-04 15:59         ` Darrick J. Wong
2021-08-04 21:35           ` Dave Chinner
2021-08-04 11:49       ` [PATCH, pre-03/20 #1] xfs: introduce CPU hotplug infrastructure Dave Chinner
2021-08-04 11:50       ` [PATCH, pre-03/20 #2] xfs: introduce all-mounts list for cpu hotplug notifications Dave Chinner
2021-08-04 16:06         ` Darrick J. Wong
2021-08-04 21:17           ` Dave Chinner
2021-08-04 11:52       ` [PATCH, post-03/20 1/1] xfs: hook up inodegc to CPU dead notification Dave Chinner
2021-08-04 16:19         ` Darrick J. Wong
2021-08-04 21:48           ` Dave Chinner
2021-07-29 18:44 ` [PATCH 04/20] xfs: throttle inode inactivation queuing on memory reclaim Darrick J. Wong
2021-07-29 18:44 ` [PATCH 05/20] xfs: don't throttle memory reclaim trying to queue inactive inodes Darrick J. Wong
2021-07-29 18:44 ` [PATCH 06/20] xfs: throttle inodegc queuing on backlog Darrick J. Wong
2021-08-02  0:45   ` Dave Chinner
2021-08-02  1:30     ` Dave Chinner
2021-07-29 18:44 ` [PATCH 07/20] xfs: queue inodegc worker immediately when memory is tight Darrick J. Wong
2021-07-29 18:44 ` [PATCH 08/20] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
2021-07-29 18:44 ` [PATCH 09/20] xfs: reduce inactivation delay when free space is tight Darrick J. Wong
2021-07-29 18:44 ` [PATCH 10/20] xfs: reduce inactivation delay when quota are tight Darrick J. Wong
2021-07-29 18:44 ` [PATCH 11/20] xfs: reduce inactivation delay when realtime extents " Darrick J. Wong
2021-07-29 18:44 ` [PATCH 12/20] xfs: inactivate inodes any time we try to free speculative preallocations Darrick J. Wong
2021-07-29 18:45 ` [PATCH 13/20] xfs: flush inode inactivation work when compiling usage statistics Darrick J. Wong
2021-07-29 18:45 ` [PATCH 14/20] xfs: parallelize inode inactivation Darrick J. Wong
2021-08-02  0:55   ` Dave Chinner
2021-08-02 21:33     ` Darrick J. Wong
2021-07-29 18:45 ` [PATCH 15/20] xfs: reduce inactivation delay when AG free space are tight Darrick J. Wong
2021-07-29 18:45 ` [PATCH 16/20] xfs: queue inodegc worker immediately on backlog Darrick J. Wong
2021-07-29 18:45 ` [PATCH 17/20] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
2021-07-29 18:45 ` [PATCH 18/20] xfs: scale speculative preallocation gc delay based on free space Darrick J. Wong
2021-07-29 18:45 ` [PATCH 19/20] xfs: use background worker pool when transactions can't get " Darrick J. Wong
2021-07-29 18:45 ` [PATCH 20/20] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
2021-08-02 10:35 ` [PATCHSET v8 00/20] xfs: deferred inode inactivation Dave Chinner
