All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET v6 0/9] xfs: deferred inode inactivation
@ 2021-06-07 22:24 Darrick J. Wong
  2021-06-07 22:24 ` [PATCH 1/9] xfs: refactor the inode recycling code Darrick J. Wong
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:24 UTC (permalink / raw)
  To: djwong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs, david, hch

Hi all,

This patch series implements deferred inode inactivation.  Inactivation
is what happens when an open file loses its last incore reference: if
the file has speculative preallocations, they must be freed, and if the
file is unlinked, all forks must be truncated, and the inode marked
freed in the inode chunk and the inode btrees.

Currently, all of this activity is performed in frontend threads when
the last in-memory reference is lost and/or the vfs decides to drop the
inode.  Three complaints stem from this behavior: first, that the time
to unlink (in the worst case) depends on both the complexity of the
directory as well as the the number of extents in that file; second,
that deleting a directory tree is inefficient and seeky because we free
the inodes in readdir order, not disk order; and third, the upcoming
online repair feature needs to be able to xfs_irele while scanning a
filesystem in transaction context.  It cannot perform inode inactivation
in this context because xfs does not support nested transactions.

The implementation will be familiar to those who have studied how XFS
scans for reclaimable in-core inodes -- we create a couple more inode
state flags to mark an inode as needing inactivation and being in the
middle of inactivation.  When inodes need inactivation, we set
NEED_INACTIVE in iflags, set the INACTIVE radix tree tag, and schedule a
deferred work item.  The deferred worker runs in an unbounded workqueue,
scanning the inode radix tree for tagged inodes to inactivate, and
performing all the on-disk metadata updates.  Once the inode has been
inactivated, it is left in the reclaim state and the background reclaim
worker (or direct reclaim) will get to it eventually.

Doing the inactivations from kernel threads solves the first problem by
constraining the amount of work done by the unlink() call to removing
the directory entry.  It solves the third problem by moving inactivation
to a separate process.  Because the inactivations are done in order of
inode number, we solve the second problem by performing updates in (we
hope) disk order.  This also decreases the amount of time it takes to
let go of an inode cluster if we're deleting entire directory trees.

There are three big warts I can think of in this series: first, because
the actual freeing of nlink==0 inodes is now done in the background,
this means that the system will be busy making metadata updates for some
time after the unlink() call returns.  This temporarily reduces
available iops.  Second, in order to retain the behavior that deleting
100TB of unshared data should result in a free space gain of 100TB, the
statvfs and quota reporting ioctls wait for inactivation to finish,
which increases the long tail latency of those calls.  This behavior is,
unfortunately, key to not introducing regressions in fstests.  The third
problem is that the deferrals keep memory usage higher for longer,
reduce opportunities to throttle the frontend when metadata load is
heavy, and the unbounded workqueues can create transaction storms.

For v5 there are some serious changes against the older versions of this
patchset -- we no longer cycle an inode's dquots to avoid fights with
quotaoff, and we actually shut down the background gc threads when the
filesystem is frozen.

v1-v2: NYE patchbombs
v3: rebase against 5.12-rc2 for submission.
v4: combine the can/has eofblocks predicates, clean up incore inode tree
    walks, fix inobt deadlock
v5: actually freeze the inode gc threads when we freeze the filesystem,
    consolidate the code that deals with inode tagging, and use
    foreground inactivation during quotaoff to avoid cycling dquots
v6: rebase to 5.13-rc4, fix quotaoff not to require foreground inactivation,
    refactor to use inode walk goals, use atomic bitflags to control the
    scheduling of gc workers

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.14
---
 Documentation/admin-guide/xfs.rst |   10 +
 fs/xfs/libxfs/xfs_ag.c            |    3 
 fs/xfs/libxfs/xfs_ag.h            |    3 
 fs/xfs/scrub/common.c             |    2 
 fs/xfs/xfs_bmap_util.c            |   43 +++
 fs/xfs/xfs_fsops.c                |    4 
 fs/xfs/xfs_globals.c              |    3 
 fs/xfs/xfs_icache.c               |  596 ++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_icache.h               |   37 ++
 fs/xfs/xfs_inode.c                |   60 +++-
 fs/xfs/xfs_inode.h                |   15 +
 fs/xfs/xfs_itable.c               |   42 ++-
 fs/xfs/xfs_iwalk.c                |   33 ++
 fs/xfs/xfs_linux.h                |    2 
 fs/xfs/xfs_log_recover.c          |    7 
 fs/xfs/xfs_mount.c                |   30 ++
 fs/xfs/xfs_mount.h                |   16 +
 fs/xfs/xfs_qm_syscalls.c          |    4 
 fs/xfs/xfs_super.c                |  130 +++++++-
 fs/xfs/xfs_sysctl.c               |    9 +
 fs/xfs/xfs_sysctl.h               |    1 
 fs/xfs/xfs_trace.h                |   14 +
 22 files changed, 955 insertions(+), 109 deletions(-)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/9] xfs: refactor the inode recycling code
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-06-07 22:24 ` Darrick J. Wong
  2021-06-07 22:59   ` Dave Chinner
  2021-06-07 22:25 ` [PATCH 2/9] xfs: deferred inode inactivation Darrick J. Wong
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:24 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Hoist the code in xfs_iget_cache_hit that restores the VFS inode state
to an xfs_inode that was previously vfs-destroyed.  The next patch will
add a new set of state flags, so we need the helper to avoid
duplication.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |  139 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 81 insertions(+), 58 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4e4682879bbd..4d4aa61fbd34 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -350,19 +350,19 @@ xfs_inew_wait(
  * need to retain across reinitialisation, and rewrite them into the VFS inode
  * after reinitialisation even if it fails.
  */
-static int
+static inline int
 xfs_reinit_inode(
 	struct xfs_mount	*mp,
 	struct inode		*inode)
 {
-	int		error;
-	uint32_t	nlink = inode->i_nlink;
-	uint32_t	generation = inode->i_generation;
-	uint64_t	version = inode_peek_iversion(inode);
-	umode_t		mode = inode->i_mode;
-	dev_t		dev = inode->i_rdev;
-	kuid_t		uid = inode->i_uid;
-	kgid_t		gid = inode->i_gid;
+	int			error;
+	uint32_t		nlink = inode->i_nlink;
+	uint32_t		generation = inode->i_generation;
+	uint64_t		version = inode_peek_iversion(inode);
+	umode_t			mode = inode->i_mode;
+	dev_t			dev = inode->i_rdev;
+	kuid_t			uid = inode->i_uid;
+	kgid_t			gid = inode->i_gid;
 
 	error = inode_init_always(mp->m_super, inode);
 
@@ -376,6 +376,70 @@ xfs_reinit_inode(
 	return error;
 }
 
+/*
+ * Carefully nudge an inode whose VFS state has been torn down back into a
+ * usable state.  Drops the i_flags_lock and the rcu read lock.
+ */
+static int
+xfs_iget_recycle(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip) __releases(&ip->i_flags_lock)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct inode		*inode = VFS_I(ip);
+	int			error;
+
+	/*
+	 * We need to make it look like the inode is being reclaimed to prevent
+	 * the actual reclaim workers from stomping over us while we recycle
+	 * the inode.  We can't clear the radix tree tag yet as it requires
+	 * pag_ici_lock to be held exclusive.
+	 */
+	ip->i_flags |= XFS_IRECLAIM;
+
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
+
+	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
+	error = xfs_reinit_inode(mp, inode);
+	if (error) {
+		bool	wake;
+
+		/*
+		 * Re-initializing the inode failed, and we are in deep
+		 * trouble.  Try to re-add it to the reclaim list.
+		 */
+		rcu_read_lock();
+		spin_lock(&ip->i_flags_lock);
+		wake = !!__xfs_iflags_test(ip, XFS_INEW);
+		ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
+		if (wake)
+			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
+		ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
+		spin_unlock(&ip->i_flags_lock);
+		rcu_read_unlock();
+		return error;
+	}
+
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	/*
+	 * Clear the per-lifetime state in the inode as we are now effectively
+	 * a new inode and need to return to the initial state before reuse
+	 * occurs.
+	 */
+	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
+	ip->i_flags |= XFS_INEW;
+	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
+			XFS_ICI_RECLAIM_TAG);
+	inode->i_state = I_NEW;
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+
+	return 0;
+}
+
 /*
  * If we are allocating a new inode, then check what was returned is
  * actually a free, empty inode. If we are not allocating an inode,
@@ -450,7 +514,7 @@ xfs_iget_cache_hit(
 	/*
 	 * If we are racing with another cache hit that is currently
 	 * instantiating this inode or currently recycling it out of
-	 * reclaimabe state, wait for the initialisation to complete
+	 * reclaimable state, wait for the initialisation to complete
 	 * before continuing.
 	 *
 	 * XXX(hch): eventually we should do something equivalent to
@@ -472,11 +536,11 @@ xfs_iget_cache_hit(
 	if (error)
 		goto out_error;
 
-	/*
-	 * If IRECLAIMABLE is set, we've torn down the VFS inode already.
-	 * Need to carefully get it back into useable state.
-	 */
 	if (ip->i_flags & XFS_IRECLAIMABLE) {
+		/*
+		 * If IRECLAIMABLE is set, we've torn down the VFS inode
+		 * already, and must carefully restore it to usable state.
+		 */
 		trace_xfs_iget_reclaim(ip);
 
 		if (flags & XFS_IGET_INCORE) {
@@ -484,52 +548,12 @@ xfs_iget_cache_hit(
 			goto out_error;
 		}
 
-		/*
-		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
-		 * from stomping over us while we recycle the inode.  We can't
-		 * clear the radix tree reclaimable tag yet as it requires
-		 * pag_ici_lock to be held exclusive.
-		 */
-		ip->i_flags |= XFS_IRECLAIM;
-
-		spin_unlock(&ip->i_flags_lock);
-		rcu_read_unlock();
-
-		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-		error = xfs_reinit_inode(mp, inode);
+		/* Drops i_flags_lock and RCU read lock. */
+		error = xfs_iget_recycle(pag, ip);
 		if (error) {
-			bool wake;
-			/*
-			 * Re-initializing the inode failed, and we are in deep
-			 * trouble.  Try to re-add it to the reclaim list.
-			 */
-			rcu_read_lock();
-			spin_lock(&ip->i_flags_lock);
-			wake = !!__xfs_iflags_test(ip, XFS_INEW);
-			ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
-			if (wake)
-				wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
-			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
 			trace_xfs_iget_reclaim_fail(ip);
-			goto out_error;
+			return error;
 		}
-
-		spin_lock(&pag->pag_ici_lock);
-		spin_lock(&ip->i_flags_lock);
-
-		/*
-		 * Clear the per-lifetime state in the inode as we are now
-		 * effectively a new inode and need to return to the initial
-		 * state before reuse occurs.
-		 */
-		ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
-		ip->i_flags |= XFS_INEW;
-		xfs_perag_clear_inode_tag(pag,
-				XFS_INO_TO_AGINO(pag->pag_mount, ino),
-				XFS_ICI_RECLAIM_TAG);
-		inode->i_state = I_NEW;
-		spin_unlock(&ip->i_flags_lock);
-		spin_unlock(&pag->pag_ici_lock);
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -559,7 +583,6 @@ xfs_iget_cache_hit(
 	return error;
 }
 
-
 static int
 xfs_iget_cache_miss(
 	struct xfs_mount	*mp,


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/9] xfs: deferred inode inactivation
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
  2021-06-07 22:24 ` [PATCH 1/9] xfs: refactor the inode recycling code Darrick J. Wong
@ 2021-06-07 22:25 ` Darrick J. Wong
  2021-06-08  0:57   ` Dave Chinner
  2021-06-07 22:25 ` [PATCH 3/9] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:25 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
defer the inactivation phase to a separate workqueue.  With this we
avoid blocking memory reclaim on filesystem metadata updates that are
necessary to free an in-core inode, such as post-eof block freeing, COW
staging extent freeing, and truncating and freeing unlinked inodes.  Now
that work is deferred to a workqueue where we can do the freeing in
batches.

We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
The first flag helps our worker find inodes needing inactivation, and
the second flag marks inodes that are in the process of being
inactivated.  A concurrent xfs_iget on the inode can still resurrect the
inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).

Unfortunately, deferring the inactivation has one huge downside --
eventual consistency.  Since all the freeing is deferred to a worker
thread, one can rm a file but the space doesn't come back immediately.
This can cause some odd side effects with quota accounting and statfs,
so we also force inactivation scans in order to maintain the existing
behaviors, at least outwardly.

For this patch we'll set the delay to zero to mimic the old timing as
much as possible; in the next patch we'll play with different delay
settings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/admin-guide/xfs.rst |    3 
 fs/xfs/scrub/common.c             |    2 
 fs/xfs/xfs_fsops.c                |    4 
 fs/xfs/xfs_icache.c               |  364 +++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_icache.h               |   35 +++-
 fs/xfs/xfs_inode.c                |   60 ++++++
 fs/xfs/xfs_inode.h                |   15 +-
 fs/xfs/xfs_log_recover.c          |    7 +
 fs/xfs/xfs_mount.c                |   29 +++
 fs/xfs/xfs_mount.h                |    7 +
 fs/xfs/xfs_qm_syscalls.c          |    4 
 fs/xfs/xfs_super.c                |  120 +++++++++++-
 fs/xfs/xfs_trace.h                |   14 +
 13 files changed, 620 insertions(+), 44 deletions(-)


diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index 8de008c0c5ad..f9b109bfc6a6 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -524,7 +524,8 @@ and the short name of the data device.  They all can be found in:
                   mount time quotacheck.
   xfs-gc          Background garbage collection of disk space that have been
                   speculatively allocated beyond EOF or for staging copy on
-                  write operations.
+                  write operations; and files that are no longer linked into
+                  the directory tree.
 ================  ===========
 
 For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index cadfd5799909..93e63407c284 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -884,6 +884,7 @@ xchk_stop_reaping(
 {
 	sc->flags |= XCHK_REAPING_DISABLED;
 	xfs_blockgc_stop(sc->mp);
+	xfs_inodegc_stop(sc->mp);
 }
 
 /* Restart background reaping of resources. */
@@ -891,6 +892,7 @@ void
 xchk_start_reaping(
 	struct xfs_scrub	*sc)
 {
+	xfs_inodegc_start(sc->mp);
 	xfs_blockgc_start(sc->mp);
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 07c745cd483e..8e92402c685b 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -19,6 +19,8 @@
 #include "xfs_log.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
 
 /*
  * Write new AG headers to disk. Non-transactional, but need to be
@@ -343,6 +345,8 @@ xfs_fs_counts(
 	xfs_mount_t		*mp,
 	xfs_fsop_counts_t	*cnt)
 {
+	xfs_inodegc_summary_flush(mp);
+
 	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
 	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
 	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4d4aa61fbd34..791202236a18 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -32,6 +32,8 @@
 #define XFS_ICI_RECLAIM_TAG	0
 /* Inode has speculative preallocations (posteof or cow) to clean. */
 #define XFS_ICI_BLOCKGC_TAG	1
+/* Inode can be inactivated. */
+#define XFS_ICI_INODEGC_TAG	2
 
 /*
  * The goal for walking incore inodes.  These can correspond with incore inode
@@ -44,6 +46,7 @@ enum xfs_icwalk_goal {
 	/* Goals directly associated with tagged inodes. */
 	XFS_ICWALK_BLOCKGC	= XFS_ICI_BLOCKGC_TAG,
 	XFS_ICWALK_RECLAIM	= XFS_ICI_RECLAIM_TAG,
+	XFS_ICWALK_INODEGC	= XFS_ICI_INODEGC_TAG,
 };
 
 #define XFS_ICWALK_NULL_TAG	(-1U)
@@ -228,6 +231,26 @@ xfs_blockgc_queue(
 	rcu_read_unlock();
 }
 
+static inline bool
+xfs_inodegc_running(struct xfs_mount *mp)
+{
+	return test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
+}
+
+/* Queue a new inode gc pass if there are inodes needing inactivation. */
+static void
+xfs_inodegc_queue(
+	struct xfs_mount        *mp)
+{
+	if (!xfs_inodegc_running(mp))
+		return;
+
+	rcu_read_lock();
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
+		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
+	rcu_read_unlock();
+}
+
 /* Set a tag on both the AG incore inode tree and the AG radix tree. */
 static void
 xfs_perag_set_inode_tag(
@@ -262,6 +285,9 @@ xfs_perag_set_inode_tag(
 	case XFS_ICI_BLOCKGC_TAG:
 		xfs_blockgc_queue(pag);
 		break;
+	case XFS_ICI_INODEGC_TAG:
+		xfs_inodegc_queue(mp);
+		break;
 	}
 
 	trace_xfs_perag_set_inode_tag(mp, pag->pag_agno, tag, _RET_IP_);
@@ -308,18 +334,28 @@ xfs_perag_clear_inode_tag(
  */
 void
 xfs_inode_mark_reclaimable(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	bool			need_inactive)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
+	unsigned int		tag;
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
 	spin_lock(&ip->i_flags_lock);
 
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+	if (need_inactive) {
+		trace_xfs_inode_set_need_inactive(ip);
+		ip->i_flags |= XFS_NEED_INACTIVE;
+		tag = XFS_ICI_INODEGC_TAG;
+	} else {
+		trace_xfs_inode_set_reclaimable(ip);
+		ip->i_flags |= XFS_IRECLAIMABLE;
+		tag = XFS_ICI_RECLAIM_TAG;
+	}
+
+	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -383,19 +419,26 @@ xfs_reinit_inode(
 static int
 xfs_iget_recycle(
 	struct xfs_perag	*pag,
-	struct xfs_inode	*ip) __releases(&ip->i_flags_lock)
+	struct xfs_inode	*ip,
+	unsigned long		iflag) __releases(&ip->i_flags_lock)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
+	unsigned int		tag;
 	int			error;
 
+	ASSERT(iflag == XFS_IRECLAIM || iflag == XFS_INACTIVATING);
+
+	tag = (iflag == XFS_INACTIVATING) ? XFS_ICI_INODEGC_TAG :
+					    XFS_ICI_RECLAIM_TAG;
+
 	/*
 	 * We need to make it look like the inode is being reclaimed to prevent
 	 * the actual reclaim workers from stomping over us while we recycle
 	 * the inode.  We can't clear the radix tree tag yet as it requires
 	 * pag_ici_lock to be held exclusive.
 	 */
-	ip->i_flags |= XFS_IRECLAIM;
+	ip->i_flags |= iflag;
 
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
@@ -412,10 +455,13 @@ xfs_iget_recycle(
 		rcu_read_lock();
 		spin_lock(&ip->i_flags_lock);
 		wake = !!__xfs_iflags_test(ip, XFS_INEW);
-		ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
+		ip->i_flags &= ~(XFS_INEW | iflag);
 		if (wake)
 			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
-		ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
+		if (iflag == XFS_IRECLAIM)
+			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
+		if (iflag == XFS_INACTIVATING)
+			ASSERT(ip->i_flags & XFS_NEED_INACTIVE);
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 		return error;
@@ -431,8 +477,7 @@ xfs_iget_recycle(
 	 */
 	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 	ip->i_flags |= XFS_INEW;
-	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
+	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
 	inode->i_state = I_NEW;
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -455,6 +500,13 @@ xfs_iget_check_free_state(
 	struct xfs_inode	*ip,
 	int			flags)
 {
+	/*
+	 * Unlinked inodes awaiting inactivation must not be reused until we
+	 * have a chance to clear the on-disk metadata.
+	 */
+	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
+		return -ENOENT;
+
 	if (flags & XFS_IGET_CREATE) {
 		/* should be a free inode */
 		if (VFS_I(ip)->i_mode != 0) {
@@ -521,7 +573,7 @@ xfs_iget_cache_hit(
 	 *	     wait_on_inode to wait for these flags to be cleared
 	 *	     instead of polling for it.
 	 */
-	if (ip->i_flags & (XFS_INEW|XFS_IRECLAIM)) {
+	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING)) {
 		trace_xfs_iget_skip(ip);
 		XFS_STATS_INC(mp, xs_ig_frecycle);
 		error = -EAGAIN;
@@ -549,11 +601,29 @@ xfs_iget_cache_hit(
 		}
 
 		/* Drops i_flags_lock and RCU read lock. */
-		error = xfs_iget_recycle(pag, ip);
+		error = xfs_iget_recycle(pag, ip, XFS_IRECLAIM);
 		if (error) {
 			trace_xfs_iget_reclaim_fail(ip);
 			return error;
 		}
+	} else if (ip->i_flags & XFS_NEED_INACTIVE) {
+		/*
+		 * If NEED_INACTIVE is set, we've torn down the VFS inode
+		 * already, and must carefully restore it to usable state.
+		 */
+		trace_xfs_iget_inactive(ip);
+
+		if (flags & XFS_IGET_INCORE) {
+			error = -EAGAIN;
+			goto out_error;
+		}
+
+		/* Drops i_flags_lock and RCU read lock. */
+		error = xfs_iget_recycle(pag, ip, XFS_INACTIVATING);
+		if (error) {
+			trace_xfs_iget_inactive_fail(ip);
+			return error;
+		}
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -845,22 +915,33 @@ xfs_dqrele_igrab(
 
 	/*
 	 * Skip inodes that are anywhere in the reclaim machinery because we
-	 * drop dquots before tagging an inode for reclamation.
+	 * drop dquots before tagging an inode for reclamation.  If the inode
+	 * is being inactivated, skip it because inactivation will drop the
+	 * dquots for us.
 	 */
-	if (ip->i_flags & (XFS_IRECLAIM | XFS_IRECLAIMABLE))
+	if (ip->i_flags & (XFS_IRECLAIM | XFS_IRECLAIMABLE | XFS_INACTIVATING))
 		goto out_unlock;
 
 	/*
-	 * The inode looks alive; try to grab a VFS reference so that it won't
-	 * get destroyed.  If we got the reference, return true to say that
-	 * we grabbed the inode.
+	 * If the inode is queued but not undergoing inactivation, set the
+	 * inactivating flag so everyone will leave it alone and return true
+	 * to say that we are taking ownership of it.
+	 *
+	 * Otherwise, the inode looks alive; try to grab a VFS reference so
+	 * that it won't get destroyed.  If we got the reference, return true
+	 * to say that we grabbed the inode.
 	 *
 	 * If we can't get the reference, then we know the inode had its VFS
 	 * state torn down and hasn't yet entered the reclaim machinery.  Since
 	 * we also know that dquots are detached from an inode before it enters
 	 * reclaim, we can skip the inode.
 	 */
-	ret = igrab(VFS_I(ip)) != NULL;
+	if (ip->i_flags & XFS_NEED_INACTIVE) {
+		ip->i_flags |= XFS_INACTIVATING;
+		ret = true;
+	} else {
+		ret = igrab(VFS_I(ip)) != NULL;
+	}
 
 out_unlock:
 	spin_unlock(&ip->i_flags_lock);
@@ -873,6 +954,8 @@ xfs_dqrele_inode(
 	struct xfs_inode	*ip,
 	struct xfs_icwalk	*icw)
 {
+	bool			live_inode;
+
 	if (xfs_iflags_test(ip, XFS_INEW))
 		xfs_inew_wait(ip);
 
@@ -890,7 +973,19 @@ xfs_dqrele_inode(
 		ip->i_pdquot = NULL;
 	}
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	xfs_irele(ip);
+
+	/*
+	 * If we set INACTIVATING earlier to prevent this inode from being
+	 * touched, clear that state to let the inodegc claim it.  Otherwise,
+	 * it's a live inode and we need to release it.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	live_inode = !(ip->i_flags & XFS_INACTIVATING);
+	ip->i_flags &= ~XFS_INACTIVATING;
+	spin_unlock(&ip->i_flags_lock);
+
+	if (live_inode)
+		xfs_irele(ip);
 }
 
 /*
@@ -999,6 +1094,7 @@ xfs_reclaim_inode(
 
 	xfs_iflags_clear(ip, XFS_IFLUSHING);
 reclaim:
+	trace_xfs_inode_reclaiming(ip);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1476,6 +1572,8 @@ xfs_blockgc_start(
 
 /* Don't try to run block gc on an inode that's in any of these states. */
 #define XFS_BLOCKGC_NOGRAB_IFLAGS	(XFS_INEW | \
+					 XFS_NEED_INACTIVE | \
+					 XFS_INACTIVATING | \
 					 XFS_IRECLAIMABLE | \
 					 XFS_IRECLAIM)
 /*
@@ -1636,6 +1734,229 @@ xfs_blockgc_free_quota(
 			xfs_inode_dquot(ip, XFS_DQTYPE_PROJ), iwalk_flags);
 }
 
+/*
+ * Inode Inactivation and Reclaimation
+ * ===================================
+ *
+ * Sometimes, inodes need to have work done on them once the last program has
+ * closed the file.  Typically this means cleaning out any leftover speculative
+ * preallocations after EOF or in the CoW fork.  For inodes that have been
+ * totally unlinked, this means unmapping data/attr/cow blocks, removing the
+ * inode from the unlinked buckets, and marking it free in the inobt and inode
+ * table.
+ *
+ * This process can generate many metadata updates, which shows up as close()
+ * and unlink() calls that take a long time.  We defer all that work to a
+ * workqueue which means that we can batch a lot of work and do it in inode
+ * order for better performance.  Furthermore, we can control the workqueue,
+ * which means that we can avoid doing inactivation work at a bad time, such as
+ * when the fs is frozen.
+ *
+ * Deferred inactivation introduces new inode flag states (NEED_INACTIVE and
+ * INACTIVATING) and adds a new INODEGC radix tree tag for fast access.  We
+ * maintain separate perag counters for both types, and move counts as inodes
+ * wander the state machine, which now works as follows:
+ *
+ * If the inode needs inactivation, we:
+ *   - Set the NEED_INACTIVE inode flag
+ *   - Increment the per-AG inactive count
+ *   - Set the ICI_INODEGC tag in the per-AG inode tree
+ *   - Set the ICI_INODEGC tag in the per-fs AG tree
+ *   - Schedule background inode inactivation
+ *
+ * If the inode does not need inactivation, we:
+ *   - Set the IRECLAIMABLE inode flag
+ *   - Increment the per-AG reclaim count
+ *   - Set the ICI_RECLAIM tag in the per-AG inode tree
+ *   - Set the ICI_RECLAIM tag in the per-fs AG tree
+ *   - Schedule background inode reclamation
+ *
+ * When it is time for background inode inactivation, we:
+ *   - Set the INACTIVATING inode flag
+ *   - Make all the on-disk updates
+ *   - Clear both INACTIVATING and NEED_INACTIVE inode flags
+ *   - Decrement the per-AG inactive count
+ *   - Clear the ICI_INODEGC tag in the per-AG inode tree
+ *   - Clear the ICI_INODEGC tag in the per-fs AG tree if we just inactivated
+ *     the last inode in the AG
+ *   - Kick the inode into reclamation per the previous paragraph
+ *
+ * When it is time for background inode reclamation, we:
+ *   - Set the IRECLAIM inode flag
+ *   - Detach all the resources and remove the inode from the per-AG inode tree
+ *   - Clear both IRECLAIM and RECLAIMABLE inode flags
+ *   - Decrement the per-AG reclaim count
+ *   - Clear the ICI_RECLAIM tag from the per-AG inode tree
+ *   - Clear the ICI_RECLAIM tag from the per-fs AG tree if we just reclaimed
+ *     the last inode in the AG
+ *
+ * When these state transitions occur, the caller must have taken the per-AG
+ * incore inode tree lock and then the inode i_flags lock, in that order.
+ */
+
+/*
+ * Decide if the given @ip is eligible for inactivation, and grab it if so.
+ * Returns true if it's ready to go or false if we should just ignore it.
+ */
+static bool
+xfs_inodegc_igrab(
+	struct xfs_inode	*ip)
+{
+	ASSERT(rcu_read_lock_held());
+
+	/* Check for stale RCU freed inode */
+	spin_lock(&ip->i_flags_lock);
+	if (!ip->i_ino)
+		goto out_unlock_noent;
+
+	/*
+	 * Skip inodes that don't need inactivation or are being inactivated
+	 * (or reactivated) by another thread.  Inodes should not be tagged
+	 * for inactivation while also in INEW or any reclaim state.
+	 */
+	if (!(ip->i_flags & XFS_NEED_INACTIVE) ||
+	    (ip->i_flags & XFS_INACTIVATING))
+		goto out_unlock_noent;
+
+	/*
+	 * Mark this inode as being inactivated even if the fs is shut down
+	 * because we need xfs_inodegc_inactivate to push this inode into the
+	 * reclaim state.
+	 */
+	ip->i_flags |= XFS_INACTIVATING;
+	spin_unlock(&ip->i_flags_lock);
+	return true;
+
+out_unlock_noent:
+	spin_unlock(&ip->i_flags_lock);
+	return false;
+}
+
+/*
+ * Free all speculative preallocations and possibly even the inode itself.
+ * This is the last chance to make changes to an otherwise unreferenced file
+ * before incore reclamation happens.
+ */
+static int
+xfs_inodegc_inactivate(
+	struct xfs_inode	*ip,
+	struct xfs_perag	*pag,
+	struct xfs_icwalk	*icw)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+
+	/*
+	 * Inactivation isn't supposed to run when the fs is frozen because
+	 * we don't want kernel threads to block on transaction allocation.
+	 */
+	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
+
+	/*
+	 * Foreground threads that have hit ENOSPC or EDQUOT are allowed to
+	 * pass in a icw structure to look for inodes to inactivate
+	 * immediately to free some resources.  If this inode isn't a match,
+	 * put it back on the shelf and move on.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	if (!xfs_icwalk_match(ip, icw)) {
+		ip->i_flags &= ~XFS_INACTIVATING;
+		spin_unlock(&ip->i_flags_lock);
+		return 0;
+	}
+	spin_unlock(&ip->i_flags_lock);
+
+	trace_xfs_inode_inactivating(ip);
+
+	xfs_inactive(ip);
+	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
+
+	/*
+	 * Move the inode from the inactivation phase to the reclamation phase
+	 * by clearing both inactivation inode state flags and marking the
+	 * inode reclaimable.  Schedule background reclaim to run later.
+	 */
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
+	ip->i_flags |= XFS_IRECLAIMABLE;
+
+	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_INODEGC_TAG);
+	xfs_perag_set_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+
+	return 0;
+}
+
+/* Walk the fs and inactivate the inodes in them. */
+int
+xfs_inodegc_free_space(
+	struct xfs_mount	*mp,
+	struct xfs_icwalk	*icw)
+{
+	trace_xfs_inodegc_free_space(mp, icw, _RET_IP_);
+
+	return xfs_icwalk(mp, XFS_ICWALK_INODEGC, icw);
+}
+
+/* Background inode inactivation worker. */
+void
+xfs_inodegc_worker(
+	struct work_struct	*work)
+{
+	struct xfs_mount	*mp = container_of(to_delayed_work(work),
+					struct xfs_mount, m_inodegc_work);
+	int			error;
+
+	/*
+	 * Queueing of this inodegc worker can race with xfs_inodegc_stop,
+	 * which means that we can be running after the opflag clears.  Double
+	 * check the flag state so that we don't trip asserts.
+	 */
+	if (!xfs_inodegc_running(mp))
+		return;
+
+	error = xfs_inodegc_free_space(mp, NULL);
+	if (error && error != -EAGAIN)
+		xfs_err(mp, "inode inactivation failed, error %d", error);
+
+	xfs_inodegc_queue(mp);
+}
+
+/* Force all currently queued inode inactivation work to run immediately. */
+void
+xfs_inodegc_flush(
+	struct xfs_mount	*mp)
+{
+	if (!xfs_inodegc_running(mp) ||
+	    !radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
+		return;
+
+	mod_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
+	flush_delayed_work(&mp->m_inodegc_work);
+}
+
+/* Stop all queued inactivation work. */
+void
+xfs_inodegc_stop(
+	struct xfs_mount	*mp)
+{
+	clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
+	cancel_delayed_work_sync(&mp->m_inodegc_work);
+}
+
+/* Schedule deferred inode inactivation work. */
+void
+xfs_inodegc_start(
+	struct xfs_mount	*mp)
+{
+	set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
+	xfs_inodegc_queue(mp);
+}
+
 /* XFS Inode Cache Walking Code */
 
 /*
@@ -1664,6 +1985,8 @@ xfs_icwalk_igrab(
 		return xfs_blockgc_igrab(ip);
 	case XFS_ICWALK_RECLAIM:
 		return xfs_reclaim_igrab(ip, icw);
+	case XFS_ICWALK_INODEGC:
+		return xfs_inodegc_igrab(ip);
 	default:
 		return false;
 	}
@@ -1692,6 +2015,9 @@ xfs_icwalk_process_inode(
 	case XFS_ICWALK_RECLAIM:
 		xfs_reclaim_inode(ip, pag);
 		break;
+	case XFS_ICWALK_INODEGC:
+		error = xfs_inodegc_inactivate(ip, pag, icw);
+		break;
 	}
 	return error;
 }
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 00dc98a92835..d03d46f83316 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -52,7 +52,7 @@ void xfs_reclaim_inodes(struct xfs_mount *mp);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
-void xfs_inode_mark_reclaimable(struct xfs_inode *ip);
+void xfs_inode_mark_reclaimable(struct xfs_inode *ip, bool need_inactive);
 
 int xfs_blockgc_free_dquots(struct xfs_mount *mp, struct xfs_dquot *udqp,
 		struct xfs_dquot *gdqp, struct xfs_dquot *pdqp,
@@ -80,4 +80,37 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
+void xfs_inodegc_worker(struct work_struct *work);
+void xfs_inodegc_flush(struct xfs_mount *mp);
+void xfs_inodegc_stop(struct xfs_mount *mp);
+void xfs_inodegc_start(struct xfs_mount *mp);
+int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
+
+/*
+ * Process all pending inode inactivations immediately (sort of) so that a
+ * resource usage report will be mostly accurate with regards to files that
+ * have been unlinked recently.
+ *
+ * It isn't practical to maintain a count of the resources used by unlinked
+ * inodes to adjust the values reported by this function.  Resources that are
+ * shared (e.g. reflink) when an inode is queued for inactivation cannot be
+ * counted towards the adjustment, and cross referencing data extents with the
+ * refcount btree is the only way to decide if a resource is shared.  Worse,
+ * unsharing of any data blocks in the system requires either a second
+ * consultation with the refcount btree, or training users to deal with the
+ * free space counts possibly fluctuating upwards as inactivations occur.
+ *
+ * Hence we guard the inactivation flush with a ratelimiter so that the counts
+ * are not way out of whack while ignoring workloads that hammer us with statfs
+ * calls.  Once per clock tick seems frequent enough to avoid complaints about
+ * inaccurate counts.
+ */
+static inline void
+xfs_inodegc_summary_flush(
+	struct xfs_mount	*mp)
+{
+	if (__ratelimit(&mp->m_inodegc_ratelimit))
+		xfs_inodegc_flush(mp);
+}
+
 #endif
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 3bee1cd20072..aa55a210a440 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1654,6 +1654,55 @@ xfs_inactive_ifree(
 	return 0;
 }
 
+/*
+ * Returns true if we need to update the on-disk metadata before we can free
+ * the memory used by this inode.  Updates include freeing post-eof
+ * preallocations; freeing COW staging extents; and marking the inode free in
+ * the inobt if it is on the unlinked list.
+ */
+bool
+xfs_inode_needs_inactivation(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*cow_ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+
+	/*
+	 * If the inode is already free, then there can be nothing
+	 * to clean up here.
+	 */
+	if (VFS_I(ip)->i_mode == 0)
+		return false;
+
+	/* If this is a read-only mount, don't do this (would generate I/O) */
+	if (mp->m_flags & XFS_MOUNT_RDONLY)
+		return false;
+
+	/* Metadata inodes require explicit resource cleanup. */
+	if (xfs_is_metadata_inode(ip))
+		return false;
+
+	/* Want to clean out the cow blocks if there are any. */
+	if (cow_ifp && cow_ifp->if_bytes > 0)
+		return true;
+
+	/* Unlinked files must be freed. */
+	if (VFS_I(ip)->i_nlink == 0)
+		return true;
+
+	/*
+	 * This file isn't being freed, so check if there are post-eof blocks
+	 * to free.  @force is true because we are evicting an inode from the
+	 * cache.  Post-eof blocks must be freed, lest we end up with broken
+	 * free space accounting.
+	 *
+	 * Note: don't bother with iolock here since lockdep complains about
+	 * acquiring it in reclaim context. We have the only reference to the
+	 * inode at this point anyways.
+	 */
+	return xfs_can_free_eofblocks(ip, true);
+}
+
 /*
  * xfs_inactive
  *
@@ -1664,7 +1713,7 @@ xfs_inactive_ifree(
  */
 void
 xfs_inactive(
-	xfs_inode_t	*ip)
+	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp;
 	int			error;
@@ -1690,6 +1739,11 @@ xfs_inactive(
 	if (xfs_is_metadata_inode(ip))
 		goto out;
 
+	/* Ensure dquots are attached prior to making changes to this file. */
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		goto out;
+
 	/* Try to clean out the cow blocks if there are any. */
 	if (xfs_inode_has_cow_data(ip))
 		xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true);
@@ -1715,10 +1769,6 @@ xfs_inactive(
 	     ip->i_df.if_nextents > 0 || ip->i_delayed_blks > 0))
 		truncate = 1;
 
-	error = xfs_qm_dqattach(ip);
-	if (error)
-		goto out;
-
 	if (S_ISLNK(VFS_I(ip)->i_mode))
 		error = xfs_inactive_symlink(ip);
 	else if (truncate)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 4b6703dbffb8..8d695b39bcdf 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -240,6 +240,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
+#define XFS_NEED_INACTIVE	(1 << 10) /* see XFS_INACTIVATING below */
 /*
  * If this unlinked inode is in the middle of recovery, don't let drop_inode
  * truncate and free the inode.  This can happen if we iget the inode during
@@ -248,6 +249,15 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define XFS_IRECOVERY		(1 << 11)
 #define XFS_ICOWBLOCKS		(1 << 12)/* has the cowblocks tag set */
 
+/*
+ * If we need to update on-disk metadata before this IRECLAIMABLE inode can be
+ * freed, then NEED_INACTIVE will be set.  Once we start the updates, the
+ * INACTIVATING bit will be set to keep iget away from this inode.  After the
+ * inactivation completes, both flags will be cleared and the inode is a
+ * plain old IRECLAIMABLE inode.
+ */
+#define XFS_INACTIVATING	(1 << 13)
+
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
  * inode lookup. This prevents unintended behaviour on the new inode from
@@ -255,7 +265,8 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
-	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
+	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
+	 XFS_INACTIVATING)
 
 /*
  * Flags for inode locking.
@@ -493,6 +504,8 @@ extern struct kmem_zone	*xfs_inode_zone;
 /* The default CoW extent size hint. */
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
+bool xfs_inode_needs_inactivation(struct xfs_inode *ip);
+
 int xfs_iunlink_init(struct xfs_perag *pag);
 void xfs_iunlink_destroy(struct xfs_perag *pag);
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1227503d2246..9d8fc85bd28d 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2784,6 +2784,13 @@ xlog_recover_process_iunlinks(
 		}
 		xfs_buf_rele(agibp);
 	}
+
+	/*
+	 * Flush the pending unlinked inodes to ensure that the inactivations
+	 * are fully completed on disk and the incore inodes can be reclaimed
+	 * before we signal that recovery is complete.
+	 */
+	xfs_inodegc_flush(mp);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index c3a96fb3ad80..fbbee8ac12f3 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -516,6 +516,10 @@ xfs_check_summary_counts(
  * so we need to unpin them, write them back and/or reclaim them before unmount
  * can proceed.
  *
+ * The caller should start the process by flushing queued inactivation work so
+ * that all file updates to on-disk metadata can be flushed with the log.
+ * After the AIL push, all inodes should be ready for reclamation.
+ *
  * An inode cluster that has been freed can have its buffer still pinned in
  * memory because the transaction is still sitting in a iclog. The stale inodes
  * on that buffer will be pinned to the buffer until the transaction hits the
@@ -546,6 +550,7 @@ xfs_unmount_flush_inodes(
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
 
 	xfs_ail_push_all_sync(mp->m_ail);
+	xfs_inodegc_stop(mp);
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
@@ -782,6 +787,17 @@ xfs_mountfs(
 	if (error)
 		goto out_log_dealloc;
 
+	/*
+	 * Don't allow statfs to hammer us with inodegc flushes.  Once per
+	 * clock tick seems sufficient to avoid complaints about inaccurate
+	 * counter values.  Disable ratelimit logging.
+	 */
+	ratelimit_state_init(&mp->m_inodegc_ratelimit, 1, 1);
+	ratelimit_set_flags(&mp->m_inodegc_ratelimit, RATELIMIT_MSG_ON_RELEASE);
+
+	/* Enable background workers. */
+	xfs_inodegc_start(mp);
+
 	/*
 	 * Get and sanity-check the root inode.
 	 * Save the pointer to it in the mount structure.
@@ -937,9 +953,9 @@ xfs_mountfs(
 	/* Clean out dquots that might be in memory after quotacheck. */
 	xfs_qm_unmount(mp);
 	/*
-	 * Flush all inode reclamation work and flush the log.
-	 * We have to do this /after/ rtunmount and qm_unmount because those
-	 * two will have scheduled delayed reclaim for the rt/quota inodes.
+	 * Flush all inode reclamation work and flush inodes to the log.  Do
+	 * this after rtunmount and qm_unmount because those two will have
+	 * released the rt and quota inodes.
 	 *
 	 * This is slightly different from the unmountfs call sequence
 	 * because we could be tearing down a partially set up mount.  In
@@ -947,6 +963,7 @@ xfs_mountfs(
 	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
 	 * quota inodes.
 	 */
+	xfs_inodegc_flush(mp);
 	xfs_unmount_flush_inodes(mp);
  out_log_dealloc:
 	xfs_log_mount_cancel(mp);
@@ -983,6 +1000,12 @@ xfs_unmountfs(
 	uint64_t		resblks;
 	int			error;
 
+	/*
+	 * Flush all the queued inode inactivation work to disk before tearing
+	 * down rt metadata and quotas.
+	 */
+	xfs_inodegc_flush(mp);
+
 	xfs_blockgc_stop(mp);
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c78b63fe779a..04a016a46dc8 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -154,6 +154,8 @@ typedef struct xfs_mount {
 	uint8_t			m_rt_checked;
 	uint8_t			m_rt_sick;
 
+	unsigned long		m_opflags;
+
 	/*
 	 * End of read-mostly variables. Frequently written variables and locks
 	 * should be placed below this comment from now on. The first variable
@@ -184,6 +186,7 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
+	struct delayed_work	m_inodegc_work; /* background inode inactive */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -220,6 +223,7 @@ typedef struct xfs_mount {
 	unsigned int		*m_errortag;
 	struct xfs_kobj		m_errortag_kobj;
 #endif
+	struct ratelimit_state	m_inodegc_ratelimit;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
@@ -258,6 +262,9 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_DAX_ALWAYS	(1ULL << 26)
 #define XFS_MOUNT_DAX_NEVER	(1ULL << 27)
 
+#define XFS_OPFLAG_INODEGC_RUNNING_BIT	(0)	/* are we allowed to start the
+						   inode inactivation worker? */
+
 /*
  * Max and min values for mount-option defined I/O
  * preallocation sizes.
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 13a56e1ea15c..137f9486ac6a 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -698,6 +698,8 @@ xfs_qm_scall_getquota(
 	struct xfs_dquot	*dqp;
 	int			error;
 
+	xfs_inodegc_summary_flush(mp);
+
 	/*
 	 * Try to get the dquot. We don't want it allocated on disk, so don't
 	 * set doalloc. If it doesn't exist, we'll get ENOENT back.
@@ -736,6 +738,8 @@ xfs_qm_scall_getquota_next(
 	struct xfs_dquot	*dqp;
 	int			error;
 
+	xfs_inodegc_summary_flush(mp);
+
 	error = xfs_qm_dqget_next(mp, *id, type, &dqp);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 3a7fd4f02aa7..120a4426fd64 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -629,6 +629,43 @@ xfs_check_delalloc(
 #define xfs_check_delalloc(ip, whichfork)	do { } while (0)
 #endif
 
+#ifdef CONFIG_XFS_QUOTA
+/*
+ * If a quota type is turned off but we still have a dquot attached to the
+ * inode, detach it before tagging this inode for inactivation (or reclaim) to
+ * avoid delaying quotaoff for longer than is necessary.  Because the inode has
+ * no VFS state and has not yet been tagged for reclaim or inactivation, it is
+ * safe to drop the dquots locklessly because iget, quotaoff, blockgc, and
+ * reclaim will not touch the inode.
+ */
+static inline void
+xfs_fs_dqdestroy_inode(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (!XFS_IS_UQUOTA_ON(mp)) {
+		xfs_qm_dqrele(ip->i_udquot);
+		ip->i_udquot = NULL;
+	}
+	if (!XFS_IS_GQUOTA_ON(mp)) {
+		xfs_qm_dqrele(ip->i_gdquot);
+		ip->i_gdquot = NULL;
+	}
+	if (!XFS_IS_PQUOTA_ON(mp)) {
+		xfs_qm_dqrele(ip->i_pdquot);
+		ip->i_pdquot = NULL;
+	}
+}
+#else
+# define xfs_fs_dqdestroy_inode(ip)		((void)0)
+#endif
+
+/* iflags that we shouldn't see before scheduling reclaim or inactivation. */
+#define XFS_IDESTROY_BAD_IFLAGS	(XFS_IRECLAIMABLE | \
+				 XFS_IRECLAIM | \
+				 XFS_NEED_INACTIVE | \
+				 XFS_INACTIVATING)
 /*
  * Now that the generic code is guaranteed not to be accessing
  * the linux inode, we can inactivate and reclaim the inode.
@@ -638,28 +675,44 @@ xfs_fs_destroy_inode(
 	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	bool			need_inactive;
 
 	trace_xfs_destroy_inode(ip);
 
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-	XFS_STATS_INC(ip->i_mount, vn_rele);
-	XFS_STATS_INC(ip->i_mount, vn_remove);
+	XFS_STATS_INC(mp, vn_rele);
+	XFS_STATS_INC(mp, vn_remove);
 
-	xfs_inactive(ip);
+	need_inactive = xfs_inode_needs_inactivation(ip);
+	if (!need_inactive) {
+		/*
+		 * If the inode doesn't need inactivation, that means we're
+		 * going directly into reclaim and can drop the dquots.  It
+		 * also means that there shouldn't be any delalloc reservations
+		 * or speculative CoW preallocations remaining.
+		 */
+		xfs_qm_dqdetach(ip);
 
-	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
-		xfs_check_delalloc(ip, XFS_DATA_FORK);
-		xfs_check_delalloc(ip, XFS_COW_FORK);
-		ASSERT(0);
+		if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
+			xfs_check_delalloc(ip, XFS_DATA_FORK);
+			xfs_check_delalloc(ip, XFS_COW_FORK);
+			ASSERT(0);
+		}
+	} else {
+		/*
+		 * Drop dquots for disabled quota types to avoid delaying
+		 * quotaoff while we wait for inactivation to occur.
+		 */
+		xfs_fs_dqdestroy_inode(ip);
 	}
 
-	XFS_STATS_INC(ip->i_mount, vn_reclaim);
+	XFS_STATS_INC(mp, vn_reclaim);
 
 	/*
-	 * We should never get here with one of the reclaim flags already set.
+	 * We should never get here with any of the reclaim flags already set.
 	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IDESTROY_BAD_IFLAGS));
 
 	/*
 	 * We always use background reclaim here because even if the inode is
@@ -668,7 +721,7 @@ xfs_fs_destroy_inode(
 	 * reclaim path handles this more efficiently than we can here, so
 	 * simply let background reclaim tear down all inodes.
 	 */
-	xfs_inode_mark_reclaimable(ip);
+	xfs_inode_mark_reclaimable(ip, need_inactive);
 }
 
 static void
@@ -780,6 +833,21 @@ xfs_fs_sync_fs(
 		flush_delayed_work(&mp->m_log->l_work);
 	}
 
+	/*
+	 * If the fs is at FREEZE_PAGEFAULTS, that means the VFS holds the
+	 * umount mutex and is syncing the filesystem just before setting the
+	 * state to FREEZE_FS.  We are not allowed to run transactions on a
+	 * filesystem that is in FREEZE_FS state, so deactivate the background
+	 * workers before we get there, and leave them off for the duration of
+	 * the freeze.
+	 *
+	 * We can't do this in xfs_fs_freeze_super because freeze_super takes
+	 * s_umount, which means we can't lock out a concurrent thaw request
+	 * without adding another layer of locks to the freeze process.
+	 */
+	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT)
+		xfs_inodegc_stop(mp);
+
 	return 0;
 }
 
@@ -798,6 +866,8 @@ xfs_fs_statfs(
 	xfs_extlen_t		lsize;
 	int64_t			ffree;
 
+	xfs_inodegc_summary_flush(mp);
+
 	statp->f_type = XFS_SUPER_MAGIC;
 	statp->f_namelen = MAXNAMELEN - 1;
 
@@ -908,10 +978,27 @@ xfs_fs_unfreeze(
 
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
+	xfs_inodegc_start(mp);
 	xfs_blockgc_start(mp);
 	return 0;
 }
 
+STATIC int
+xfs_fs_freeze_super(
+	struct super_block	*sb)
+{
+	struct xfs_mount	*mp = XFS_M(sb);
+
+	/*
+	 * Before we take s_umount to get to FREEZE_WRITE, flush all the
+	 * accumulated background work so that there's less recovery work
+	 * to do if we crash during the freeze.
+	 */
+	xfs_inodegc_flush(mp);
+
+	return freeze_super(sb);
+}
+
 /*
  * This function fills in xfs_mount_t fields based on mount args.
  * Note: the superblock _has_ now been read in.
@@ -1090,6 +1177,7 @@ static const struct super_operations xfs_super_operations = {
 	.show_options		= xfs_fs_show_options,
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
+	.freeze_super		= xfs_fs_freeze_super,
 };
 
 static int
@@ -1737,6 +1825,13 @@ xfs_remount_ro(
 		return error;
 	}
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes.
+	 * Since this can involve finobt updates, do it now before we lose the
+	 * per-AG space reservations to guarantee that we won't fail there.
+	 */
+	xfs_inodegc_flush(mp);
+
 	/* Free the per-AG metadata reservation pool. */
 	error = xfs_fs_unreserve_ag_blocks(mp);
 	if (error) {
@@ -1860,6 +1955,7 @@ static int xfs_init_fs_context(
 	mutex_init(&mp->m_growlock);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
+	INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	/*
 	 * We don't create the finobt per-ag space reservation until after log
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a10612155377..49d09b1571d6 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -615,14 +615,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
+		__field(unsigned long, iflags)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
+		__entry->iflags = ip->i_flags;
 	),
-	TP_printk("dev %d:%d ino 0x%llx",
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino)
+		  __entry->ino,
+		  __entry->iflags)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -632,6 +635,8 @@ DEFINE_EVENT(xfs_inode_class, name, \
 DEFINE_INODE_EVENT(xfs_iget_skip);
 DEFINE_INODE_EVENT(xfs_iget_reclaim);
 DEFINE_INODE_EVENT(xfs_iget_reclaim_fail);
+DEFINE_INODE_EVENT(xfs_iget_inactive);
+DEFINE_INODE_EVENT(xfs_iget_inactive_fail);
 DEFINE_INODE_EVENT(xfs_iget_hit);
 DEFINE_INODE_EVENT(xfs_iget_miss);
 
@@ -666,6 +671,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
+DEFINE_INODE_EVENT(xfs_inode_reclaiming);
+DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
+DEFINE_INODE_EVENT(xfs_inode_inactivating);
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -3928,6 +3937,7 @@ DEFINE_EVENT(xfs_icwalk_class, name,	\
 	TP_ARGS(mp, icw, caller_ip))
 DEFINE_ICWALK_EVENT(xfs_ioc_free_eofblocks);
 DEFINE_ICWALK_EVENT(xfs_blockgc_free_space);
+DEFINE_ICWALK_EVENT(xfs_inodegc_free_space);
 
 #endif /* _TRACE_XFS_H */
 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 3/9] xfs: expose sysfs knob to control inode inactivation delay
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
  2021-06-07 22:24 ` [PATCH 1/9] xfs: refactor the inode recycling code Darrick J. Wong
  2021-06-07 22:25 ` [PATCH 2/9] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-06-07 22:25 ` Darrick J. Wong
  2021-06-08  1:09   ` Dave Chinner
  2021-06-07 22:25 ` [PATCH 4/9] xfs: force inode inactivation and retry fs writes when there isn't space Darrick J. Wong
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:25 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Allow administrators to control the length that we defer inode
inactivation.  By default we'll set the delay to 2 seconds, as an
arbitrary choice between allowing for some batching of a deltree
operation, and not letting too many inodes pile up in memory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/admin-guide/xfs.rst |    7 +++++++
 fs/xfs/xfs_globals.c              |    3 +++
 fs/xfs/xfs_icache.c               |    3 ++-
 fs/xfs/xfs_linux.h                |    1 +
 fs/xfs/xfs_sysctl.c               |    9 +++++++++
 fs/xfs/xfs_sysctl.h               |    1 +
 6 files changed, 23 insertions(+), 1 deletion(-)


diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index f9b109bfc6a6..9dd62b155fda 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -277,6 +277,13 @@ The following sysctls are available for the XFS filesystem:
 	references and returns timed-out AGs back to the free stream
 	pool.
 
+  fs.xfs.inode_gc_delay
+	(Units: centiseconds   Min: 0  Default: 1  Max: 360000)
+	The amount of time to delay cleanup work that happens after a file is
+	closed by all programs.  This involves clearing speculative
+	preallocations from linked files and freeing unlinked files.  A higher
+	value here increases batching at a risk of background work storms.
+
   fs.xfs.speculative_prealloc_lifetime
 	(Units: seconds   Min: 1  Default: 300  Max: 86400)
 	The interval at which the background scanning for inodes
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index f62fa652c2fd..8c359b0b8fd3 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -28,6 +28,9 @@ xfs_param_t xfs_params = {
 	.rotorstep	= {	1,		1,		255	},
 	.inherit_nodfrg	= {	0,		1,		1	},
 	.fstrm_timer	= {	1,		30*100,		3600*100},
+	.inodegc_timer	= {	0,		1,		3600*100},
+
+	/* Values below here are measured in seconds */
 	.blockgc_timer	= {	1,		300,		3600*24},
 };
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 791202236a18..432b30d0b878 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -247,7 +247,8 @@ xfs_inodegc_queue(
 
 	rcu_read_lock();
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
-		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
+		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
+				msecs_to_jiffies(xfs_inodegc_centisecs * 10));
 	rcu_read_unlock();
 }
 
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 7688663b9773..3d6b0a407d52 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -99,6 +99,7 @@ typedef __u32			xfs_nlink_t;
 #define xfs_inherit_nodefrag	xfs_params.inherit_nodfrg.val
 #define xfs_fstrm_centisecs	xfs_params.fstrm_timer.val
 #define xfs_blockgc_secs	xfs_params.blockgc_timer.val
+#define xfs_inodegc_centisecs	xfs_params.inodegc_timer.val
 
 #define current_cpu()		(raw_smp_processor_id())
 #define current_set_flags_nested(sp, f)		\
diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c
index 546a6cd96729..878f31d3a587 100644
--- a/fs/xfs/xfs_sysctl.c
+++ b/fs/xfs/xfs_sysctl.c
@@ -176,6 +176,15 @@ static struct ctl_table xfs_table[] = {
 		.extra1		= &xfs_params.fstrm_timer.min,
 		.extra2		= &xfs_params.fstrm_timer.max,
 	},
+	{
+		.procname	= "inode_gc_delay",
+		.data		= &xfs_params.inodegc_timer.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &xfs_params.inodegc_timer.min,
+		.extra2		= &xfs_params.inodegc_timer.max
+	},
 	{
 		.procname	= "speculative_prealloc_lifetime",
 		.data		= &xfs_params.blockgc_timer.val,
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index 7692e76ead33..a045c33c3d30 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -36,6 +36,7 @@ typedef struct xfs_param {
 	xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */
 	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
 	xfs_sysctl_val_t blockgc_timer;	/* Interval between blockgc scans */
+	xfs_sysctl_val_t inodegc_timer;	/* Inode inactivation scan interval */
 } xfs_param_t;
 
 /*


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 4/9] xfs: force inode inactivation and retry fs writes when there isn't space
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-06-07 22:25 ` [PATCH 3/9] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
@ 2021-06-07 22:25 ` Darrick J. Wong
  2021-06-07 22:25 ` [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low Darrick J. Wong
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:25 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Any time we try to modify a file's contents and it fails due to ENOSPC
or EDQUOT, force inode inactivation work to try to free space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 432b30d0b878..a7ca6b988e29 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1657,16 +1657,23 @@ xfs_blockgc_worker(
 }
 
 /*
- * Try to free space in the filesystem by purging eofblocks and cowblocks.
+ * Try to free space in the filesystem by purging inactive inodes, eofblocks
+ * and cowblocks.
  */
 int
 xfs_blockgc_free_space(
 	struct xfs_mount	*mp,
 	struct xfs_icwalk	*icw)
 {
+	int			error;
+
 	trace_xfs_blockgc_free_space(mp, icw, _RET_IP_);
 
-	return xfs_icwalk(mp, XFS_ICWALK_BLOCKGC, icw);
+	error = xfs_icwalk(mp, XFS_ICWALK_BLOCKGC, icw);
+	if (error)
+		return error;
+
+	return xfs_inodegc_free_space(mp, icw);
 }
 
 /*


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-06-07 22:25 ` [PATCH 4/9] xfs: force inode inactivation and retry fs writes when there isn't space Darrick J. Wong
@ 2021-06-07 22:25 ` Darrick J. Wong
  2021-06-08  1:26   ` Dave Chinner
  2021-06-07 22:25 ` [PATCH 6/9] xfs: parallelize inode inactivation Darrick J. Wong
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:25 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Generally speaking, when a user calls fallocate, they're looking to
preallocate space in a file in the largest contiguous chunks possible.
If free space is low, it's possible that the free space will look
unnecessarily fragmented because there are unlinked inodes that are
holding on to space that we could allocate.  When this happens,
fallocate makes suboptimal allocation decisions for the sake of deleted
files, which doesn't make much sense, so scan the filesystem for dead
items to delete to try to avoid this.

Note that there are a handful of fstests that fill a filesystem, delete
just enough files to allow a single large allocation, and check that
fallocate actually gets the allocation.  These tests regress because the
test runs fallocate before the inode gc has a chance to run, so add this
behavior to maintain as much of the old behavior as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.c    |    8 ++++++++
 fs/xfs/xfs_icache.h    |    1 +
 3 files changed, 52 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 997eb5c6e9b4..a1be77fe89d6 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -28,6 +28,8 @@
 #include "xfs_icache.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -767,6 +769,43 @@ xfs_free_eofblocks(
 	return error;
 }
 
+/*
+ * If the target device (or some part of it) is full enough that it won't to be
+ * able to satisfy the entire request, try to free inactive files to free up
+ * space.  While it's perfectly fine to fill a preallocation request with a
+ * bunch of short extents, we prefer to slow down preallocation requests to
+ * combat long term fragmentation in new file data.
+ */
+static int
+xfs_alloc_consolidate_freespace(
+	struct xfs_inode	*ip,
+	xfs_filblks_t		wanted)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	struct xfs_sb		*sbp = &mp->m_sb;
+	xfs_agnumber_t		agno;
+
+	if (!xfs_has_inodegc_work(mp))
+		return 0;
+
+	if (XFS_IS_REALTIME_INODE(ip)) {
+		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
+			return 0;
+		goto free_space;
+	}
+
+	for_each_perag(mp, agno, pag) {
+		if (pag->pagf_freeblks >= wanted) {
+			xfs_perag_put(pag);
+			return 0;
+		}
+	}
+
+free_space:
+	return xfs_inodegc_free_space(mp, NULL);
+}
+
 int
 xfs_alloc_file_space(
 	struct xfs_inode	*ip,
@@ -851,6 +890,10 @@ xfs_alloc_file_space(
 			rblocks = 0;
 		}
 
+		error = xfs_alloc_consolidate_freespace(ip, allocatesize_fsb);
+		if (error)
+			break;
+
 		/*
 		 * Allocate and setup the transaction.
 		 */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a7ca6b988e29..8016e90b7b6d 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1965,6 +1965,14 @@ xfs_inodegc_start(
 	xfs_inodegc_queue(mp);
 }
 
+/* Are there files waiting for inactivation? */
+bool
+xfs_has_inodegc_work(
+	struct xfs_mount	*mp)
+{
+	return radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG);
+}
+
 /* XFS Inode Cache Walking Code */
 
 /*
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index d03d46f83316..1f693e7fe6c8 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -85,6 +85,7 @@ void xfs_inodegc_flush(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
 int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
+bool xfs_has_inodegc_work(struct xfs_mount *mp);
 
 /*
  * Process all pending inode inactivations immediately (sort of) so that a


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 6/9] xfs: parallelize inode inactivation
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-06-07 22:25 ` [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low Darrick J. Wong
@ 2021-06-07 22:25 ` Darrick J. Wong
  2021-06-07 22:25 ` [PATCH 7/9] xfs: create a polled function to force " Darrick J. Wong
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:25 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Split the inode inactivation work into per-AG work items so that we can
take advantage of parallelization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c |    3 +++
 fs/xfs/libxfs/xfs_ag.h |    3 +++
 fs/xfs/xfs_icache.c    |   48 ++++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_mount.h     |    1 -
 fs/xfs/xfs_super.c     |    1 -
 5 files changed, 40 insertions(+), 16 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 0765a0ba30e1..7652d90d7d0d 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -173,6 +173,7 @@ __xfs_free_perag(
 	struct xfs_perag *pag = container_of(head, struct xfs_perag, rcu_head);
 
 	ASSERT(!delayed_work_pending(&pag->pag_blockgc_work));
+	ASSERT(!delayed_work_pending(&pag->pag_inodegc_work));
 	ASSERT(atomic_read(&pag->pag_ref) == 0);
 	kmem_free(pag);
 }
@@ -195,6 +196,7 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
+		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 
@@ -253,6 +255,7 @@ xfs_initialize_perag(
 		spin_lock_init(&pag->pagb_lock);
 		spin_lock_init(&pag->pag_state_lock);
 		INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker);
+		INIT_DELAYED_WORK(&pag->pag_inodegc_work, xfs_inodegc_worker);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		init_waitqueue_head(&pag->pagb_wait);
 		pag->pagb_count = 0;
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index d68b56de495a..1408043dce85 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -96,6 +96,9 @@ struct xfs_perag {
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;
 
+	/* background inode inactivation */
+	struct delayed_work	pag_inodegc_work;
+
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8016e90b7b6d..8574edca6f52 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -240,14 +240,16 @@ xfs_inodegc_running(struct xfs_mount *mp)
 /* Queue a new inode gc pass if there are inodes needing inactivation. */
 static void
 xfs_inodegc_queue(
-	struct xfs_mount        *mp)
+	struct xfs_perag	*pag)
 {
+	struct xfs_mount        *mp = pag->pag_mount;
+
 	if (!xfs_inodegc_running(mp))
 		return;
 
 	rcu_read_lock();
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
-		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
+		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work,
 				msecs_to_jiffies(xfs_inodegc_centisecs * 10));
 	rcu_read_unlock();
 }
@@ -287,7 +289,7 @@ xfs_perag_set_inode_tag(
 		xfs_blockgc_queue(pag);
 		break;
 	case XFS_ICI_INODEGC_TAG:
-		xfs_inodegc_queue(mp);
+		xfs_inodegc_queue(pag);
 		break;
 	}
 
@@ -1915,8 +1917,9 @@ void
 xfs_inodegc_worker(
 	struct work_struct	*work)
 {
-	struct xfs_mount	*mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_inodegc_work);
+	struct xfs_perag	*pag = container_of(to_delayed_work(work),
+					struct xfs_perag, pag_inodegc_work);
+	struct xfs_mount	*mp = pag->pag_mount;
 	int			error;
 
 	/*
@@ -1927,24 +1930,33 @@ xfs_inodegc_worker(
 	if (!xfs_inodegc_running(mp))
 		return;
 
-	error = xfs_inodegc_free_space(mp, NULL);
+	error = xfs_icwalk_ag(pag, XFS_ICWALK_INODEGC, NULL);
 	if (error && error != -EAGAIN)
 		xfs_err(mp, "inode inactivation failed, error %d", error);
 
-	xfs_inodegc_queue(mp);
+	xfs_inodegc_queue(pag);
 }
 
-/* Force all currently queued inode inactivation work to run immediately. */
+/* Force all queued inode inactivation work to run immediately. */
 void
 xfs_inodegc_flush(
 	struct xfs_mount	*mp)
 {
-	if (!xfs_inodegc_running(mp) ||
-	    !radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	bool			queued = false;
+
+	if (!xfs_inodegc_running(mp))
+		return;
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INODEGC_TAG) {
+		mod_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, 0);
+		queued = true;
+	}
+	if (!queued)
 		return;
 
-	mod_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
-	flush_delayed_work(&mp->m_inodegc_work);
+	flush_workqueue(mp->m_gc_workqueue);
 }
 
 /* Stop all queued inactivation work. */
@@ -1952,8 +1964,12 @@ void
 xfs_inodegc_stop(
 	struct xfs_mount	*mp)
 {
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
 	clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
-	cancel_delayed_work_sync(&mp->m_inodegc_work);
+	for_each_perag(mp, agno, pag)
+		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 }
 
 /* Schedule deferred inode inactivation work. */
@@ -1961,8 +1977,12 @@ void
 xfs_inodegc_start(
 	struct xfs_mount	*mp)
 {
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
 	set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
-	xfs_inodegc_queue(mp);
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INODEGC_TAG)
+		xfs_inodegc_queue(pag);
 }
 
 /* Are there files waiting for inactivation? */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 04a016a46dc8..c2a8f0a550cd 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -186,7 +186,6 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
-	struct delayed_work	m_inodegc_work; /* background inode inactive */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 120a4426fd64..164626a4232f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1955,7 +1955,6 @@ static int xfs_init_fs_context(
 	mutex_init(&mp->m_growlock);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
-	INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	/*
 	 * We don't create the finobt per-ag space reservation until after log


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 7/9] xfs: create a polled function to force inode inactivation
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-06-07 22:25 ` [PATCH 6/9] xfs: parallelize inode inactivation Darrick J. Wong
@ 2021-06-07 22:25 ` Darrick J. Wong
  2021-06-07 22:25 ` [PATCH 8/9] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
  2021-06-07 22:25 ` [PATCH 9/9] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
  8 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:25 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Create a polled version of xfs_inactive_force so that we can force
inactivation while holding a lock (usually the umount lock) without
tripping over the softlockup timer.  This is for callers that hold vfs
locks while calling inactivation, which is currently unmount, iunlink
processing during mount, and rw->ro remount.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h |    1 +
 fs/xfs/xfs_linux.h  |    1 +
 fs/xfs/xfs_mount.c  |    2 +-
 fs/xfs/xfs_mount.h  |    7 +++++++
 fs/xfs/xfs_super.c  |   10 +++++++++-
 6 files changed, 62 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8574edca6f52..22090b318e58 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1907,9 +1907,13 @@ xfs_inodegc_free_space(
 	struct xfs_mount	*mp,
 	struct xfs_icwalk	*icw)
 {
+	int			error;
+
 	trace_xfs_inodegc_free_space(mp, icw, _RET_IP_);
 
-	return xfs_icwalk(mp, XFS_ICWALK_INODEGC, icw);
+	error = xfs_icwalk(mp, XFS_ICWALK_INODEGC, icw);
+	wake_up(&mp->m_inactive_wait);
+	return error;
 }
 
 /* Background inode inactivation worker. */
@@ -1959,6 +1963,44 @@ xfs_inodegc_flush(
 	flush_workqueue(mp->m_gc_workqueue);
 }
 
+/*
+ * Force all queued inode inactivation work to run immediately, and poll
+ * until the work is complete.  Callers should only use this function if they
+ * must inactivate inodes while holding VFS mutexes.
+ */
+void
+xfs_inodegc_flush_poll(
+	struct xfs_mount	*mp)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	bool			queued = false;
+
+	/*
+	 * Force to the foreground any inode inactivations that may happen
+	 * while we're running our polling loop.
+	 */
+	set_bit(XFS_OPFLAG_INACTIVATE_NOW_BIT, &mp->m_opflags);
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INODEGC_TAG) {
+		mod_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, 0);
+		queued = true;
+	}
+	if (!queued)
+		goto clear;
+
+	/*
+	 * Touch the softlockup watchdog every 1/10th of a second while there
+	 * are still inactivation-tagged inodes in the filesystem.
+	 */
+	while (!wait_event_timeout(mp->m_inactive_wait,
+				!xfs_has_inodegc_work(mp), HZ / 10)) {
+		touch_softlockup_watchdog();
+	}
+clear:
+	clear_bit(XFS_OPFLAG_INACTIVATE_NOW_BIT, &mp->m_opflags);
+}
+
 /* Stop all queued inactivation work. */
 void
 xfs_inodegc_stop(
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 1f693e7fe6c8..d0ba6778325e 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -82,6 +82,7 @@ void xfs_blockgc_start(struct xfs_mount *mp);
 
 void xfs_inodegc_worker(struct work_struct *work);
 void xfs_inodegc_flush(struct xfs_mount *mp);
+void xfs_inodegc_flush_poll(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
 int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 3d6b0a407d52..9b6f519a9a43 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -61,6 +61,7 @@ typedef __u32			xfs_nlink_t;
 #include <linux/ratelimit.h>
 #include <linux/rhashtable.h>
 #include <linux/xattr.h>
+#include <linux/nmi.h>
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index fbbee8ac12f3..24dc3b7026b7 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1004,7 +1004,7 @@ xfs_unmountfs(
 	 * Flush all the queued inode inactivation work to disk before tearing
 	 * down rt metadata and quotas.
 	 */
-	xfs_inodegc_flush(mp);
+	xfs_inodegc_flush_poll(mp);
 
 	xfs_blockgc_stop(mp);
 	xfs_fs_unreserve_ag_blocks(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c2a8f0a550cd..4323aaa3e7b4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -223,6 +223,11 @@ typedef struct xfs_mount {
 	struct xfs_kobj		m_errortag_kobj;
 #endif
 	struct ratelimit_state	m_inodegc_ratelimit;
+	/*
+	 * Use this to wait for the inode inactivation workqueue to finish
+	 * inactivating all the inodes.
+	 */
+	struct wait_queue_head	m_inactive_wait;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
@@ -263,6 +268,8 @@ typedef struct xfs_mount {
 
 #define XFS_OPFLAG_INODEGC_RUNNING_BIT	(0)	/* are we allowed to start the
 						   inode inactivation worker? */
+#define XFS_OPFLAG_INACTIVATE_NOW_BIT	(1)	/* force foreground inactivation
+						   of unlinked inodes */
 
 /*
  * Max and min values for mount-option defined I/O
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 164626a4232f..968176b35d13 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -699,6 +699,13 @@ xfs_fs_destroy_inode(
 			xfs_check_delalloc(ip, XFS_COW_FORK);
 			ASSERT(0);
 		}
+	} else if (test_bit(XFS_OPFLAG_INACTIVATE_NOW_BIT, &mp->m_opflags)) {
+		/*
+		 * If we're doing foreground inactivation, take care of the
+		 * inode right now and push it straight to reclaim.
+		 */
+		xfs_inactive(ip);
+		need_inactive = false;
 	} else {
 		/*
 		 * Drop dquots for disabled quota types to avoid delaying
@@ -1830,7 +1837,7 @@ xfs_remount_ro(
 	 * Since this can involve finobt updates, do it now before we lose the
 	 * per-AG space reservations to guarantee that we won't fail there.
 	 */
-	xfs_inodegc_flush(mp);
+	xfs_inodegc_flush_poll(mp);
 
 	/* Free the per-AG metadata reservation pool. */
 	error = xfs_fs_unreserve_ag_blocks(mp);
@@ -1956,6 +1963,7 @@ static int xfs_init_fs_context(
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
+	init_waitqueue_head(&mp->m_inactive_wait);
 	/*
 	 * We don't create the finobt per-ag space reservation until after log
 	 * recovery, so we must set this to true so that an ifree transaction


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 8/9] xfs: don't run speculative preallocation gc when fs is frozen
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
                   ` (6 preceding siblings ...)
  2021-06-07 22:25 ` [PATCH 7/9] xfs: create a polled function to force " Darrick J. Wong
@ 2021-06-07 22:25 ` Darrick J. Wong
  2021-06-07 22:25 ` [PATCH 9/9] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
  8 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:25 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have the infrastructure to switch background workers on and
off at will, fix the block gc worker code so that we don't actually run
the worker when the filesystem is frozen, same as we do for deferred
inactivation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   29 ++++++++++++++++++++++++++---
 fs/xfs/xfs_mount.c  |    1 +
 fs/xfs/xfs_mount.h  |    3 +++
 fs/xfs/xfs_super.c  |    5 +++--
 4 files changed, 33 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 22090b318e58..5d5a0c137f32 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -215,6 +215,12 @@ xfs_reclaim_work_queue(
 	rcu_read_unlock();
 }
 
+static inline bool
+xfs_blockgc_running(struct xfs_mount *mp)
+{
+	return test_bit(XFS_OPFLAG_BLOCKGC_RUNNING_BIT, &mp->m_opflags);
+}
+
 /*
  * Background scanning to trim preallocated space. This is queued based on the
  * 'speculative_prealloc_lifetime' tunable (5m by default).
@@ -223,6 +229,9 @@ static inline void
 xfs_blockgc_queue(
 	struct xfs_perag	*pag)
 {
+	if (!xfs_blockgc_running(pag->pag_mount))
+		return;
+
 	rcu_read_lock();
 	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG))
 		queue_delayed_work(pag->pag_mount->m_gc_workqueue,
@@ -1557,7 +1566,8 @@ xfs_blockgc_stop(
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		agno;
 
-	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
+	clear_bit(XFS_OPFLAG_BLOCKGC_RUNNING_BIT, &mp->m_opflags);
+	for_each_perag(mp, agno, pag)
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
 }
 
@@ -1569,6 +1579,7 @@ xfs_blockgc_start(
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		agno;
 
+	set_bit(XFS_OPFLAG_BLOCKGC_RUNNING_BIT, &mp->m_opflags);
 	for_each_perag_tag(mp, agno, pag, XFS_ICI_BLOCKGC_TAG)
 		xfs_blockgc_queue(pag);
 }
@@ -1626,6 +1637,13 @@ xfs_blockgc_scan_inode(
 	unsigned int		lockflags = 0;
 	int			error;
 
+	/*
+	 * Speculative preallocation gc isn't supposed to run when the fs is
+	 * frozen because we don't want kernel threads to block on transaction
+	 * allocation.
+	 */
+	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);
+
 	error = xfs_inode_free_eofblocks(ip, icw, &lockflags);
 	if (error)
 		goto unlock;
@@ -1648,13 +1666,18 @@ xfs_blockgc_worker(
 	struct xfs_mount	*mp = pag->pag_mount;
 	int			error;
 
-	if (!sb_start_write_trylock(mp->m_super))
+	/*
+	 * Queueing of this blockgc worker can race with xfs_blockgc_stop,
+	 * which means that we can be running after the opflag clears.  Double
+	 * check the flag state so that we don't trip asserts.
+	 */
+	if (!xfs_blockgc_running(mp))
 		return;
+
 	error = xfs_icwalk_ag(pag, XFS_ICWALK_BLOCKGC, NULL);
 	if (error)
 		xfs_info(mp, "AG %u preallocation gc worker failed, err=%d",
 				pag->pag_agno, error);
-	sb_end_write(mp->m_super);
 	xfs_blockgc_queue(pag);
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 24dc3b7026b7..d95432e4ac39 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -797,6 +797,7 @@ xfs_mountfs(
 
 	/* Enable background workers. */
 	xfs_inodegc_start(mp);
+	xfs_blockgc_start(mp);
 
 	/*
 	 * Get and sanity-check the root inode.
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4323aaa3e7b4..2da5bd55dd81 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -270,6 +270,9 @@ typedef struct xfs_mount {
 						   inode inactivation worker? */
 #define XFS_OPFLAG_INACTIVATE_NOW_BIT	(1)	/* force foreground inactivation
 						   of unlinked inodes */
+#define XFS_OPFLAG_BLOCKGC_RUNNING_BIT	(2)	/* are we allowed to start the
+						   speculative preallocation gc
+						   worker? */
 
 /*
  * Max and min values for mount-option defined I/O
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 968176b35d13..730f8e960d98 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -852,8 +852,10 @@ xfs_fs_sync_fs(
 	 * s_umount, which means we can't lock out a concurrent thaw request
 	 * without adding another layer of locks to the freeze process.
 	 */
-	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT)
+	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
 		xfs_inodegc_stop(mp);
+		xfs_blockgc_stop(mp);
+	}
 
 	return 0;
 }
@@ -970,7 +972,6 @@ xfs_fs_freeze(
 	 * set a GFP_NOFS context here to avoid recursion deadlocks.
 	 */
 	flags = memalloc_nofs_save();
-	xfs_blockgc_stop(mp);
 	xfs_save_resvblks(mp);
 	ret = xfs_log_quiesce(mp);
 	memalloc_nofs_restore(flags);


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 9/9] xfs: avoid buffer deadlocks when walking fs inodes
  2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
                   ` (7 preceding siblings ...)
  2021-06-07 22:25 ` [PATCH 8/9] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
@ 2021-06-07 22:25 ` Darrick J. Wong
  8 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-07 22:25 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

When we're servicing an INUMBERS or BULKSTAT request or running
quotacheck, grab an empty transaction so that we can use its inherent
recursive buffer locking abilities to detect inode btree cycles without
hitting ABBA buffer deadlocks.  This patch requires the deferred inode
inactivation patchset because xfs_irele cannot directly call
xfs_inactive when the iwalk itself has an (empty) transaction.

Found by fuzzing an inode btree pointer to introduce a cycle into the
tree (xfs/365).

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_itable.c |   42 +++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_iwalk.c  |   33 ++++++++++++++++++++++++++++-----
 2 files changed, 65 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index f331975a16de..84c17a9f9869 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -19,6 +19,7 @@
 #include "xfs_error.h"
 #include "xfs_icache.h"
 #include "xfs_health.h"
+#include "xfs_trans.h"
 
 /*
  * Bulk Stat
@@ -163,6 +164,7 @@ xfs_bulkstat_one(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error;
 
 	if (breq->mnt_userns != &init_user_ns) {
@@ -178,9 +180,18 @@ xfs_bulkstat_one(
 	if (!bc.buf)
 		return -ENOMEM;
 
-	error = xfs_bulkstat_one_int(breq->mp, breq->mnt_userns, NULL,
-				     breq->startino, &bc);
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
 
+	error = xfs_bulkstat_one_int(breq->mp, breq->mnt_userns, tp,
+			breq->startino, &bc);
+	xfs_trans_cancel(tp);
+out:
 	kmem_free(bc.buf);
 
 	/*
@@ -244,6 +255,7 @@ xfs_bulkstat(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error;
 
 	if (breq->mnt_userns != &init_user_ns) {
@@ -259,9 +271,18 @@ xfs_bulkstat(
 	if (!bc.buf)
 		return -ENOMEM;
 
-	error = xfs_iwalk(breq->mp, NULL, breq->startino, breq->flags,
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
+
+	error = xfs_iwalk(breq->mp, tp, breq->startino, breq->flags,
 			xfs_bulkstat_iwalk, breq->icount, &bc);
-
+	xfs_trans_cancel(tp);
+out:
 	kmem_free(bc.buf);
 
 	/*
@@ -374,13 +395,24 @@ xfs_inumbers(
 		.formatter	= formatter,
 		.breq		= breq,
 	};
+	struct xfs_trans	*tp;
 	int			error = 0;
 
 	if (xfs_bulkstat_already_done(breq->mp, breq->startino))
 		return 0;
 
-	error = xfs_inobt_walk(breq->mp, NULL, breq->startino, breq->flags,
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(breq->mp, &tp);
+	if (error)
+		goto out;
+
+	error = xfs_inobt_walk(breq->mp, tp, breq->startino, breq->flags,
 			xfs_inumbers_walk, breq->icount, &ic);
+	xfs_trans_cancel(tp);
+out:
 
 	/*
 	 * We found some inode groups, so clear the error status and return
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 917d51eefee3..7558486f4937 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -83,6 +83,9 @@ struct xfs_iwalk_ag {
 
 	/* Skip empty inobt records? */
 	unsigned int			skip_empty:1;
+
+	/* Drop the (hopefully empty) transaction when calling iwalk_fn. */
+	unsigned int			drop_trans:1;
 };
 
 /*
@@ -352,7 +355,6 @@ xfs_iwalk_run_callbacks(
 	int				*has_more)
 {
 	struct xfs_mount		*mp = iwag->mp;
-	struct xfs_trans		*tp = iwag->tp;
 	struct xfs_inobt_rec_incore	*irec;
 	xfs_agino_t			next_agino;
 	int				error;
@@ -362,10 +364,15 @@ xfs_iwalk_run_callbacks(
 	ASSERT(iwag->nr_recs > 0);
 
 	/* Delete cursor but remember the last record we cached... */
-	xfs_iwalk_del_inobt(tp, curpp, agi_bpp, 0);
+	xfs_iwalk_del_inobt(iwag->tp, curpp, agi_bpp, 0);
 	irec = &iwag->recs[iwag->nr_recs - 1];
 	ASSERT(next_agino >= irec->ir_startino + XFS_INODES_PER_CHUNK);
 
+	if (iwag->drop_trans) {
+		xfs_trans_cancel(iwag->tp);
+		iwag->tp = NULL;
+	}
+
 	error = xfs_iwalk_ag_recs(iwag);
 	if (error)
 		return error;
@@ -376,8 +383,15 @@ xfs_iwalk_run_callbacks(
 	if (!has_more)
 		return 0;
 
+	if (iwag->drop_trans) {
+		error = xfs_trans_alloc_empty(mp, &iwag->tp);
+		if (error)
+			return error;
+	}
+
 	/* ...and recreate the cursor just past where we left off. */
-	error = xfs_inobt_cur(mp, tp, iwag->pag, XFS_BTNUM_INO, curpp, agi_bpp);
+	error = xfs_inobt_cur(mp, iwag->tp, iwag->pag, XFS_BTNUM_INO, curpp,
+			agi_bpp);
 	if (error)
 		return error;
 
@@ -390,7 +404,6 @@ xfs_iwalk_ag(
 	struct xfs_iwalk_ag		*iwag)
 {
 	struct xfs_mount		*mp = iwag->mp;
-	struct xfs_trans		*tp = iwag->tp;
 	struct xfs_perag		*pag = iwag->pag;
 	struct xfs_buf			*agi_bp = NULL;
 	struct xfs_btree_cur		*cur = NULL;
@@ -469,7 +482,7 @@ xfs_iwalk_ag(
 	error = xfs_iwalk_run_callbacks(iwag, &cur, &agi_bp, &has_more);
 
 out:
-	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
+	xfs_iwalk_del_inobt(iwag->tp, &cur, &agi_bp, error);
 	return error;
 }
 
@@ -599,8 +612,18 @@ xfs_iwalk_ag_work(
 	error = xfs_iwalk_alloc(iwag);
 	if (error)
 		goto out;
+	/*
+	 * Grab an empty transaction so that we can use its recursive buffer
+	 * locking abilities to detect cycles in the inobt without deadlocking.
+	 */
+	error = xfs_trans_alloc_empty(mp, &iwag->tp);
+	if (error)
+		goto out;
+	iwag->drop_trans = 1;
 
 	error = xfs_iwalk_ag(iwag);
+	if (iwag->tp)
+		xfs_trans_cancel(iwag->tp);
 	xfs_iwalk_free(iwag);
 out:
 	xfs_perag_put(iwag->pag);


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/9] xfs: refactor the inode recycling code
  2021-06-07 22:24 ` [PATCH 1/9] xfs: refactor the inode recycling code Darrick J. Wong
@ 2021-06-07 22:59   ` Dave Chinner
  2021-06-08  0:14     ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-06-07 22:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Mon, Jun 07, 2021 at 03:24:59PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Hoist the code in xfs_iget_cache_hit that restores the VFS inode state
> to an xfs_inode that was previously vfs-destroyed.  The next patch will
> add a new set of state flags, so we need the helper to avoid
> duplication.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |  139 ++++++++++++++++++++++++++++++---------------------
>  1 file changed, 81 insertions(+), 58 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 4e4682879bbd..4d4aa61fbd34 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -350,19 +350,19 @@ xfs_inew_wait(
>   * need to retain across reinitialisation, and rewrite them into the VFS inode
>   * after reinitialisation even if it fails.
>   */
> -static int
> +static inline int
>  xfs_reinit_inode(
>  	struct xfs_mount	*mp,
>  	struct inode		*inode)

Don't use inline here as it's a pretty big function - it's a static
function so let the compiler decide if inlining is worth it.

Otherwise looks ok.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/9] xfs: refactor the inode recycling code
  2021-06-07 22:59   ` Dave Chinner
@ 2021-06-08  0:14     ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-08  0:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Tue, Jun 08, 2021 at 08:59:06AM +1000, Dave Chinner wrote:
> On Mon, Jun 07, 2021 at 03:24:59PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Hoist the code in xfs_iget_cache_hit that restores the VFS inode state
> > to an xfs_inode that was previously vfs-destroyed.  The next patch will
> > add a new set of state flags, so we need the helper to avoid
> > duplication.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |  139 ++++++++++++++++++++++++++++++---------------------
> >  1 file changed, 81 insertions(+), 58 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 4e4682879bbd..4d4aa61fbd34 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -350,19 +350,19 @@ xfs_inew_wait(
> >   * need to retain across reinitialisation, and rewrite them into the VFS inode
> >   * after reinitialisation even if it fails.
> >   */
> > -static int
> > +static inline int
> >  xfs_reinit_inode(
> >  	struct xfs_mount	*mp,
> >  	struct inode		*inode)
> 
> Don't use inline here as it's a pretty big function - it's a static
> function so let the compiler decide if inlining is worth it.

Fixed.

--D

> Otherwise looks ok.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/9] xfs: deferred inode inactivation
  2021-06-07 22:25 ` [PATCH 2/9] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-06-08  0:57   ` Dave Chinner
  2021-06-08  4:40     ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-06-08  0:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Mon, Jun 07, 2021 at 03:25:04PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> defer the inactivation phase to a separate workqueue.  With this we
> avoid blocking memory reclaim on filesystem metadata updates that are
> necessary to free an in-core inode, such as post-eof block freeing, COW
> staging extent freeing, and truncating and freeing unlinked inodes.  Now
> that work is deferred to a workqueue where we can do the freeing in
> batches.
> 
> We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
> The first flag helps our worker find inodes needing inactivation, and
> the second flag marks inodes that are in the process of being
> inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
> 
> Unfortunately, deferring the inactivation has one huge downside --
> eventual consistency.  Since all the freeing is deferred to a worker
> thread, one can rm a file but the space doesn't come back immediately.
> This can cause some odd side effects with quota accounting and statfs,
> so we also force inactivation scans in order to maintain the existing
> behaviors, at least outwardly.
> 
> For this patch we'll set the delay to zero to mimic the old timing as
> much as possible; in the next patch we'll play with different delay
> settings.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  Documentation/admin-guide/xfs.rst |    3 
>  fs/xfs/scrub/common.c             |    2 
>  fs/xfs/xfs_fsops.c                |    4 
>  fs/xfs/xfs_icache.c               |  364 +++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_icache.h               |   35 +++-
>  fs/xfs/xfs_inode.c                |   60 ++++++
>  fs/xfs/xfs_inode.h                |   15 +-
>  fs/xfs/xfs_log_recover.c          |    7 +
>  fs/xfs/xfs_mount.c                |   29 +++
>  fs/xfs/xfs_mount.h                |    7 +
>  fs/xfs/xfs_qm_syscalls.c          |    4 
>  fs/xfs/xfs_super.c                |  120 +++++++++++-
>  fs/xfs/xfs_trace.h                |   14 +
>  13 files changed, 620 insertions(+), 44 deletions(-)

Big patch. Much as I don't like asking people to do this, I'd like
you to split the "xfs_inode_needs_inactivation()" factoring out of
this patch, just to reduce the amount of churn around the
inactivation callout code in this patch.

There's a couple of other changes around this that should reduce the
churn, too...

> @@ -343,6 +345,8 @@ xfs_fs_counts(
>  	xfs_mount_t		*mp,
>  	xfs_fsop_counts_t	*cnt)
>  {
> +	xfs_inodegc_summary_flush(mp);

What does "summary flush" mean? Doesn't make much sense from here...

>  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
>  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
>  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 4d4aa61fbd34..791202236a18 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -32,6 +32,8 @@
>  #define XFS_ICI_RECLAIM_TAG	0
>  /* Inode has speculative preallocations (posteof or cow) to clean. */
>  #define XFS_ICI_BLOCKGC_TAG	1
> +/* Inode can be inactivated. */
> +#define XFS_ICI_INODEGC_TAG	2
>  
>  /*
>   * The goal for walking incore inodes.  These can correspond with incore inode
> @@ -44,6 +46,7 @@ enum xfs_icwalk_goal {
>  	/* Goals directly associated with tagged inodes. */
>  	XFS_ICWALK_BLOCKGC	= XFS_ICI_BLOCKGC_TAG,
>  	XFS_ICWALK_RECLAIM	= XFS_ICI_RECLAIM_TAG,
> +	XFS_ICWALK_INODEGC	= XFS_ICI_INODEGC_TAG,
>  };
>  
>  #define XFS_ICWALK_NULL_TAG	(-1U)
> @@ -228,6 +231,26 @@ xfs_blockgc_queue(
>  	rcu_read_unlock();
>  }
>  
> +static inline bool
> +xfs_inodegc_running(struct xfs_mount *mp)
> +{
> +	return test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> +}

Ok, these opflags are new. Not a big fan of the naming, more on that
later.


> +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> +static void
> +xfs_inodegc_queue(
> +	struct xfs_mount        *mp)
> +{
> +	if (!xfs_inodegc_running(mp))
> +		return;
> +
> +	rcu_read_lock();
> +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
> +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> +	rcu_read_unlock();
> +}

I have no idea why we are checking if the gc is running here. All
our other background stuff runs and re-queues until it is directly
stopped or there's nothing left in the tree. Hence I'm a bit
clueless right now about what this semantic is for...

> +
>  /* Set a tag on both the AG incore inode tree and the AG radix tree. */
>  static void
>  xfs_perag_set_inode_tag(
> @@ -262,6 +285,9 @@ xfs_perag_set_inode_tag(
>  	case XFS_ICI_BLOCKGC_TAG:
>  		xfs_blockgc_queue(pag);
>  		break;
> +	case XFS_ICI_INODEGC_TAG:
> +		xfs_inodegc_queue(mp);
> +		break;
>  	}

And there's the on-demand start...

> @@ -308,18 +334,28 @@ xfs_perag_clear_inode_tag(
>   */
>  void
>  xfs_inode_mark_reclaimable(
> -	struct xfs_inode	*ip)
> +	struct xfs_inode	*ip,
> +	bool			need_inactive)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_perag	*pag;
> +	unsigned int		tag;
>  
>  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
>  	spin_lock(&pag->pag_ici_lock);
>  	spin_lock(&ip->i_flags_lock);
>  
> -	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			XFS_ICI_RECLAIM_TAG);
> -	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> +	if (need_inactive) {
> +		trace_xfs_inode_set_need_inactive(ip);
> +		ip->i_flags |= XFS_NEED_INACTIVE;
> +		tag = XFS_ICI_INODEGC_TAG;
> +	} else {
> +		trace_xfs_inode_set_reclaimable(ip);
> +		ip->i_flags |= XFS_IRECLAIMABLE;
> +		tag = XFS_ICI_RECLAIM_TAG;
> +	}
> +
> +	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);


Hmmmm. Rather than passing a boolean into this function that
indicates what needs to be done, why not move all the inactivation
stuff into this function? i.e. move the
xfs_inode_needs_inactivation() check into here instead of splitting
the inactivation and reclaim logic over xfs_fs_destroy_inode() and
this function?

>  
>  	spin_unlock(&ip->i_flags_lock);
>  	spin_unlock(&pag->pag_ici_lock);
> @@ -383,19 +419,26 @@ xfs_reinit_inode(
>  static int
>  xfs_iget_recycle(
>  	struct xfs_perag	*pag,
> -	struct xfs_inode	*ip) __releases(&ip->i_flags_lock)
> +	struct xfs_inode	*ip,
> +	unsigned long		iflag) __releases(&ip->i_flags_lock)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct inode		*inode = VFS_I(ip);
> +	unsigned int		tag;
>  	int			error;
>  
> +	ASSERT(iflag == XFS_IRECLAIM || iflag == XFS_INACTIVATING);
> +
> +	tag = (iflag == XFS_INACTIVATING) ? XFS_ICI_INODEGC_TAG :
> +					    XFS_ICI_RECLAIM_TAG;

I don't like ternaries in code like this - just use an if-else,
or combine the assert into a switch:

	switch(iflag) {
	case XFS_INACTIVATING:
		tag = XFS_ICI_INODEGC_TAG;
		break;
	case XFS_IRECLAIM:
		tag = XFS_ICI_RECLAIM_TAG;
		break;
	default:
		ASSERT(0);
		return -EINVAL;
	}


>  	/*
>  	 * We need to make it look like the inode is being reclaimed to prevent
>  	 * the actual reclaim workers from stomping over us while we recycle
>  	 * the inode.  We can't clear the radix tree tag yet as it requires
>  	 * pag_ici_lock to be held exclusive.
>  	 */
> -	ip->i_flags |= XFS_IRECLAIM;
> +	ip->i_flags |= iflag;
>  
>  	spin_unlock(&ip->i_flags_lock);
>  	rcu_read_unlock();
> @@ -412,10 +455,13 @@ xfs_iget_recycle(
>  		rcu_read_lock();
>  		spin_lock(&ip->i_flags_lock);
>  		wake = !!__xfs_iflags_test(ip, XFS_INEW);
> -		ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
> +		ip->i_flags &= ~(XFS_INEW | iflag);
>  		if (wake)
>  			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
> -		ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
> +		if (iflag == XFS_IRECLAIM)
> +			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
> +		if (iflag == XFS_INACTIVATING)
> +			ASSERT(ip->i_flags & XFS_NEED_INACTIVE);
>  		spin_unlock(&ip->i_flags_lock);
>  		rcu_read_unlock();
>  		return error;
> @@ -431,8 +477,7 @@ xfs_iget_recycle(
>  	 */
>  	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
>  	ip->i_flags |= XFS_INEW;
> -	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			XFS_ICI_RECLAIM_TAG);
> +	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
>  	inode->i_state = I_NEW;
>  	spin_unlock(&ip->i_flags_lock);
>  	spin_unlock(&pag->pag_ici_lock);
> @@ -455,6 +500,13 @@ xfs_iget_check_free_state(
>  	struct xfs_inode	*ip,
>  	int			flags)
>  {
> +	/*
> +	 * Unlinked inodes awaiting inactivation must not be reused until we
> +	 * have a chance to clear the on-disk metadata.
> +	 */
> +	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
> +		return -ENOENT;

Hmmmm. That's messy. The actual situation here is inodes that are on
the unlinked list but have no VFS references need to be avoided.
This should only happen in the cache hit case, so I don't think this
belongs in xfs_iget_check_free_state() as that gets called from the
cache miss case, too.

Indeed, I think this is a case where we need to explicitly skip the
inode in lookup, because we cannot actually recycle or re-use these
inodes until they've been removed from the unlinked list. i.e. it's
a primary selection criteria for a cache hit, not some that should
be hidden in a separate function....

> +
>  	if (flags & XFS_IGET_CREATE) {
>  		/* should be a free inode */
>  		if (VFS_I(ip)->i_mode != 0) {
> @@ -521,7 +573,7 @@ xfs_iget_cache_hit(
>  	 *	     wait_on_inode to wait for these flags to be cleared
>  	 *	     instead of polling for it.
>  	 */
> -	if (ip->i_flags & (XFS_INEW|XFS_IRECLAIM)) {
> +	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING)) {
>  		trace_xfs_iget_skip(ip);
>  		XFS_STATS_INC(mp, xs_ig_frecycle);
>  		error = -EAGAIN;
> @@ -549,11 +601,29 @@ xfs_iget_cache_hit(
>  		}
>  
>  		/* Drops i_flags_lock and RCU read lock. */
> -		error = xfs_iget_recycle(pag, ip);
> +		error = xfs_iget_recycle(pag, ip, XFS_IRECLAIM);
>  		if (error) {
>  			trace_xfs_iget_reclaim_fail(ip);
>  			return error;
>  		}
> +	} else if (ip->i_flags & XFS_NEED_INACTIVE) {
> +		/*
> +		 * If NEED_INACTIVE is set, we've torn down the VFS inode
> +		 * already, and must carefully restore it to usable state.
> +		 */
> +		trace_xfs_iget_inactive(ip);
> +
> +		if (flags & XFS_IGET_INCORE) {
> +			error = -EAGAIN;
> +			goto out_error;
> +		}

And that's also a primary selection criteria. :)

> +
> +		/* Drops i_flags_lock and RCU read lock. */
> +		error = xfs_iget_recycle(pag, ip, XFS_INACTIVATING);
> +		if (error) {
> +			trace_xfs_iget_inactive_fail(ip);
> +			return error;
> +		}
>  	} else {
>  		/* If the VFS inode is being torn down, pause and try again. */
>  		if (!igrab(inode)) {

Overall, I think this is kinda messy in that it smears the logic
boundaries in the function. The cache hit code is structured as

	check inode validity
	  skip inode
	check inode reusuability state
	  skip inode
	check inode reclaim state
	  recycle inode
	or
	  grab VFS inode


I think it would be better to restructure this to end up looking
like this:

	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING))
		goto out_skip;

	if ((flags & XFS_IGET_INCORE) &&
	    (ip->i_flags & (XFS_IRECLAIMABLE | XFS_NEED_INACTIVE)))
		goto out_skip;

	error = xfs_iget_check_free_state(ip, flags);
	if (error)
		goto out_error;

	if (ip->i_flags & (XFS_IRECLAIMABLE | XFS_NEED_INACTIVE)) {
		trace_xfs_iget_recycle(ip);

		/* Drops i_flags_lock and RCU read lock. */
		error = xfs_iget_recycle(pag, ip);
		if (error) {
			trace_xfs_iget_recycle_fail(ip);
			return error;
		}
	} else {
		/* igrab */
	}
....
out_skip:
	trace_xfs_iget_skip(ip);
	error = -EAGAIN;
out_error:
	....
}

Where the details of what to do to recycle the inode is handled
entirely within xfs_iget_recycle(), rather than splitting the logic
over two functions. We don't need separate trace points for reclaim
vs inactivation recycling - we've got that information in the inode
flags that the trace point should be emitting.

The above gives us a much cleaner cache hit path, and gets all of
the slow path stuff (recycling inodes) out of the normal lookup
path.


> @@ -845,22 +915,33 @@ xfs_dqrele_igrab(
>  
>  	/*
>  	 * Skip inodes that are anywhere in the reclaim machinery because we
> -	 * drop dquots before tagging an inode for reclamation.
> +	 * drop dquots before tagging an inode for reclamation.  If the inode
> +	 * is being inactivated, skip it because inactivation will drop the
> +	 * dquots for us.
>  	 */
> -	if (ip->i_flags & (XFS_IRECLAIM | XFS_IRECLAIMABLE))
> +	if (ip->i_flags & (XFS_IRECLAIM | XFS_IRECLAIMABLE | XFS_INACTIVATING))
>  		goto out_unlock;
>  
>  	/*
> -	 * The inode looks alive; try to grab a VFS reference so that it won't
> -	 * get destroyed.  If we got the reference, return true to say that
> -	 * we grabbed the inode.
> +	 * If the inode is queued but not undergoing inactivation, set the
> +	 * inactivating flag so everyone will leave it alone and return true
> +	 * to say that we are taking ownership of it.
> +	 *
> +	 * Otherwise, the inode looks alive; try to grab a VFS reference so
> +	 * that it won't get destroyed.  If we got the reference, return true
> +	 * to say that we grabbed the inode.
>  	 *
>  	 * If we can't get the reference, then we know the inode had its VFS
>  	 * state torn down and hasn't yet entered the reclaim machinery.  Since
>  	 * we also know that dquots are detached from an inode before it enters
>  	 * reclaim, we can skip the inode.
>  	 */
> -	ret = igrab(VFS_I(ip)) != NULL;
> +	if (ip->i_flags & XFS_NEED_INACTIVE) {
> +		ip->i_flags |= XFS_INACTIVATING;
> +		ret = true;
> +	} else {
> +		ret = igrab(VFS_I(ip)) != NULL;
> +	}

	ret = true;
	if (ip->i_flags & XFS_NEED_INACTIVE)
		ip->i_flags |= XFS_INACTIVATING;
	else if (!igrab(VFS_I(ip)))
		ret = false;

>  /* Don't try to run block gc on an inode that's in any of these states. */
>  #define XFS_BLOCKGC_NOGRAB_IFLAGS	(XFS_INEW | \
> +					 XFS_NEED_INACTIVE | \
> +					 XFS_INACTIVATING | \
>  					 XFS_IRECLAIMABLE | \
>  					 XFS_IRECLAIM)
>  /*
> @@ -1636,6 +1734,229 @@ xfs_blockgc_free_quota(
>  			xfs_inode_dquot(ip, XFS_DQTYPE_PROJ), iwalk_flags);
>  }
>  
> +/*
> + * Inode Inactivation and Reclaimation
> + * ===================================
> + *
> + * Sometimes, inodes need to have work done on them once the last program has
> + * closed the file.  Typically this means cleaning out any leftover speculative
> + * preallocations after EOF or in the CoW fork.  For inodes that have been
> + * totally unlinked, this means unmapping data/attr/cow blocks, removing the
> + * inode from the unlinked buckets, and marking it free in the inobt and inode
> + * table.
> + *
> + * This process can generate many metadata updates, which shows up as close()
> + * and unlink() calls that take a long time.  We defer all that work to a
> + * workqueue which means that we can batch a lot of work and do it in inode
> + * order for better performance.  Furthermore, we can control the workqueue,
> + * which means that we can avoid doing inactivation work at a bad time, such as
> + * when the fs is frozen.
> + *
> + * Deferred inactivation introduces new inode flag states (NEED_INACTIVE and
> + * INACTIVATING) and adds a new INODEGC radix tree tag for fast access.  We
> + * maintain separate perag counters for both types, and move counts as inodes
> + * wander the state machine, which now works as follows:
> + *
> + * If the inode needs inactivation, we:
> + *   - Set the NEED_INACTIVE inode flag
> + *   - Increment the per-AG inactive count
> + *   - Set the ICI_INODEGC tag in the per-AG inode tree
> + *   - Set the ICI_INODEGC tag in the per-fs AG tree
> + *   - Schedule background inode inactivation
> + *
> + * If the inode does not need inactivation, we:
> + *   - Set the IRECLAIMABLE inode flag
> + *   - Increment the per-AG reclaim count
> + *   - Set the ICI_RECLAIM tag in the per-AG inode tree
> + *   - Set the ICI_RECLAIM tag in the per-fs AG tree
> + *   - Schedule background inode reclamation
> + *
> + * When it is time for background inode inactivation, we:
> + *   - Set the INACTIVATING inode flag
> + *   - Make all the on-disk updates
> + *   - Clear both INACTIVATING and NEED_INACTIVE inode flags
> + *   - Decrement the per-AG inactive count
> + *   - Clear the ICI_INODEGC tag in the per-AG inode tree
> + *   - Clear the ICI_INODEGC tag in the per-fs AG tree if we just inactivated
> + *     the last inode in the AG
> + *   - Kick the inode into reclamation per the previous paragraph
> + *
> + * When it is time for background inode reclamation, we:
> + *   - Set the IRECLAIM inode flag
> + *   - Detach all the resources and remove the inode from the per-AG inode tree
> + *   - Clear both IRECLAIM and RECLAIMABLE inode flags

That's wrong - we never clear the IRECLAIM state on a reclaimed
inode - it is carried into the slab cache when it is freed so that
racing RCU lookups will always see the IRECLAIM state and skip the
inode and retry.

> + *   - Decrement the per-AG reclaim count
> + *   - Clear the ICI_RECLAIM tag from the per-AG inode tree
> + *   - Clear the ICI_RECLAIM tag from the per-fs AG tree if we just reclaimed
> + *     the last inode in the AG
> + *
> + * When these state transitions occur, the caller must have taken the per-AG
> + * incore inode tree lock and then the inode i_flags lock, in that order.
> + */

While the comment is good, describing what the code does is just
going to get out of date as we modify this in future. I'd drop all
the description of inode/perAG tag and tracking management and just
replace them with:

 * If the inode needs inactivation, we:
 *   - Set the NEED_INACTIVE inode flag
 *   - Schedule background inode inactivation
 *
 * If the inode does not need inactivation, we:
 *   - Set the IRECLAIMABLE inode flag
 *   - Schedule background inode reclamation
 *
 * If the inode is being inactivated, we:
 *   - Set the INACTIVATING inode flag
 *   - Make all the on-disk updates
 *   - Clear the inactive state and set the IRECLAIMABLE inode flag
 *   - Schedule background inode reclamation
 *
 * If the inode is being reclaimed, we:
 *   - Set the IRECLAIM inode flag
 *   - Reclaim the inode and RCU free it.

> +/*
> + * Decide if the given @ip is eligible for inactivation, and grab it if so.
> + * Returns true if it's ready to go or false if we should just ignore it.
> + */
> +static bool
> +xfs_inodegc_igrab(
> +	struct xfs_inode	*ip)
> +{
> +	ASSERT(rcu_read_lock_held());
> +
> +	/* Check for stale RCU freed inode */
> +	spin_lock(&ip->i_flags_lock);
> +	if (!ip->i_ino)
> +		goto out_unlock_noent;
> +
> +	/*
> +	 * Skip inodes that don't need inactivation or are being inactivated
> +	 * (or reactivated) by another thread.  Inodes should not be tagged
> +	 * for inactivation while also in INEW or any reclaim state.
> +	 */
> +	if (!(ip->i_flags & XFS_NEED_INACTIVE) ||
> +	    (ip->i_flags & XFS_INACTIVATING))
> +		goto out_unlock_noent;
> +
> +	/*
> +	 * Mark this inode as being inactivated even if the fs is shut down
> +	 * because we need xfs_inodegc_inactivate to push this inode into the
> +	 * reclaim state.
> +	 */

These two comments really should go at the head of the function.
i.e. if you define what "eligible for inactivation" where it is
stated, then you don't need these comments in the code.

> +	ip->i_flags |= XFS_INACTIVATING;
> +	spin_unlock(&ip->i_flags_lock);
> +	return true;
> +
> +out_unlock_noent:
> +	spin_unlock(&ip->i_flags_lock);
> +	return false;
> +}
> +
> +/*
> + * Free all speculative preallocations and possibly even the inode itself.
> + * This is the last chance to make changes to an otherwise unreferenced file
> + * before incore reclamation happens.
> + */
> +static int
> +xfs_inodegc_inactivate(
> +	struct xfs_inode	*ip,
> +	struct xfs_perag	*pag,
> +	struct xfs_icwalk	*icw)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> +
> +	/*
> +	 * Inactivation isn't supposed to run when the fs is frozen because
> +	 * we don't want kernel threads to block on transaction allocation.
> +	 */
> +	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
> +
> +	/*
> +	 * Foreground threads that have hit ENOSPC or EDQUOT are allowed to
> +	 * pass in a icw structure to look for inodes to inactivate
> +	 * immediately to free some resources.  If this inode isn't a match,
> +	 * put it back on the shelf and move on.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	if (!xfs_icwalk_match(ip, icw)) {
> +		ip->i_flags &= ~XFS_INACTIVATING;
> +		spin_unlock(&ip->i_flags_lock);
> +		return 0;
> +	}
> +	spin_unlock(&ip->i_flags_lock);
> +
> +	trace_xfs_inode_inactivating(ip);
> +
> +	xfs_inactive(ip);
> +	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> +
> +	/*
> +	 * Move the inode from the inactivation phase to the reclamation phase
> +	 * by clearing both inactivation inode state flags and marking the
> +	 * inode reclaimable.  Schedule background reclaim to run later.
> +	 */

Don't describe the code in the comment.

	/* Now schedule the inactivated inode for reclaim. */

> +	spin_lock(&pag->pag_ici_lock);
> +	spin_lock(&ip->i_flags_lock);
> +
> +	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> +	ip->i_flags |= XFS_IRECLAIMABLE;
> +
> +	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_INODEGC_TAG);
> +	xfs_perag_set_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
> +
> +	spin_unlock(&ip->i_flags_lock);
> +	spin_unlock(&pag->pag_ici_lock);
> +
> +	return 0;
> +}
> +
> +/* Walk the fs and inactivate the inodes in them. */
> +int
> +xfs_inodegc_free_space(
> +	struct xfs_mount	*mp,
> +	struct xfs_icwalk	*icw)
> +{
> +	trace_xfs_inodegc_free_space(mp, icw, _RET_IP_);
> +
> +	return xfs_icwalk(mp, XFS_ICWALK_INODEGC, icw);
> +}
> +
> +/* Background inode inactivation worker. */
> +void
> +xfs_inodegc_worker(
> +	struct work_struct	*work)
> +{
> +	struct xfs_mount	*mp = container_of(to_delayed_work(work),
> +					struct xfs_mount, m_inodegc_work);
> +	int			error;
> +
> +	/*
> +	 * Queueing of this inodegc worker can race with xfs_inodegc_stop,
> +	 * which means that we can be running after the opflag clears.  Double
> +	 * check the flag state so that we don't trip asserts.
> +	 */
> +	if (!xfs_inodegc_running(mp))
> +		return;
> +
> +	error = xfs_inodegc_free_space(mp, NULL);
> +	if (error && error != -EAGAIN)
> +		xfs_err(mp, "inode inactivation failed, error %d", error);
> +
> +	xfs_inodegc_queue(mp);
> +}
> +
> +/* Force all currently queued inode inactivation work to run immediately. */
> +void
> +xfs_inodegc_flush(
> +	struct xfs_mount	*mp)
> +{
> +	if (!xfs_inodegc_running(mp) ||
> +	    !radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
> +		return;
> +
> +	mod_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> +	flush_delayed_work(&mp->m_inodegc_work);

Doesn't flush_delayed_work() immediately schedule any delayed work?

Yup, it does:

/**                                                                              
 * flush_delayed_work - wait for a dwork to finish executing the last queueing   
 * @dwork: the delayed work to flush                                             
 *                                                                               
 * Delayed timer is cancelled and the pending work is queued for                 
 * immediate execution.  Like flush_work(), this function only                   
 * considers the last queueing instance of @dwork.                               
....

So no need for the mod_delayed_work() call here.

Also, if the gc is not running or there's nothing in the radix tree,
there is no queued work and so just calling flush_delayed_work()
would be a no-op, right?

> +}
> +
> +/* Stop all queued inactivation work. */
> +void
> +xfs_inodegc_stop(
> +	struct xfs_mount	*mp)
> +{
> +	clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> +}

what's to stop racing invocations of stop/start? Perhaps:

	if (test_and_clear_bit())
		cancel_delayed_work_sync(&mp->m_inodegc_work);
> +
> +/* Schedule deferred inode inactivation work. */
> +void
> +xfs_inodegc_start(
> +	struct xfs_mount	*mp)
> +{
> +	set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> +	xfs_inodegc_queue(mp);
> +}

	if (test_and_set_bit())
		xfs_inodegc_queue(mp);

So that the running state will remain in sync with the actual queue
operation? Though I'm still not sure why we need the running bit...

> @@ -80,4 +80,37 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
>  void xfs_blockgc_stop(struct xfs_mount *mp);
>  void xfs_blockgc_start(struct xfs_mount *mp);
>  
> +void xfs_inodegc_worker(struct work_struct *work);
> +void xfs_inodegc_flush(struct xfs_mount *mp);
> +void xfs_inodegc_stop(struct xfs_mount *mp);
> +void xfs_inodegc_start(struct xfs_mount *mp);
> +int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
> +
> +/*
> + * Process all pending inode inactivations immediately (sort of) so that a
> + * resource usage report will be mostly accurate with regards to files that
> + * have been unlinked recently.
> + *
> + * It isn't practical to maintain a count of the resources used by unlinked
> + * inodes to adjust the values reported by this function.  Resources that are
> + * shared (e.g. reflink) when an inode is queued for inactivation cannot be
> + * counted towards the adjustment, and cross referencing data extents with the
> + * refcount btree is the only way to decide if a resource is shared.  Worse,
> + * unsharing of any data blocks in the system requires either a second
> + * consultation with the refcount btree, or training users to deal with the
> + * free space counts possibly fluctuating upwards as inactivations occur.
> + *
> + * Hence we guard the inactivation flush with a ratelimiter so that the counts
> + * are not way out of whack while ignoring workloads that hammer us with statfs
> + * calls.  Once per clock tick seems frequent enough to avoid complaints about
> + * inaccurate counts.
> + */
> +static inline void
> +xfs_inodegc_summary_flush(
> +	struct xfs_mount	*mp)
> +{
> +	if (__ratelimit(&mp->m_inodegc_ratelimit))
> +		xfs_inodegc_flush(mp);
> +}

ONce per clock tick is still quite fast - once a millisecond on a
1000Hz kernel. I'd prefer that we use a known timer base for this
sort of thing, not something that changes with kernel config. Every
100ms, perhaps?

> @@ -937,9 +953,9 @@ xfs_mountfs(
>  	/* Clean out dquots that might be in memory after quotacheck. */
>  	xfs_qm_unmount(mp);
>  	/*
> -	 * Flush all inode reclamation work and flush the log.
> -	 * We have to do this /after/ rtunmount and qm_unmount because those
> -	 * two will have scheduled delayed reclaim for the rt/quota inodes.
> +	 * Flush all inode reclamation work and flush inodes to the log.  Do
> +	 * this after rtunmount and qm_unmount because those two will have
> +	 * released the rt and quota inodes.
>  	 *
>  	 * This is slightly different from the unmountfs call sequence
>  	 * because we could be tearing down a partially set up mount.  In
> @@ -947,6 +963,7 @@ xfs_mountfs(
>  	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
>  	 * quota inodes.
>  	 */
> +	xfs_inodegc_flush(mp);
>  	xfs_unmount_flush_inodes(mp);

Why isn't xfs_inodegc_flush() part of xfs_unmount_flush_inodes()?
Because, really, xfs_unmount_flush_inodes() depends on all the
inodes first being inactivated so that all transactions on inodes
are complete....

>   out_log_dealloc:
>  	xfs_log_mount_cancel(mp);
> @@ -983,6 +1000,12 @@ xfs_unmountfs(
>  	uint64_t		resblks;
>  	int			error;
>  
> +	/*
> +	 * Flush all the queued inode inactivation work to disk before tearing
> +	 * down rt metadata and quotas.
> +	 */
> +	xfs_inodegc_flush(mp);
> +
>  	xfs_blockgc_stop(mp);
>  	xfs_fs_unreserve_ag_blocks(mp);
>  	xfs_qm_unmount_quotas(mp);

FWIW, there's inconsistency in the order of operations between
failure handling in xfs_mountfs() w.r.t. inode flushing and quotas
vs what xfs_unmountfs() is now doing....

>  	uint8_t			m_rt_checked;
>  	uint8_t			m_rt_sick;
>  
> +	unsigned long		m_opflags;
> +
>  	/*
>  	 * End of read-mostly variables. Frequently written variables and locks
>  	 * should be placed below this comment from now on. The first variable
> @@ -184,6 +186,7 @@ typedef struct xfs_mount {
>  	uint64_t		m_resblks_avail;/* available reserved blocks */
>  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
>  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
> +	struct delayed_work	m_inodegc_work; /* background inode inactive */
>  	struct xfs_kobj		m_kobj;
>  	struct xfs_kobj		m_error_kobj;
>  	struct xfs_kobj		m_error_meta_kobj;
> @@ -220,6 +223,7 @@ typedef struct xfs_mount {
>  	unsigned int		*m_errortag;
>  	struct xfs_kobj		m_errortag_kobj;
>  #endif
> +	struct ratelimit_state	m_inodegc_ratelimit;
>  } xfs_mount_t;
>  
>  #define M_IGEO(mp)		(&(mp)->m_ino_geo)
> @@ -258,6 +262,9 @@ typedef struct xfs_mount {
>  #define XFS_MOUNT_DAX_ALWAYS	(1ULL << 26)
>  #define XFS_MOUNT_DAX_NEVER	(1ULL << 27)
>  
> +#define XFS_OPFLAG_INODEGC_RUNNING_BIT	(0)	/* are we allowed to start the
> +						   inode inactivation worker? */
> +
Ok, "opflags" are undocumented as to how they work, what their
consistency model is, etc. I understand you want an atomic flag to
indicate that something is running, and mp->m_flags is not that
(despite being used that way historically). 

I dislike the "_BIT" annotations for a variable that is only to be
used as an index bit field. Or maybe it's a flag field and you
haven't defined any bitwise flags for it because you're not using it
that way yet.

So, is m_opflags an indexed bit field or a normal flag field like
m_flags?

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 3a7fd4f02aa7..120a4426fd64 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -629,6 +629,43 @@ xfs_check_delalloc(
>  #define xfs_check_delalloc(ip, whichfork)	do { } while (0)
>  #endif
>  
> +#ifdef CONFIG_XFS_QUOTA
> +/*
> + * If a quota type is turned off but we still have a dquot attached to the
> + * inode, detach it before tagging this inode for inactivation (or reclaim) to
> + * avoid delaying quotaoff for longer than is necessary.  Because the inode has
> + * no VFS state and has not yet been tagged for reclaim or inactivation, it is
> + * safe to drop the dquots locklessly because iget, quotaoff, blockgc, and
> + * reclaim will not touch the inode.
> + */
> +static inline void
> +xfs_fs_dqdestroy_inode(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +
> +	if (!XFS_IS_UQUOTA_ON(mp)) {
> +		xfs_qm_dqrele(ip->i_udquot);
> +		ip->i_udquot = NULL;
> +	}
> +	if (!XFS_IS_GQUOTA_ON(mp)) {
> +		xfs_qm_dqrele(ip->i_gdquot);
> +		ip->i_gdquot = NULL;
> +	}
> +	if (!XFS_IS_PQUOTA_ON(mp)) {
> +		xfs_qm_dqrele(ip->i_pdquot);
> +		ip->i_pdquot = NULL;
> +	}
> +}
> +#else
> +# define xfs_fs_dqdestroy_inode(ip)		((void)0)
> +#endif

This should sit alongside xfs_qm_dqdetach() in the quota code, not
require additional #ifdef CONFIG_XFS_QUOTA blocks in the code here.

This could also be split out into a separate patch to reduce the
size of this one...

> +
> +/* iflags that we shouldn't see before scheduling reclaim or inactivation. */
> +#define XFS_IDESTROY_BAD_IFLAGS	(XFS_IRECLAIMABLE | \
> +				 XFS_IRECLAIM | \
> +				 XFS_NEED_INACTIVE | \
> +				 XFS_INACTIVATING)
>  /*
>   * Now that the generic code is guaranteed not to be accessing
>   * the linux inode, we can inactivate and reclaim the inode.
> @@ -638,28 +675,44 @@ xfs_fs_destroy_inode(
>  	struct inode		*inode)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_mount	*mp = ip->i_mount;
> +	bool			need_inactive;
>  
>  	trace_xfs_destroy_inode(ip);
>  
>  	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> -	XFS_STATS_INC(ip->i_mount, vn_rele);
> -	XFS_STATS_INC(ip->i_mount, vn_remove);
> +	XFS_STATS_INC(mp, vn_rele);
> +	XFS_STATS_INC(mp, vn_remove);
>  
> -	xfs_inactive(ip);
> +	need_inactive = xfs_inode_needs_inactivation(ip);
> +	if (!need_inactive) {
> +		/*
> +		 * If the inode doesn't need inactivation, that means we're
> +		 * going directly into reclaim and can drop the dquots.  It
> +		 * also means that there shouldn't be any delalloc reservations
> +		 * or speculative CoW preallocations remaining.
> +		 */
> +		xfs_qm_dqdetach(ip);
>  
> -	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
> -		xfs_check_delalloc(ip, XFS_DATA_FORK);
> -		xfs_check_delalloc(ip, XFS_COW_FORK);
> -		ASSERT(0);
> +		if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
> +			xfs_check_delalloc(ip, XFS_DATA_FORK);
> +			xfs_check_delalloc(ip, XFS_COW_FORK);
> +			ASSERT(0);
> +		}

These checks should happen when the inode is marked IRECLAIMABLE,
not here. This comes back to my comments about moving all the
"need_inactive" logic into xfs_inode_mark_reclaimable() - this
needs to be checked after inactivation, not just here where the
inode doesn't require inactivation....

IOWs, xfs_fs_destroy_inode() should largely become a shell function
with accounting and state checks, and not much else...

>  static void
> @@ -780,6 +833,21 @@ xfs_fs_sync_fs(
>  		flush_delayed_work(&mp->m_log->l_work);
>  	}
>  
> +	/*
> +	 * If the fs is at FREEZE_PAGEFAULTS, that means the VFS holds the
> +	 * umount mutex and is syncing the filesystem just before setting the
> +	 * state to FREEZE_FS.  We are not allowed to run transactions on a
> +	 * filesystem that is in FREEZE_FS state, so deactivate the background
> +	 * workers before we get there, and leave them off for the duration of
> +	 * the freeze.
> +	 *
> +	 * We can't do this in xfs_fs_freeze_super because freeze_super takes
> +	 * s_umount, which means we can't lock out a concurrent thaw request
> +	 * without adding another layer of locks to the freeze process.
> +	 */
> +	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT)
> +		xfs_inodegc_stop(mp);

Damn, that's an ugly hack.

SO the problem is that xfs_fs_freeze() is called after SB_FREEZE_FS
is set and so transactions won't run, hence it's too late to stop
the inode gc from running? i.e. it might already be blocked on
sb_start_intwrite() in xfs_trans_alloc()?

Which tends to imply that inactivation transactions should use
XFS_TRANS_NO_WRITECOUNT and then xfs_inodegc_stop() blocks waiting
for all the inactivation transactions to complete in xfs_fs_freeze()
before we do anything else.

I'd prefer we have a formal mechanism for handling this -
inactivation is something unknown to and hidden underneath the VFS,
so it's not considered in the VFS freeze mechanisms. Hence I think
it's fine to use our own mechanism in xfs_fs_freeze() to synchronise
as we need to....

> +
>  	return 0;
>  }
>  
> @@ -798,6 +866,8 @@ xfs_fs_statfs(
>  	xfs_extlen_t		lsize;
>  	int64_t			ffree;
>  
> +	xfs_inodegc_summary_flush(mp);

I suspect that's really going to hurt stat performance. I guess
benchmarks are in order...

> @@ -908,10 +978,27 @@ xfs_fs_unfreeze(
>  
>  	xfs_restore_resvblks(mp);
>  	xfs_log_work_queue(mp);
> +	xfs_inodegc_start(mp);
>  	xfs_blockgc_start(mp);
>  	return 0;
>  }
>  
> +STATIC int
> +xfs_fs_freeze_super(
> +	struct super_block	*sb)
> +{
> +	struct xfs_mount	*mp = XFS_M(sb);
> +
> +	/*
> +	 * Before we take s_umount to get to FREEZE_WRITE, flush all the
> +	 * accumulated background work so that there's less recovery work
> +	 * to do if we crash during the freeze.
> +	 */
> +	xfs_inodegc_flush(mp);
> +
> +	return freeze_super(sb);
> +}
> +
>  /*
>   * This function fills in xfs_mount_t fields based on mount args.
>   * Note: the superblock _has_ now been read in.
> @@ -1090,6 +1177,7 @@ static const struct super_operations xfs_super_operations = {
>  	.show_options		= xfs_fs_show_options,
>  	.nr_cached_objects	= xfs_fs_nr_cached_objects,
>  	.free_cached_objects	= xfs_fs_free_cached_objects,
> +	.freeze_super		= xfs_fs_freeze_super,
>  };

Ugh, so we have high level freeze control, but not low level.
Really, if we have to stop inode gc while freezing, then I'd prefer
we either do it here or in xfs_fs_freeze() than using the hack
you've got in place now...

>  
>  static int
> @@ -1737,6 +1825,13 @@ xfs_remount_ro(
>  		return error;
>  	}
>  
> +	/*
> +	 * Perform all on-disk metadata updates required to inactivate inodes.
> +	 * Since this can involve finobt updates, do it now before we lose the
> +	 * per-AG space reservations to guarantee that we won't fail there.
> +	 */
> +	xfs_inodegc_flush(mp);
> +

Ummm - from this point onwards there are no modifications for
inactivation - why not just turn inodegc off completely?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/9] xfs: expose sysfs knob to control inode inactivation delay
  2021-06-07 22:25 ` [PATCH 3/9] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
@ 2021-06-08  1:09   ` Dave Chinner
  2021-06-08  2:02     ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-06-08  1:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Mon, Jun 07, 2021 at 03:25:10PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Allow administrators to control the length that we defer inode
> inactivation.  By default we'll set the delay to 2 seconds, as an
> arbitrary choice between allowing for some batching of a deltree
> operation, and not letting too many inodes pile up in memory.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  Documentation/admin-guide/xfs.rst |    7 +++++++
>  fs/xfs/xfs_globals.c              |    3 +++
>  fs/xfs/xfs_icache.c               |    3 ++-
>  fs/xfs/xfs_linux.h                |    1 +
>  fs/xfs/xfs_sysctl.c               |    9 +++++++++
>  fs/xfs/xfs_sysctl.h               |    1 +
>  6 files changed, 23 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
> index f9b109bfc6a6..9dd62b155fda 100644
> --- a/Documentation/admin-guide/xfs.rst
> +++ b/Documentation/admin-guide/xfs.rst
> @@ -277,6 +277,13 @@ The following sysctls are available for the XFS filesystem:
>  	references and returns timed-out AGs back to the free stream
>  	pool.
>  
> +  fs.xfs.inode_gc_delay
> +	(Units: centiseconds   Min: 0  Default: 1  Max: 360000)
> +	The amount of time to delay cleanup work that happens after a file is
> +	closed by all programs.  This involves clearing speculative
> +	preallocations from linked files and freeing unlinked files.  A higher
> +	value here increases batching at a risk of background work storms.

Can we make new timers use a sane unit of time like milliseconds?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low
  2021-06-07 22:25 ` [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low Darrick J. Wong
@ 2021-06-08  1:26   ` Dave Chinner
  2021-06-08 11:48     ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-06-08  1:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Mon, Jun 07, 2021 at 03:25:21PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Generally speaking, when a user calls fallocate, they're looking to
> preallocate space in a file in the largest contiguous chunks possible.
> If free space is low, it's possible that the free space will look
> unnecessarily fragmented because there are unlinked inodes that are
> holding on to space that we could allocate.  When this happens,
> fallocate makes suboptimal allocation decisions for the sake of deleted
> files, which doesn't make much sense, so scan the filesystem for dead
> items to delete to try to avoid this.
> 
> Note that there are a handful of fstests that fill a filesystem, delete
> just enough files to allow a single large allocation, and check that
> fallocate actually gets the allocation.  These tests regress because the
> test runs fallocate before the inode gc has a chance to run, so add this
> behavior to maintain as much of the old behavior as possible.

I don't think this is a good justification for the change. Just
because the unit tests exploit an undefined behaviour that no
filesystem actually guarantees to acheive a specific layout, it
doesn't mean we always have to behave that way.

For example, many tests used to use reverse sequential writes to
exploit deficiencies in the allocation algorithms to generate
fragmented files. We fixed that problem and the tests broke because
they couldn't fragment files any more.

Did we reject those changes because the tests broke? No, we didn't
because the tests were exploiting an observed behaviour rather than
a guaranteed behaviour.

So, yeah, "test does X to make Y happen" doesn't mean "X will always
make Y happen". It just means the test needs to be made more robust,
or we have to provide a way for the test to trigger the behaviour it
needs.

Indeed, I think that the way to fix these sorts of issues is to have
the tests issue a syncfs(2) after they've deleted the inodes and have
the filesystem run a inodegc flush as part of the sync mechanism.

Then we don't need to do.....

> +/*
> + * If the target device (or some part of it) is full enough that it won't to be
> + * able to satisfy the entire request, try to free inactive files to free up
> + * space.  While it's perfectly fine to fill a preallocation request with a
> + * bunch of short extents, we prefer to slow down preallocation requests to
> + * combat long term fragmentation in new file data.
> + */
> +static int
> +xfs_alloc_consolidate_freespace(
> +	struct xfs_inode	*ip,
> +	xfs_filblks_t		wanted)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_perag	*pag;
> +	struct xfs_sb		*sbp = &mp->m_sb;
> +	xfs_agnumber_t		agno;
> +
> +	if (!xfs_has_inodegc_work(mp))
> +		return 0;
> +
> +	if (XFS_IS_REALTIME_INODE(ip)) {
> +		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
> +			return 0;
> +		goto free_space;
> +	}
> +
> +	for_each_perag(mp, agno, pag) {
> +		if (pag->pagf_freeblks >= wanted) {
> +			xfs_perag_put(pag);
> +			return 0;
> +		}
> +	}

... really hurty things (e.g. on high AG count fs) on every fallocate()
call, and we have a simple modification to the tests that allow them
to work as they want to on both old and new kernels....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/9] xfs: expose sysfs knob to control inode inactivation delay
  2021-06-08  1:09   ` Dave Chinner
@ 2021-06-08  2:02     ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-08  2:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Tue, Jun 08, 2021 at 11:09:36AM +1000, Dave Chinner wrote:
> On Mon, Jun 07, 2021 at 03:25:10PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Allow administrators to control the length that we defer inode
> > inactivation.  By default we'll set the delay to 2 seconds, as an
> > arbitrary choice between allowing for some batching of a deltree
> > operation, and not letting too many inodes pile up in memory.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  Documentation/admin-guide/xfs.rst |    7 +++++++
> >  fs/xfs/xfs_globals.c              |    3 +++
> >  fs/xfs/xfs_icache.c               |    3 ++-
> >  fs/xfs/xfs_linux.h                |    1 +
> >  fs/xfs/xfs_sysctl.c               |    9 +++++++++
> >  fs/xfs/xfs_sysctl.h               |    1 +
> >  6 files changed, 23 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
> > index f9b109bfc6a6..9dd62b155fda 100644
> > --- a/Documentation/admin-guide/xfs.rst
> > +++ b/Documentation/admin-guide/xfs.rst
> > @@ -277,6 +277,13 @@ The following sysctls are available for the XFS filesystem:
> >  	references and returns timed-out AGs back to the free stream
> >  	pool.
> >  
> > +  fs.xfs.inode_gc_delay
> > +	(Units: centiseconds   Min: 0  Default: 1  Max: 360000)
> > +	The amount of time to delay cleanup work that happens after a file is
> > +	closed by all programs.  This involves clearing speculative
> > +	preallocations from linked files and freeing unlinked files.  A higher
> > +	value here increases batching at a risk of background work storms.
> 
> Can we make new timers use a sane unit of time like milliseconds?

Ok.  Changing the name to inode_gc_delay_ms to make the units obvious to
userspace.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/9] xfs: deferred inode inactivation
  2021-06-08  0:57   ` Dave Chinner
@ 2021-06-08  4:40     ` Darrick J. Wong
  2021-06-09  1:01       ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-08  4:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Tue, Jun 08, 2021 at 10:57:53AM +1000, Dave Chinner wrote:
> On Mon, Jun 07, 2021 at 03:25:04PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> > defer the inactivation phase to a separate workqueue.  With this we
> > avoid blocking memory reclaim on filesystem metadata updates that are
> > necessary to free an in-core inode, such as post-eof block freeing, COW
> > staging extent freeing, and truncating and freeing unlinked inodes.  Now
> > that work is deferred to a workqueue where we can do the freeing in
> > batches.
> > 
> > We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
> > The first flag helps our worker find inodes needing inactivation, and
> > the second flag marks inodes that are in the process of being
> > inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> > inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
> > 
> > Unfortunately, deferring the inactivation has one huge downside --
> > eventual consistency.  Since all the freeing is deferred to a worker
> > thread, one can rm a file but the space doesn't come back immediately.
> > This can cause some odd side effects with quota accounting and statfs,
> > so we also force inactivation scans in order to maintain the existing
> > behaviors, at least outwardly.
> > 
> > For this patch we'll set the delay to zero to mimic the old timing as
> > much as possible; in the next patch we'll play with different delay
> > settings.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  Documentation/admin-guide/xfs.rst |    3 
> >  fs/xfs/scrub/common.c             |    2 
> >  fs/xfs/xfs_fsops.c                |    4 
> >  fs/xfs/xfs_icache.c               |  364 +++++++++++++++++++++++++++++++++++--
> >  fs/xfs/xfs_icache.h               |   35 +++-
> >  fs/xfs/xfs_inode.c                |   60 ++++++
> >  fs/xfs/xfs_inode.h                |   15 +-
> >  fs/xfs/xfs_log_recover.c          |    7 +
> >  fs/xfs/xfs_mount.c                |   29 +++
> >  fs/xfs/xfs_mount.h                |    7 +
> >  fs/xfs/xfs_qm_syscalls.c          |    4 
> >  fs/xfs/xfs_super.c                |  120 +++++++++++-
> >  fs/xfs/xfs_trace.h                |   14 +
> >  13 files changed, 620 insertions(+), 44 deletions(-)
> 
> Big patch. Much as I don't like asking people to do this, I'd like
> you to split the "xfs_inode_needs_inactivation()" factoring out of
> this patch, just to reduce the amount of churn around the
> inactivation callout code in this patch.

Ok.  Eeeeons ago I think that /was/ split out, but fmeh, I'll yank this
out along with the early-dqdetach thing too.

> There's a couple of other changes around this that should reduce the
> churn, too...
> 
> > @@ -343,6 +345,8 @@ xfs_fs_counts(
> >  	xfs_mount_t		*mp,
> >  	xfs_fsop_counts_t	*cnt)
> >  {
> > +	xfs_inodegc_summary_flush(mp);
> 
> What does "summary flush" mean? Doesn't make much sense from here...
> 
> >  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
> >  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
> >  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 4d4aa61fbd34..791202236a18 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -32,6 +32,8 @@
> >  #define XFS_ICI_RECLAIM_TAG	0
> >  /* Inode has speculative preallocations (posteof or cow) to clean. */
> >  #define XFS_ICI_BLOCKGC_TAG	1
> > +/* Inode can be inactivated. */
> > +#define XFS_ICI_INODEGC_TAG	2
> >  
> >  /*
> >   * The goal for walking incore inodes.  These can correspond with incore inode
> > @@ -44,6 +46,7 @@ enum xfs_icwalk_goal {
> >  	/* Goals directly associated with tagged inodes. */
> >  	XFS_ICWALK_BLOCKGC	= XFS_ICI_BLOCKGC_TAG,
> >  	XFS_ICWALK_RECLAIM	= XFS_ICI_RECLAIM_TAG,
> > +	XFS_ICWALK_INODEGC	= XFS_ICI_INODEGC_TAG,
> >  };
> >  
> >  #define XFS_ICWALK_NULL_TAG	(-1U)
> > @@ -228,6 +231,26 @@ xfs_blockgc_queue(
> >  	rcu_read_unlock();
> >  }
> >  
> > +static inline bool
> > +xfs_inodegc_running(struct xfs_mount *mp)
> > +{
> > +	return test_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > +}
> 
> Ok, these opflags are new. Not a big fan of the naming, more on that
> later.
> 
> 
> > +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> > +static void
> > +xfs_inodegc_queue(
> > +	struct xfs_mount        *mp)
> > +{
> > +	if (!xfs_inodegc_running(mp))
> > +		return;
> > +
> > +	rcu_read_lock();
> > +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
> > +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> > +	rcu_read_unlock();
> > +}
> 
> I have no idea why we are checking if the gc is running here. All
> our other background stuff runs and re-queues until it is directly
> stopped or there's nothing left in the tree. Hence I'm a bit
> clueless right now about what this semantic is for...

The opflag is supposed to control whether inactivation actually runs.
As you might guess from _running calls being scattered everywhere, it's
pretty ugly code.  All this crap exists because there's no easy solution
to shutting down background threads after we commit to freezing the fs
but before an fs freeze hits SB_FREEZE_FS and we can't allocate new
transactions.

Fixing inactivation to use NO_WRITECOUNT means auditing every function
call that xfs_inactive makes to look for an xfs_trans_alloc* call and
probably modifying all of them to be able to switch between regular and
NOWRITECOUNT mode.  I tried that, it's ugly.

Another solution is to add ->freeze_super and ->thaw_super functions
to prevent a FITHAW caller from racing with a FIFREEZE caller and
accidentally rearming the inodegc while a freeze starts.

Yet a third solution is to fix the vfs to call us every time it wants to
progress to another freeze state.

A fourth solution is the fugly one you see here in syncfs.

> > +
> >  /* Set a tag on both the AG incore inode tree and the AG radix tree. */
> >  static void
> >  xfs_perag_set_inode_tag(
> > @@ -262,6 +285,9 @@ xfs_perag_set_inode_tag(
> >  	case XFS_ICI_BLOCKGC_TAG:
> >  		xfs_blockgc_queue(pag);
> >  		break;
> > +	case XFS_ICI_INODEGC_TAG:
> > +		xfs_inodegc_queue(mp);
> > +		break;
> >  	}
> 
> And there's the on-demand start...
> 
> > @@ -308,18 +334,28 @@ xfs_perag_clear_inode_tag(
> >   */
> >  void
> >  xfs_inode_mark_reclaimable(
> > -	struct xfs_inode	*ip)
> > +	struct xfs_inode	*ip,
> > +	bool			need_inactive)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	struct xfs_perag	*pag;
> > +	unsigned int		tag;
> >  
> >  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> >  	spin_lock(&pag->pag_ici_lock);
> >  	spin_lock(&ip->i_flags_lock);
> >  
> > -	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> > -			XFS_ICI_RECLAIM_TAG);
> > -	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> > +	if (need_inactive) {
> > +		trace_xfs_inode_set_need_inactive(ip);
> > +		ip->i_flags |= XFS_NEED_INACTIVE;
> > +		tag = XFS_ICI_INODEGC_TAG;
> > +	} else {
> > +		trace_xfs_inode_set_reclaimable(ip);
> > +		ip->i_flags |= XFS_IRECLAIMABLE;
> > +		tag = XFS_ICI_RECLAIM_TAG;
> > +	}
> > +
> > +	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
> 
> 
> Hmmmm. Rather than passing a boolean into this function that
> indicates what needs to be done, why not move all the inactivation
> stuff into this function? i.e. move the
> xfs_inode_needs_inactivation() check into here instead of splitting
> the inactivation and reclaim logic over xfs_fs_destroy_inode() and
> this function?

Done.

> >  
> >  	spin_unlock(&ip->i_flags_lock);
> >  	spin_unlock(&pag->pag_ici_lock);
> > @@ -383,19 +419,26 @@ xfs_reinit_inode(
> >  static int
> >  xfs_iget_recycle(
> >  	struct xfs_perag	*pag,
> > -	struct xfs_inode	*ip) __releases(&ip->i_flags_lock)
> > +	struct xfs_inode	*ip,
> > +	unsigned long		iflag) __releases(&ip->i_flags_lock)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	struct inode		*inode = VFS_I(ip);
> > +	unsigned int		tag;
> >  	int			error;
> >  
> > +	ASSERT(iflag == XFS_IRECLAIM || iflag == XFS_INACTIVATING);
> > +
> > +	tag = (iflag == XFS_INACTIVATING) ? XFS_ICI_INODEGC_TAG :
> > +					    XFS_ICI_RECLAIM_TAG;
> 
> I don't like ternaries in code like this - just use an if-else,
> or combine the assert into a switch:
> 
> 	switch(iflag) {
> 	case XFS_INACTIVATING:
> 		tag = XFS_ICI_INODEGC_TAG;
> 		break;
> 	case XFS_IRECLAIM:
> 		tag = XFS_ICI_RECLAIM_TAG;
> 		break;
> 	default:
> 		ASSERT(0);
> 		return -EINVAL;
> 	}

Changed.

> >  	/*
> >  	 * We need to make it look like the inode is being reclaimed to prevent
> >  	 * the actual reclaim workers from stomping over us while we recycle
> >  	 * the inode.  We can't clear the radix tree tag yet as it requires
> >  	 * pag_ici_lock to be held exclusive.
> >  	 */
> > -	ip->i_flags |= XFS_IRECLAIM;
> > +	ip->i_flags |= iflag;
> >  
> >  	spin_unlock(&ip->i_flags_lock);
> >  	rcu_read_unlock();
> > @@ -412,10 +455,13 @@ xfs_iget_recycle(
> >  		rcu_read_lock();
> >  		spin_lock(&ip->i_flags_lock);
> >  		wake = !!__xfs_iflags_test(ip, XFS_INEW);
> > -		ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
> > +		ip->i_flags &= ~(XFS_INEW | iflag);
> >  		if (wake)
> >  			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
> > -		ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
> > +		if (iflag == XFS_IRECLAIM)
> > +			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
> > +		if (iflag == XFS_INACTIVATING)
> > +			ASSERT(ip->i_flags & XFS_NEED_INACTIVE);
> >  		spin_unlock(&ip->i_flags_lock);
> >  		rcu_read_unlock();
> >  		return error;
> > @@ -431,8 +477,7 @@ xfs_iget_recycle(
> >  	 */
> >  	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
> >  	ip->i_flags |= XFS_INEW;
> > -	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> > -			XFS_ICI_RECLAIM_TAG);
> > +	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), tag);
> >  	inode->i_state = I_NEW;
> >  	spin_unlock(&ip->i_flags_lock);
> >  	spin_unlock(&pag->pag_ici_lock);
> > @@ -455,6 +500,13 @@ xfs_iget_check_free_state(
> >  	struct xfs_inode	*ip,
> >  	int			flags)
> >  {
> > +	/*
> > +	 * Unlinked inodes awaiting inactivation must not be reused until we
> > +	 * have a chance to clear the on-disk metadata.
> > +	 */
> > +	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
> > +		return -ENOENT;
> 
> Hmmmm. That's messy. The actual situation here is inodes that are on
> the unlinked list but have no VFS references need to be avoided.
> This should only happen in the cache hit case, so I don't think this
> belongs in xfs_iget_check_free_state() as that gets called from the
> cache miss case, too.

Agreed.

> Indeed, I think this is a case where we need to explicitly skip the
> inode in lookup, because we cannot actually recycle or re-use these
> inodes until they've been removed from the unlinked list. i.e. it's
> a primary selection criteria for a cache hit, not some that should
> be hidden in a separate function....

Ok, I'll hoist that to the top level of _cache_hit.

> > +
> >  	if (flags & XFS_IGET_CREATE) {
> >  		/* should be a free inode */
> >  		if (VFS_I(ip)->i_mode != 0) {
> > @@ -521,7 +573,7 @@ xfs_iget_cache_hit(
> >  	 *	     wait_on_inode to wait for these flags to be cleared
> >  	 *	     instead of polling for it.
> >  	 */
> > -	if (ip->i_flags & (XFS_INEW|XFS_IRECLAIM)) {
> > +	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING)) {
> >  		trace_xfs_iget_skip(ip);
> >  		XFS_STATS_INC(mp, xs_ig_frecycle);
> >  		error = -EAGAIN;
> > @@ -549,11 +601,29 @@ xfs_iget_cache_hit(
> >  		}
> >  
> >  		/* Drops i_flags_lock and RCU read lock. */
> > -		error = xfs_iget_recycle(pag, ip);
> > +		error = xfs_iget_recycle(pag, ip, XFS_IRECLAIM);
> >  		if (error) {
> >  			trace_xfs_iget_reclaim_fail(ip);
> >  			return error;
> >  		}
> > +	} else if (ip->i_flags & XFS_NEED_INACTIVE) {
> > +		/*
> > +		 * If NEED_INACTIVE is set, we've torn down the VFS inode
> > +		 * already, and must carefully restore it to usable state.
> > +		 */
> > +		trace_xfs_iget_inactive(ip);
> > +
> > +		if (flags & XFS_IGET_INCORE) {
> > +			error = -EAGAIN;
> > +			goto out_error;
> > +		}
> 
> And that's also a primary selection criteria. :)
> 
> > +
> > +		/* Drops i_flags_lock and RCU read lock. */
> > +		error = xfs_iget_recycle(pag, ip, XFS_INACTIVATING);
> > +		if (error) {
> > +			trace_xfs_iget_inactive_fail(ip);
> > +			return error;
> > +		}
> >  	} else {
> >  		/* If the VFS inode is being torn down, pause and try again. */
> >  		if (!igrab(inode)) {
> 
> Overall, I think this is kinda messy in that it smears the logic
> boundaries in the function. The cache hit code is structured as
> 
> 	check inode validity
> 	  skip inode
> 	check inode reusuability state
> 	  skip inode
> 	check inode reclaim state
> 	  recycle inode
> 	or
> 	  grab VFS inode
> 
> 
> I think it would be better to restructure this to end up looking
> like this:
> 
> 	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING))
> 		goto out_skip;
> 
> 	if ((flags & XFS_IGET_INCORE) &&
> 	    (ip->i_flags & (XFS_IRECLAIMABLE | XFS_NEED_INACTIVE)))
> 		goto out_skip;
> 
> 	error = xfs_iget_check_free_state(ip, flags);
> 	if (error)
> 		goto out_error;
> 
> 	if (ip->i_flags & (XFS_IRECLAIMABLE | XFS_NEED_INACTIVE)) {
> 		trace_xfs_iget_recycle(ip);
> 
> 		/* Drops i_flags_lock and RCU read lock. */
> 		error = xfs_iget_recycle(pag, ip);
> 		if (error) {
> 			trace_xfs_iget_recycle_fail(ip);
> 			return error;
> 		}
> 	} else {
> 		/* igrab */
> 	}
> ....
> out_skip:
> 	trace_xfs_iget_skip(ip);
> 	error = -EAGAIN;
> out_error:
> 	....
> }
> 
> Where the details of what to do to recycle the inode is handled
> entirely within xfs_iget_recycle(), rather than splitting the logic
> over two functions. We don't need separate trace points for reclaim
> vs inactivation recycling - we've got that information in the inode
> flags that the trace point should be emitting.
> 
> The above gives us a much cleaner cache hit path, and gets all of
> the slow path stuff (recycling inodes) out of the normal lookup
> path.

<nod> Done.

> 
> 
> > @@ -845,22 +915,33 @@ xfs_dqrele_igrab(
> >  
> >  	/*
> >  	 * Skip inodes that are anywhere in the reclaim machinery because we
> > -	 * drop dquots before tagging an inode for reclamation.
> > +	 * drop dquots before tagging an inode for reclamation.  If the inode
> > +	 * is being inactivated, skip it because inactivation will drop the
> > +	 * dquots for us.
> >  	 */
> > -	if (ip->i_flags & (XFS_IRECLAIM | XFS_IRECLAIMABLE))
> > +	if (ip->i_flags & (XFS_IRECLAIM | XFS_IRECLAIMABLE | XFS_INACTIVATING))
> >  		goto out_unlock;
> >  
> >  	/*
> > -	 * The inode looks alive; try to grab a VFS reference so that it won't
> > -	 * get destroyed.  If we got the reference, return true to say that
> > -	 * we grabbed the inode.
> > +	 * If the inode is queued but not undergoing inactivation, set the
> > +	 * inactivating flag so everyone will leave it alone and return true
> > +	 * to say that we are taking ownership of it.
> > +	 *
> > +	 * Otherwise, the inode looks alive; try to grab a VFS reference so
> > +	 * that it won't get destroyed.  If we got the reference, return true
> > +	 * to say that we grabbed the inode.
> >  	 *
> >  	 * If we can't get the reference, then we know the inode had its VFS
> >  	 * state torn down and hasn't yet entered the reclaim machinery.  Since
> >  	 * we also know that dquots are detached from an inode before it enters
> >  	 * reclaim, we can skip the inode.
> >  	 */
> > -	ret = igrab(VFS_I(ip)) != NULL;
> > +	if (ip->i_flags & XFS_NEED_INACTIVE) {
> > +		ip->i_flags |= XFS_INACTIVATING;
> > +		ret = true;
> > +	} else {
> > +		ret = igrab(VFS_I(ip)) != NULL;
> > +	}
> 
> 	ret = true;
> 	if (ip->i_flags & XFS_NEED_INACTIVE)
> 		ip->i_flags |= XFS_INACTIVATING;
> 	else if (!igrab(VFS_I(ip)))
> 		ret = false;

Changed.

> >  /* Don't try to run block gc on an inode that's in any of these states. */
> >  #define XFS_BLOCKGC_NOGRAB_IFLAGS	(XFS_INEW | \
> > +					 XFS_NEED_INACTIVE | \
> > +					 XFS_INACTIVATING | \
> >  					 XFS_IRECLAIMABLE | \
> >  					 XFS_IRECLAIM)
> >  /*
> > @@ -1636,6 +1734,229 @@ xfs_blockgc_free_quota(
> >  			xfs_inode_dquot(ip, XFS_DQTYPE_PROJ), iwalk_flags);
> >  }
> >  
> > +/*
> > + * Inode Inactivation and Reclaimation
> > + * ===================================
> > + *
> > + * Sometimes, inodes need to have work done on them once the last program has
> > + * closed the file.  Typically this means cleaning out any leftover speculative
> > + * preallocations after EOF or in the CoW fork.  For inodes that have been
> > + * totally unlinked, this means unmapping data/attr/cow blocks, removing the
> > + * inode from the unlinked buckets, and marking it free in the inobt and inode
> > + * table.
> > + *
> > + * This process can generate many metadata updates, which shows up as close()
> > + * and unlink() calls that take a long time.  We defer all that work to a
> > + * workqueue which means that we can batch a lot of work and do it in inode
> > + * order for better performance.  Furthermore, we can control the workqueue,
> > + * which means that we can avoid doing inactivation work at a bad time, such as
> > + * when the fs is frozen.
> > + *
> > + * Deferred inactivation introduces new inode flag states (NEED_INACTIVE and
> > + * INACTIVATING) and adds a new INODEGC radix tree tag for fast access.  We
> > + * maintain separate perag counters for both types, and move counts as inodes
> > + * wander the state machine, which now works as follows:
> > + *
> > + * If the inode needs inactivation, we:
> > + *   - Set the NEED_INACTIVE inode flag
> > + *   - Increment the per-AG inactive count
> > + *   - Set the ICI_INODEGC tag in the per-AG inode tree
> > + *   - Set the ICI_INODEGC tag in the per-fs AG tree
> > + *   - Schedule background inode inactivation
> > + *
> > + * If the inode does not need inactivation, we:
> > + *   - Set the IRECLAIMABLE inode flag
> > + *   - Increment the per-AG reclaim count
> > + *   - Set the ICI_RECLAIM tag in the per-AG inode tree
> > + *   - Set the ICI_RECLAIM tag in the per-fs AG tree
> > + *   - Schedule background inode reclamation
> > + *
> > + * When it is time for background inode inactivation, we:
> > + *   - Set the INACTIVATING inode flag
> > + *   - Make all the on-disk updates
> > + *   - Clear both INACTIVATING and NEED_INACTIVE inode flags
> > + *   - Decrement the per-AG inactive count
> > + *   - Clear the ICI_INODEGC tag in the per-AG inode tree
> > + *   - Clear the ICI_INODEGC tag in the per-fs AG tree if we just inactivated
> > + *     the last inode in the AG
> > + *   - Kick the inode into reclamation per the previous paragraph
> > + *
> > + * When it is time for background inode reclamation, we:
> > + *   - Set the IRECLAIM inode flag
> > + *   - Detach all the resources and remove the inode from the per-AG inode tree
> > + *   - Clear both IRECLAIM and RECLAIMABLE inode flags
> 
> That's wrong - we never clear the IRECLAIM state on a reclaimed
> inode - it is carried into the slab cache when it is freed so that
> racing RCU lookups will always see the IRECLAIM state and skip the
> inode and retry.

Good catch.  I didn't notice that and it's been like 2 years.

> > + *   - Decrement the per-AG reclaim count
> > + *   - Clear the ICI_RECLAIM tag from the per-AG inode tree
> > + *   - Clear the ICI_RECLAIM tag from the per-fs AG tree if we just reclaimed
> > + *     the last inode in the AG
> > + *
> > + * When these state transitions occur, the caller must have taken the per-AG
> > + * incore inode tree lock and then the inode i_flags lock, in that order.
> > + */
> 
> While the comment is good, describing what the code does is just
> going to get out of date as we modify this in future. I'd drop all
> the description of inode/perAG tag and tracking management and just
> replace them with:
> 
>  * If the inode needs inactivation, we:
>  *   - Set the NEED_INACTIVE inode flag
>  *   - Schedule background inode inactivation
>  *
>  * If the inode does not need inactivation, we:
>  *   - Set the IRECLAIMABLE inode flag
>  *   - Schedule background inode reclamation
>  *
>  * If the inode is being inactivated, we:
>  *   - Set the INACTIVATING inode flag
>  *   - Make all the on-disk updates
>  *   - Clear the inactive state and set the IRECLAIMABLE inode flag
>  *   - Schedule background inode reclamation
>  *
>  * If the inode is being reclaimed, we:
>  *   - Set the IRECLAIM inode flag
>  *   - Reclaim the inode and RCU free it.

Changed.

> > +/*
> > + * Decide if the given @ip is eligible for inactivation, and grab it if so.
> > + * Returns true if it's ready to go or false if we should just ignore it.
> > + */
> > +static bool
> > +xfs_inodegc_igrab(
> > +	struct xfs_inode	*ip)
> > +{
> > +	ASSERT(rcu_read_lock_held());
> > +
> > +	/* Check for stale RCU freed inode */
> > +	spin_lock(&ip->i_flags_lock);
> > +	if (!ip->i_ino)
> > +		goto out_unlock_noent;
> > +
> > +	/*
> > +	 * Skip inodes that don't need inactivation or are being inactivated
> > +	 * (or reactivated) by another thread.  Inodes should not be tagged
> > +	 * for inactivation while also in INEW or any reclaim state.
> > +	 */
> > +	if (!(ip->i_flags & XFS_NEED_INACTIVE) ||
> > +	    (ip->i_flags & XFS_INACTIVATING))
> > +		goto out_unlock_noent;
> > +
> > +	/*
> > +	 * Mark this inode as being inactivated even if the fs is shut down
> > +	 * because we need xfs_inodegc_inactivate to push this inode into the
> > +	 * reclaim state.
> > +	 */
> 
> These two comments really should go at the head of the function.
> i.e. if you define what "eligible for inactivation" where it is
> stated, then you don't need these comments in the code.

Ok.

> > +	ip->i_flags |= XFS_INACTIVATING;
> > +	spin_unlock(&ip->i_flags_lock);
> > +	return true;
> > +
> > +out_unlock_noent:
> > +	spin_unlock(&ip->i_flags_lock);
> > +	return false;
> > +}
> > +
> > +/*
> > + * Free all speculative preallocations and possibly even the inode itself.
> > + * This is the last chance to make changes to an otherwise unreferenced file
> > + * before incore reclamation happens.
> > + */
> > +static int
> > +xfs_inodegc_inactivate(
> > +	struct xfs_inode	*ip,
> > +	struct xfs_perag	*pag,
> > +	struct xfs_icwalk	*icw)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> > +
> > +	/*
> > +	 * Inactivation isn't supposed to run when the fs is frozen because
> > +	 * we don't want kernel threads to block on transaction allocation.
> > +	 */
> > +	ASSERT(mp->m_super->s_writers.frozen < SB_FREEZE_FS);
> > +
> > +	/*
> > +	 * Foreground threads that have hit ENOSPC or EDQUOT are allowed to
> > +	 * pass in a icw structure to look for inodes to inactivate
> > +	 * immediately to free some resources.  If this inode isn't a match,
> > +	 * put it back on the shelf and move on.
> > +	 */
> > +	spin_lock(&ip->i_flags_lock);
> > +	if (!xfs_icwalk_match(ip, icw)) {
> > +		ip->i_flags &= ~XFS_INACTIVATING;
> > +		spin_unlock(&ip->i_flags_lock);
> > +		return 0;
> > +	}
> > +	spin_unlock(&ip->i_flags_lock);
> > +
> > +	trace_xfs_inode_inactivating(ip);
> > +
> > +	xfs_inactive(ip);
> > +	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> > +
> > +	/*
> > +	 * Move the inode from the inactivation phase to the reclamation phase
> > +	 * by clearing both inactivation inode state flags and marking the
> > +	 * inode reclaimable.  Schedule background reclaim to run later.
> > +	 */
> 
> Don't describe the code in the comment.
> 
> 	/* Now schedule the inactivated inode for reclaim. */

Fixed, thanks.

> > +	spin_lock(&pag->pag_ici_lock);
> > +	spin_lock(&ip->i_flags_lock);
> > +
> > +	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> > +	ip->i_flags |= XFS_IRECLAIMABLE;
> > +
> > +	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_INODEGC_TAG);
> > +	xfs_perag_set_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
> > +
> > +	spin_unlock(&ip->i_flags_lock);
> > +	spin_unlock(&pag->pag_ici_lock);
> > +
> > +	return 0;
> > +}
> > +
> > +/* Walk the fs and inactivate the inodes in them. */
> > +int
> > +xfs_inodegc_free_space(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_icwalk	*icw)
> > +{
> > +	trace_xfs_inodegc_free_space(mp, icw, _RET_IP_);
> > +
> > +	return xfs_icwalk(mp, XFS_ICWALK_INODEGC, icw);
> > +}
> > +
> > +/* Background inode inactivation worker. */
> > +void
> > +xfs_inodegc_worker(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_mount	*mp = container_of(to_delayed_work(work),
> > +					struct xfs_mount, m_inodegc_work);
> > +	int			error;
> > +
> > +	/*
> > +	 * Queueing of this inodegc worker can race with xfs_inodegc_stop,
> > +	 * which means that we can be running after the opflag clears.  Double
> > +	 * check the flag state so that we don't trip asserts.
> > +	 */
> > +	if (!xfs_inodegc_running(mp))
> > +		return;
> > +
> > +	error = xfs_inodegc_free_space(mp, NULL);
> > +	if (error && error != -EAGAIN)
> > +		xfs_err(mp, "inode inactivation failed, error %d", error);
> > +
> > +	xfs_inodegc_queue(mp);
> > +}
> > +
> > +/* Force all currently queued inode inactivation work to run immediately. */
> > +void
> > +xfs_inodegc_flush(
> > +	struct xfs_mount	*mp)
> > +{
> > +	if (!xfs_inodegc_running(mp) ||
> > +	    !radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
> > +		return;
> > +
> > +	mod_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> > +	flush_delayed_work(&mp->m_inodegc_work);
> 
> Doesn't flush_delayed_work() immediately schedule any delayed work?
> 
> Yup, it does:
> 
> /**                                                                              
>  * flush_delayed_work - wait for a dwork to finish executing the last queueing   
>  * @dwork: the delayed work to flush                                             
>  *                                                                               
>  * Delayed timer is cancelled and the pending work is queued for                 
>  * immediate execution.  Like flush_work(), this function only                   
>  * considers the last queueing instance of @dwork.                               
> ....
> 
> So no need for the mod_delayed_work() call here.
> 
> Also, if the gc is not running or there's nothing in the radix tree,
> there is no queued work and so just calling flush_delayed_work()
> would be a no-op, right?

Yeah, that's what the documentation says.

At some point in the last two years I encountered a bug with this
patchset where unmounts would stall forever because the
flush_delayed_work didn't actually requeue the deferred work for
immediate scheduling.  I guess I don't need it anymore.

> > +}
> > +
> > +/* Stop all queued inactivation work. */
> > +void
> > +xfs_inodegc_stop(
> > +	struct xfs_mount	*mp)
> > +{
> > +	clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> > +}
> 
> what's to stop racing invocations of stop/start? Perhaps:
> 
> 	if (test_and_clear_bit())
> 		cancel_delayed_work_sync(&mp->m_inodegc_work);

That horrible hack below.

> > +
> > +/* Schedule deferred inode inactivation work. */
> > +void
> > +xfs_inodegc_start(
> > +	struct xfs_mount	*mp)
> > +{
> > +	set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > +	xfs_inodegc_queue(mp);
> > +}
> 
> 	if (test_and_set_bit())
> 		xfs_inodegc_queue(mp);
> 
> So that the running state will remain in sync with the actual queue
> operation? Though I'm still not sure why we need the running bit...

(see ugly sync_fs SB_FREEZE_PAGEFAULTS hack)

> > @@ -80,4 +80,37 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
> >  void xfs_blockgc_stop(struct xfs_mount *mp);
> >  void xfs_blockgc_start(struct xfs_mount *mp);
> >  
> > +void xfs_inodegc_worker(struct work_struct *work);
> > +void xfs_inodegc_flush(struct xfs_mount *mp);
> > +void xfs_inodegc_stop(struct xfs_mount *mp);
> > +void xfs_inodegc_start(struct xfs_mount *mp);
> > +int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
> > +
> > +/*
> > + * Process all pending inode inactivations immediately (sort of) so that a
> > + * resource usage report will be mostly accurate with regards to files that
> > + * have been unlinked recently.
> > + *
> > + * It isn't practical to maintain a count of the resources used by unlinked
> > + * inodes to adjust the values reported by this function.  Resources that are
> > + * shared (e.g. reflink) when an inode is queued for inactivation cannot be
> > + * counted towards the adjustment, and cross referencing data extents with the
> > + * refcount btree is the only way to decide if a resource is shared.  Worse,
> > + * unsharing of any data blocks in the system requires either a second
> > + * consultation with the refcount btree, or training users to deal with the
> > + * free space counts possibly fluctuating upwards as inactivations occur.
> > + *
> > + * Hence we guard the inactivation flush with a ratelimiter so that the counts
> > + * are not way out of whack while ignoring workloads that hammer us with statfs
> > + * calls.  Once per clock tick seems frequent enough to avoid complaints about
> > + * inaccurate counts.
> > + */
> > +static inline void
> > +xfs_inodegc_summary_flush(
> > +	struct xfs_mount	*mp)
> > +{
> > +	if (__ratelimit(&mp->m_inodegc_ratelimit))
> > +		xfs_inodegc_flush(mp);
> > +}
> 
> ONce per clock tick is still quite fast - once a millisecond on a
> 1000Hz kernel. I'd prefer that we use a known timer base for this
> sort of thing, not something that changes with kernel config. Every
> 100ms, perhaps?

I tried a number of ratelimit values here: 1ms, 4ms, 10ms, 100ms, 500ms,
2s, and 15s.  fstests and most everything else seemed to act the same up
to 10ms.  At 100ms, the tests that delete something and immediately run
df will start to fail, and above that all hell breaks loose because many
tests (from which I extrapolate most programmers) expect that statfs
won't run until unlink has deleted everything.

> > @@ -937,9 +953,9 @@ xfs_mountfs(
> >  	/* Clean out dquots that might be in memory after quotacheck. */
> >  	xfs_qm_unmount(mp);
> >  	/*
> > -	 * Flush all inode reclamation work and flush the log.
> > -	 * We have to do this /after/ rtunmount and qm_unmount because those
> > -	 * two will have scheduled delayed reclaim for the rt/quota inodes.
> > +	 * Flush all inode reclamation work and flush inodes to the log.  Do
> > +	 * this after rtunmount and qm_unmount because those two will have
> > +	 * released the rt and quota inodes.
> >  	 *
> >  	 * This is slightly different from the unmountfs call sequence
> >  	 * because we could be tearing down a partially set up mount.  In
> > @@ -947,6 +963,7 @@ xfs_mountfs(
> >  	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
> >  	 * quota inodes.
> >  	 */
> > +	xfs_inodegc_flush(mp);
> >  	xfs_unmount_flush_inodes(mp);
> 
> Why isn't xfs_inodegc_flush() part of xfs_unmount_flush_inodes()?
> Because, really, xfs_unmount_flush_inodes() depends on all the
> inodes first being inactivated so that all transactions on inodes
> are complete....

The teardown sequence is not the same between a regular unmount and an
aborted mount...

> >   out_log_dealloc:
> >  	xfs_log_mount_cancel(mp);
> > @@ -983,6 +1000,12 @@ xfs_unmountfs(
> >  	uint64_t		resblks;
> >  	int			error;
> >  
> > +	/*
> > +	 * Flush all the queued inode inactivation work to disk before tearing
> > +	 * down rt metadata and quotas.
> > +	 */
> > +	xfs_inodegc_flush(mp);
> > +
> >  	xfs_blockgc_stop(mp);
> >  	xfs_fs_unreserve_ag_blocks(mp);
> >  	xfs_qm_unmount_quotas(mp);
> 
> FWIW, there's inconsistency in the order of operations between
> failure handling in xfs_mountfs() w.r.t. inode flushing and quotas
> vs what xfs_unmountfs() is now doing....

...because during regular unmountfs, we want to inactivate inodes while
we still have a per-ag reservation protecting finobt expansions.  During
an aborted mount, we don't necessarily have the reservation set up but
we have to clean everything out, so the inodegc flush comes much later.

It's convoluted, but do you want me to clean /that/ up too?  That's a
pretty heavy lift; I already tried to fix those two paths, ran out of
brain cells, and gave up.

> >  	uint8_t			m_rt_checked;
> >  	uint8_t			m_rt_sick;
> >  
> > +	unsigned long		m_opflags;
> > +
> >  	/*
> >  	 * End of read-mostly variables. Frequently written variables and locks
> >  	 * should be placed below this comment from now on. The first variable
> > @@ -184,6 +186,7 @@ typedef struct xfs_mount {
> >  	uint64_t		m_resblks_avail;/* available reserved blocks */
> >  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> >  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
> > +	struct delayed_work	m_inodegc_work; /* background inode inactive */
> >  	struct xfs_kobj		m_kobj;
> >  	struct xfs_kobj		m_error_kobj;
> >  	struct xfs_kobj		m_error_meta_kobj;
> > @@ -220,6 +223,7 @@ typedef struct xfs_mount {
> >  	unsigned int		*m_errortag;
> >  	struct xfs_kobj		m_errortag_kobj;
> >  #endif
> > +	struct ratelimit_state	m_inodegc_ratelimit;
> >  } xfs_mount_t;
> >  
> >  #define M_IGEO(mp)		(&(mp)->m_ino_geo)
> > @@ -258,6 +262,9 @@ typedef struct xfs_mount {
> >  #define XFS_MOUNT_DAX_ALWAYS	(1ULL << 26)
> >  #define XFS_MOUNT_DAX_NEVER	(1ULL << 27)
> >  
> > +#define XFS_OPFLAG_INODEGC_RUNNING_BIT	(0)	/* are we allowed to start the
> > +						   inode inactivation worker? */
> > +
> Ok, "opflags" are undocumented as to how they work, what their
> consistency model is, etc. I understand you want an atomic flag to
> indicate that something is running, and mp->m_flags is not that
> (despite being used that way historically). 
> 
> I dislike the "_BIT" annotations for a variable that is only to be
> used as an index bit field. Or maybe it's a flag field and you
> haven't defined any bitwise flags for it because you're not using it
> that way yet.
> 
> So, is m_opflags an indexed bit field or a normal flag field like
> m_flags?

It's an indexed bit field, which is why I named it _BIT.  I'll try to
add more documentation around what this thing is and how the flags work:

struct xfs_mount {
	...
	/*
	 * This atomic bitset controls flags that alter the behavior of
	 * the filesystem.  Use only the atomic bit helper functions
	 * here; see XFS_OPFLAG_* for information about the actual
	 * flags.
	 */
	unsigned long		m_opflags;
	...
};

/*
 * Operation flags -- each entry here is a bit index into m_opflags and
 * is not itself a flag value.
 */

/* Are we allowed to run the inode inactivation worker? */
#define XFS_OPFLAG_INODEGC_RUNNING_BIT	(0)

> 
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 3a7fd4f02aa7..120a4426fd64 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -629,6 +629,43 @@ xfs_check_delalloc(
> >  #define xfs_check_delalloc(ip, whichfork)	do { } while (0)
> >  #endif
> >  
> > +#ifdef CONFIG_XFS_QUOTA
> > +/*
> > + * If a quota type is turned off but we still have a dquot attached to the
> > + * inode, detach it before tagging this inode for inactivation (or reclaim) to
> > + * avoid delaying quotaoff for longer than is necessary.  Because the inode has
> > + * no VFS state and has not yet been tagged for reclaim or inactivation, it is
> > + * safe to drop the dquots locklessly because iget, quotaoff, blockgc, and
> > + * reclaim will not touch the inode.
> > + */
> > +static inline void
> > +xfs_fs_dqdestroy_inode(
> > +	struct xfs_inode	*ip)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +
> > +	if (!XFS_IS_UQUOTA_ON(mp)) {
> > +		xfs_qm_dqrele(ip->i_udquot);
> > +		ip->i_udquot = NULL;
> > +	}
> > +	if (!XFS_IS_GQUOTA_ON(mp)) {
> > +		xfs_qm_dqrele(ip->i_gdquot);
> > +		ip->i_gdquot = NULL;
> > +	}
> > +	if (!XFS_IS_PQUOTA_ON(mp)) {
> > +		xfs_qm_dqrele(ip->i_pdquot);
> > +		ip->i_pdquot = NULL;
> > +	}
> > +}
> > +#else
> > +# define xfs_fs_dqdestroy_inode(ip)		((void)0)
> > +#endif
> 
> This should sit alongside xfs_qm_dqdetach() in the quota code, not
> require additional #ifdef CONFIG_XFS_QUOTA blocks in the code here.
> 
> This could also be split out into a separate patch to reduce the
> size of this one...

Done.

> > +
> > +/* iflags that we shouldn't see before scheduling reclaim or inactivation. */
> > +#define XFS_IDESTROY_BAD_IFLAGS	(XFS_IRECLAIMABLE | \
> > +				 XFS_IRECLAIM | \
> > +				 XFS_NEED_INACTIVE | \
> > +				 XFS_INACTIVATING)
> >  /*
> >   * Now that the generic code is guaranteed not to be accessing
> >   * the linux inode, we can inactivate and reclaim the inode.
> > @@ -638,28 +675,44 @@ xfs_fs_destroy_inode(
> >  	struct inode		*inode)
> >  {
> >  	struct xfs_inode	*ip = XFS_I(inode);
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	bool			need_inactive;
> >  
> >  	trace_xfs_destroy_inode(ip);
> >  
> >  	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> > -	XFS_STATS_INC(ip->i_mount, vn_rele);
> > -	XFS_STATS_INC(ip->i_mount, vn_remove);
> > +	XFS_STATS_INC(mp, vn_rele);
> > +	XFS_STATS_INC(mp, vn_remove);
> >  
> > -	xfs_inactive(ip);
> > +	need_inactive = xfs_inode_needs_inactivation(ip);
> > +	if (!need_inactive) {
> > +		/*
> > +		 * If the inode doesn't need inactivation, that means we're
> > +		 * going directly into reclaim and can drop the dquots.  It
> > +		 * also means that there shouldn't be any delalloc reservations
> > +		 * or speculative CoW preallocations remaining.
> > +		 */
> > +		xfs_qm_dqdetach(ip);
> >  
> > -	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
> > -		xfs_check_delalloc(ip, XFS_DATA_FORK);
> > -		xfs_check_delalloc(ip, XFS_COW_FORK);
> > -		ASSERT(0);
> > +		if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
> > +			xfs_check_delalloc(ip, XFS_DATA_FORK);
> > +			xfs_check_delalloc(ip, XFS_COW_FORK);
> > +			ASSERT(0);
> > +		}
> 
> These checks should happen when the inode is marked IRECLAIMABLE,
> not here. This comes back to my comments about moving all the
> "need_inactive" logic into xfs_inode_mark_reclaimable() - this
> needs to be checked after inactivation, not just here where the
> inode doesn't require inactivation....

Ok.

> IOWs, xfs_fs_destroy_inode() should largely become a shell function
> with accounting and state checks, and not much else...

Done.  I've moved most of the code from xfs_fs_destroy_inode into
xfs_inode_mark_reclaimable.  The delalloc stuff is checked here if we're
moving the inode straight into reclaim, and also at the point when the
inode moves from INACTIVATING -> IRECLAIMABLE.

> >  static void
> > @@ -780,6 +833,21 @@ xfs_fs_sync_fs(
> >  		flush_delayed_work(&mp->m_log->l_work);
> >  	}
> >  
> > +	/*
> > +	 * If the fs is at FREEZE_PAGEFAULTS, that means the VFS holds the
> > +	 * umount mutex and is syncing the filesystem just before setting the
> > +	 * state to FREEZE_FS.  We are not allowed to run transactions on a
> > +	 * filesystem that is in FREEZE_FS state, so deactivate the background
> > +	 * workers before we get there, and leave them off for the duration of
> > +	 * the freeze.
> > +	 *
> > +	 * We can't do this in xfs_fs_freeze_super because freeze_super takes
> > +	 * s_umount, which means we can't lock out a concurrent thaw request
> > +	 * without adding another layer of locks to the freeze process.
> > +	 */
> > +	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT)
> > +		xfs_inodegc_stop(mp);
> 
> Damn, that's an ugly hack.

Yep.

> SO the problem is that xfs_fs_freeze() is called after SB_FREEZE_FS
> is set and so transactions won't run, hence it's too late to stop
> the inode gc from running? i.e. it might already be blocked on
> sb_start_intwrite() in xfs_trans_alloc()?
> 
> Which tends to imply that inactivation transactions should use
> XFS_TRANS_NO_WRITECOUNT and then xfs_inodegc_stop() blocks waiting
> for all the inactivation transactions to complete in xfs_fs_freeze()
> before we do anything else.

I tried that, and it got ugly pretty quickly, because xfs_inactive does
not itself create transactions, which means that xfs_qm_dqattach and
xfs_reflink_cancel_cow_range and xfs_free_eofblocks have to be taught
when they need to pass NO_WRITECOUNT.  The first one probably never
needs to run a transaction, but the other two do, and they get called
from other non-reclaim of the filesystem too.

> I'd prefer we have a formal mechanism for handling this -
> inactivation is something unknown to and hidden underneath the VFS,
> so it's not considered in the VFS freeze mechanisms. Hence I think
> it's fine to use our own mechanism in xfs_fs_freeze() to synchronise
> as we need to....

I also uncovered a bug in log recovery when someone sets up an xfs to
host some vm images, starts some work, and the host xfs goes down right
after someone unlinks an image file that shares extents.

Basically, log recovery replays transactions and all the unfinished
extents.  If log recovery also recovers the unlink operation, it'll drop
the nlink on that image file to zero and set IRECOVERY.  Once recovery
is complete, we reclaim the entire inode cache.  At that point we'll
inactivate the IRECOVERY inode, which (since it's link count is zero)
means we have to bunmapi the whole thing.  This requires us to drop
refcounts on the unlinked file, which can in turn cause a refcount btree
shape change.  The xfs_inactive() transaction doesn't reserve any blocks
and we haven't set up per-AG reservations yet, so we trip over the
blk_res_used > blk_res assertion during commit.

It's not related to this patch, but might need fixing soon.

> 
> > +
> >  	return 0;
> >  }
> >  
> > @@ -798,6 +866,8 @@ xfs_fs_statfs(
> >  	xfs_extlen_t		lsize;
> >  	int64_t			ffree;
> >  
> > +	xfs_inodegc_summary_flush(mp);
> 
> I suspect that's really going to hurt stat performance. I guess
> benchmarks are in order...

Ok, so here's the question I have: Does POSIX (or any other standard we
care to fit) actually define any behavior between the following
sequence:

unlink("/foo");	/* no sync calls of any kind */
statfs("/");

As I've mentioned in the past, the only reason we need these inodegc
flushes for summary reporting is because users expect that if they
delete an unshared 10GB file, they can immediately df and see that the
inode count went down by one and free space went up by 10GB.

I /guess/ we could modify every single fstest to sync before checking
summary counts instead of doing this, but I bet there will be some users
who will be surprised that xfs now has *trfs df logic.

> > @@ -908,10 +978,27 @@ xfs_fs_unfreeze(
> >  
> >  	xfs_restore_resvblks(mp);
> >  	xfs_log_work_queue(mp);
> > +	xfs_inodegc_start(mp);
> >  	xfs_blockgc_start(mp);
> >  	return 0;
> >  }
> >  
> > +STATIC int
> > +xfs_fs_freeze_super(
> > +	struct super_block	*sb)
> > +{
> > +	struct xfs_mount	*mp = XFS_M(sb);
> > +
> > +	/*
> > +	 * Before we take s_umount to get to FREEZE_WRITE, flush all the
> > +	 * accumulated background work so that there's less recovery work
> > +	 * to do if we crash during the freeze.
> > +	 */
> > +	xfs_inodegc_flush(mp);
> > +
> > +	return freeze_super(sb);
> > +}
> > +
> >  /*
> >   * This function fills in xfs_mount_t fields based on mount args.
> >   * Note: the superblock _has_ now been read in.
> > @@ -1090,6 +1177,7 @@ static const struct super_operations xfs_super_operations = {
> >  	.show_options		= xfs_fs_show_options,
> >  	.nr_cached_objects	= xfs_fs_nr_cached_objects,
> >  	.free_cached_objects	= xfs_fs_free_cached_objects,
> > +	.freeze_super		= xfs_fs_freeze_super,
> >  };
> 
> Ugh, so we have high level freeze control, but not low level.
> Really, if we have to stop inode gc while freezing, then I'd prefer
> we either do it here or in xfs_fs_freeze() than using the hack
> you've got in place now...
> 
> >  
> >  static int
> > @@ -1737,6 +1825,13 @@ xfs_remount_ro(
> >  		return error;
> >  	}
> >  
> > +	/*
> > +	 * Perform all on-disk metadata updates required to inactivate inodes.
> > +	 * Since this can involve finobt updates, do it now before we lose the
> > +	 * per-AG space reservations to guarantee that we won't fail there.
> > +	 */
> > +	xfs_inodegc_flush(mp);
> > +
> 
> Ummm - from this point onwards there are no modifications for
> inactivation - why not just turn inodegc off completely?

Oops.  Yes, you're right, we already do that for blockgc, so we might as
well do that for inodegc too.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low
  2021-06-08  1:26   ` Dave Chinner
@ 2021-06-08 11:48     ` Brian Foster
  2021-06-08 15:32       ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2021-06-08 11:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, hch

On Tue, Jun 08, 2021 at 11:26:05AM +1000, Dave Chinner wrote:
> On Mon, Jun 07, 2021 at 03:25:21PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Generally speaking, when a user calls fallocate, they're looking to
> > preallocate space in a file in the largest contiguous chunks possible.
> > If free space is low, it's possible that the free space will look
> > unnecessarily fragmented because there are unlinked inodes that are
> > holding on to space that we could allocate.  When this happens,
> > fallocate makes suboptimal allocation decisions for the sake of deleted
> > files, which doesn't make much sense, so scan the filesystem for dead
> > items to delete to try to avoid this.
> > 
> > Note that there are a handful of fstests that fill a filesystem, delete
> > just enough files to allow a single large allocation, and check that
> > fallocate actually gets the allocation.  These tests regress because the
> > test runs fallocate before the inode gc has a chance to run, so add this
> > behavior to maintain as much of the old behavior as possible.
> 
> I don't think this is a good justification for the change. Just
> because the unit tests exploit an undefined behaviour that no
> filesystem actually guarantees to acheive a specific layout, it
> doesn't mean we always have to behave that way.
> 
> For example, many tests used to use reverse sequential writes to
> exploit deficiencies in the allocation algorithms to generate
> fragmented files. We fixed that problem and the tests broke because
> they couldn't fragment files any more.
> 
> Did we reject those changes because the tests broke? No, we didn't
> because the tests were exploiting an observed behaviour rather than
> a guaranteed behaviour.
> 
> So, yeah, "test does X to make Y happen" doesn't mean "X will always
> make Y happen". It just means the test needs to be made more robust,
> or we have to provide a way for the test to trigger the behaviour it
> needs.
> 

Agree on all this..

> Indeed, I think that the way to fix these sorts of issues is to have
> the tests issue a syncfs(2) after they've deleted the inodes and have
> the filesystem run a inodegc flush as part of the sync mechanism.
> 

... but it seems a bit of a leap to equate exploitation of a
historically poorly handled allocation pattern in developer tests with
adding a new requirement (i.e. sync) to achieve optimal behavior of a
fairly common allocation pattern (delete a file, use the space for
something else).

IOW, how to hack around test regressions aside (are the test regressions
actual ENOSPC failures or something else, btw?), what's the impact on
users/workloads that might operate under these conditions? I guess
historically we've always recommended to not consistently operate in
<20% free space conditions, so to some degree there is an expectation
for less than optimal behavior if one decides to constantly bash an fs
into ENOSPC. Then again with large enough files, will/can we put the
filesystem into that state ourselves without any indication to the user?

I kind of wonder if unless/until there's some kind of efficient feedback
between allocation and "pending" free space, whether deferred
inactivation should be an optimization tied to some kind of heuristic
that balances the amount of currently available free space against
pending free space (but I've not combed through the code enough to grok
whether this already does something like that).

Brian

> Then we don't need to do.....
> 
> > +/*
> > + * If the target device (or some part of it) is full enough that it won't to be
> > + * able to satisfy the entire request, try to free inactive files to free up
> > + * space.  While it's perfectly fine to fill a preallocation request with a
> > + * bunch of short extents, we prefer to slow down preallocation requests to
> > + * combat long term fragmentation in new file data.
> > + */
> > +static int
> > +xfs_alloc_consolidate_freespace(
> > +	struct xfs_inode	*ip,
> > +	xfs_filblks_t		wanted)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_perag	*pag;
> > +	struct xfs_sb		*sbp = &mp->m_sb;
> > +	xfs_agnumber_t		agno;
> > +
> > +	if (!xfs_has_inodegc_work(mp))
> > +		return 0;
> > +
> > +	if (XFS_IS_REALTIME_INODE(ip)) {
> > +		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
> > +			return 0;
> > +		goto free_space;
> > +	}
> > +
> > +	for_each_perag(mp, agno, pag) {
> > +		if (pag->pagf_freeblks >= wanted) {
> > +			xfs_perag_put(pag);
> > +			return 0;
> > +		}
> > +	}
> 
> ... really hurty things (e.g. on high AG count fs) on every fallocate()
> call, and we have a simple modification to the tests that allow them
> to work as they want to on both old and new kernels....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low
  2021-06-08 11:48     ` Brian Foster
@ 2021-06-08 15:32       ` Darrick J. Wong
  2021-06-08 16:06         ` Brian Foster
  2021-06-08 21:55         ` Dave Chinner
  0 siblings, 2 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-08 15:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs, hch

On Tue, Jun 08, 2021 at 07:48:05AM -0400, Brian Foster wrote:
> On Tue, Jun 08, 2021 at 11:26:05AM +1000, Dave Chinner wrote:
> > On Mon, Jun 07, 2021 at 03:25:21PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Generally speaking, when a user calls fallocate, they're looking to
> > > preallocate space in a file in the largest contiguous chunks possible.
> > > If free space is low, it's possible that the free space will look
> > > unnecessarily fragmented because there are unlinked inodes that are
> > > holding on to space that we could allocate.  When this happens,
> > > fallocate makes suboptimal allocation decisions for the sake of deleted
> > > files, which doesn't make much sense, so scan the filesystem for dead
> > > items to delete to try to avoid this.
> > > 
> > > Note that there are a handful of fstests that fill a filesystem, delete
> > > just enough files to allow a single large allocation, and check that
> > > fallocate actually gets the allocation.  These tests regress because the
> > > test runs fallocate before the inode gc has a chance to run, so add this
> > > behavior to maintain as much of the old behavior as possible.
> > 
> > I don't think this is a good justification for the change. Just
> > because the unit tests exploit an undefined behaviour that no
> > filesystem actually guarantees to acheive a specific layout, it
> > doesn't mean we always have to behave that way.
> > 
> > For example, many tests used to use reverse sequential writes to
> > exploit deficiencies in the allocation algorithms to generate
> > fragmented files. We fixed that problem and the tests broke because
> > they couldn't fragment files any more.
> > 
> > Did we reject those changes because the tests broke? No, we didn't
> > because the tests were exploiting an observed behaviour rather than
> > a guaranteed behaviour.
> > 
> > So, yeah, "test does X to make Y happen" doesn't mean "X will always
> > make Y happen". It just means the test needs to be made more robust,
> > or we have to provide a way for the test to trigger the behaviour it
> > needs.
> > 
> 
> Agree on all this..
> 
> > Indeed, I think that the way to fix these sorts of issues is to have
> > the tests issue a syncfs(2) after they've deleted the inodes and have
> > the filesystem run a inodegc flush as part of the sync mechanism.
> > 
> 
> ... but it seems a bit of a leap to equate exploitation of a
> historically poorly handled allocation pattern in developer tests with
> adding a new requirement (i.e. sync) to achieve optimal behavior of a
> fairly common allocation pattern (delete a file, use the space for
> something else).
> 
> IOW, how to hack around test regressions aside (are the test regressions
> actual ENOSPC failures or something else, btw?), what's the impact on

They're not ENOSPC failures, they're fallocate layout tests that assume
that you can format the fs with a stripe alignment, fragment the free
space so that it isn't possible to obtain stripe-aligned blocks, delete
80% of the file(s) you used to fragment the free space, and fallocate a
stripe-aligned extent from the newly freed space in one go after the
last unlink() returns.  Unfortunately, I don't remember which test it
was that tripped over this.

IOWs, tests that confirm the historic behavior of XFS (and presumably
other filesystems) even though we don't guarantee anything about file
layout and never have.  This is a similar issue to the one Dave
complains about a few patches ago about needing to kick the inodegc
workers so that df reporting behavior stays the same as it has been on
xfs for ages.

We /might/ have figured out a solution to some of that nastiness.

> users/workloads that might operate under these conditions? I guess
> historically we've always recommended to not consistently operate in
> <20% free space conditions, so to some degree there is an expectation
> for less than optimal behavior if one decides to constantly bash an fs
> into ENOSPC. Then again with large enough files, will/can we put the
> filesystem into that state ourselves without any indication to the user?
> 
> I kind of wonder if unless/until there's some kind of efficient feedback
> between allocation and "pending" free space, whether deferred
> inactivation should be an optimization tied to some kind of heuristic
> that balances the amount of currently available free space against
> pending free space (but I've not combed through the code enough to grok
> whether this already does something like that).

Ooh!  You mentioned "efficient feedback", and one sprung immediately to
mind -- if the AG is near full (or above 80% full, or whatever) we
schedule the per-AG inodegc worker immediately instead of delaying it.

--D

> 
> Brian
> 
> > Then we don't need to do.....
> > 
> > > +/*
> > > + * If the target device (or some part of it) is full enough that it won't to be
> > > + * able to satisfy the entire request, try to free inactive files to free up
> > > + * space.  While it's perfectly fine to fill a preallocation request with a
> > > + * bunch of short extents, we prefer to slow down preallocation requests to
> > > + * combat long term fragmentation in new file data.
> > > + */
> > > +static int
> > > +xfs_alloc_consolidate_freespace(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_filblks_t		wanted)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	struct xfs_perag	*pag;
> > > +	struct xfs_sb		*sbp = &mp->m_sb;
> > > +	xfs_agnumber_t		agno;
> > > +
> > > +	if (!xfs_has_inodegc_work(mp))
> > > +		return 0;
> > > +
> > > +	if (XFS_IS_REALTIME_INODE(ip)) {
> > > +		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
> > > +			return 0;
> > > +		goto free_space;
> > > +	}
> > > +
> > > +	for_each_perag(mp, agno, pag) {
> > > +		if (pag->pagf_freeblks >= wanted) {
> > > +			xfs_perag_put(pag);
> > > +			return 0;
> > > +		}
> > > +	}
> > 
> > ... really hurty things (e.g. on high AG count fs) on every fallocate()
> > call, and we have a simple modification to the tests that allow them
> > to work as they want to on both old and new kernels....
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low
  2021-06-08 15:32       ` Darrick J. Wong
@ 2021-06-08 16:06         ` Brian Foster
  2021-06-08 21:55         ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Brian Foster @ 2021-06-08 16:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, hch

On Tue, Jun 08, 2021 at 08:32:04AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 08, 2021 at 07:48:05AM -0400, Brian Foster wrote:
> > On Tue, Jun 08, 2021 at 11:26:05AM +1000, Dave Chinner wrote:
> > > On Mon, Jun 07, 2021 at 03:25:21PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Generally speaking, when a user calls fallocate, they're looking to
> > > > preallocate space in a file in the largest contiguous chunks possible.
> > > > If free space is low, it's possible that the free space will look
> > > > unnecessarily fragmented because there are unlinked inodes that are
> > > > holding on to space that we could allocate.  When this happens,
> > > > fallocate makes suboptimal allocation decisions for the sake of deleted
> > > > files, which doesn't make much sense, so scan the filesystem for dead
> > > > items to delete to try to avoid this.
> > > > 
> > > > Note that there are a handful of fstests that fill a filesystem, delete
> > > > just enough files to allow a single large allocation, and check that
> > > > fallocate actually gets the allocation.  These tests regress because the
> > > > test runs fallocate before the inode gc has a chance to run, so add this
> > > > behavior to maintain as much of the old behavior as possible.
> > > 
> > > I don't think this is a good justification for the change. Just
> > > because the unit tests exploit an undefined behaviour that no
> > > filesystem actually guarantees to acheive a specific layout, it
> > > doesn't mean we always have to behave that way.
> > > 
> > > For example, many tests used to use reverse sequential writes to
> > > exploit deficiencies in the allocation algorithms to generate
> > > fragmented files. We fixed that problem and the tests broke because
> > > they couldn't fragment files any more.
> > > 
> > > Did we reject those changes because the tests broke? No, we didn't
> > > because the tests were exploiting an observed behaviour rather than
> > > a guaranteed behaviour.
> > > 
> > > So, yeah, "test does X to make Y happen" doesn't mean "X will always
> > > make Y happen". It just means the test needs to be made more robust,
> > > or we have to provide a way for the test to trigger the behaviour it
> > > needs.
> > > 
> > 
> > Agree on all this..
> > 
> > > Indeed, I think that the way to fix these sorts of issues is to have
> > > the tests issue a syncfs(2) after they've deleted the inodes and have
> > > the filesystem run a inodegc flush as part of the sync mechanism.
> > > 
> > 
> > ... but it seems a bit of a leap to equate exploitation of a
> > historically poorly handled allocation pattern in developer tests with
> > adding a new requirement (i.e. sync) to achieve optimal behavior of a
> > fairly common allocation pattern (delete a file, use the space for
> > something else).
> > 
> > IOW, how to hack around test regressions aside (are the test regressions
> > actual ENOSPC failures or something else, btw?), what's the impact on
> 
> They're not ENOSPC failures, they're fallocate layout tests that assume
> that you can format the fs with a stripe alignment, fragment the free
> space so that it isn't possible to obtain stripe-aligned blocks, delete
> 80% of the file(s) you used to fragment the free space, and fallocate a
> stripe-aligned extent from the newly freed space in one go after the
> last unlink() returns.  Unfortunately, I don't remember which test it
> was that tripped over this.
> 
> IOWs, tests that confirm the historic behavior of XFS (and presumably
> other filesystems) even though we don't guarantee anything about file
> layout and never have.  This is a similar issue to the one Dave
> complains about a few patches ago about needing to kick the inodegc
> workers so that df reporting behavior stays the same as it has been on
> xfs for ages.
> 

Ah, Ok. That's less critical than what I was thinking when I read "test
regression" in the earlier comments.. thanks. To Dave's earlier point, I
think it's reasonable to update tests if they rely on non-guaranteed
behavior as such. I just wanted to make sure that the associated impact
on the user wasn't terrible.

> We /might/ have figured out a solution to some of that nastiness.
> 
> > users/workloads that might operate under these conditions? I guess
> > historically we've always recommended to not consistently operate in
> > <20% free space conditions, so to some degree there is an expectation
> > for less than optimal behavior if one decides to constantly bash an fs
> > into ENOSPC. Then again with large enough files, will/can we put the
> > filesystem into that state ourselves without any indication to the user?
> > 
> > I kind of wonder if unless/until there's some kind of efficient feedback
> > between allocation and "pending" free space, whether deferred
> > inactivation should be an optimization tied to some kind of heuristic
> > that balances the amount of currently available free space against
> > pending free space (but I've not combed through the code enough to grok
> > whether this already does something like that).
> 
> Ooh!  You mentioned "efficient feedback", and one sprung immediately to
> mind -- if the AG is near full (or above 80% full, or whatever) we
> schedule the per-AG inodegc worker immediately instead of delaying it.
> 

Indeed, or not deferring at all in that case (not sure if there's much
different there anyways?). Either way, something like that might be a
nice incremental step to get the mechanism in place while (hopefully)
avoiding some of these corner case behaviors that might require more
thought and care to get right. TBH, even with something more
conservative like "defer/delay when <50% full && fs >= some minimum
size" might be a reasonable starting point. We could always bypass that
entirely via DEBUG=1 to get the full effect. Just a handwavy thought
though..

Brian

> --D
> 
> > 
> > Brian
> > 
> > > Then we don't need to do.....
> > > 
> > > > +/*
> > > > + * If the target device (or some part of it) is full enough that it won't to be
> > > > + * able to satisfy the entire request, try to free inactive files to free up
> > > > + * space.  While it's perfectly fine to fill a preallocation request with a
> > > > + * bunch of short extents, we prefer to slow down preallocation requests to
> > > > + * combat long term fragmentation in new file data.
> > > > + */
> > > > +static int
> > > > +xfs_alloc_consolidate_freespace(
> > > > +	struct xfs_inode	*ip,
> > > > +	xfs_filblks_t		wanted)
> > > > +{
> > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > +	struct xfs_perag	*pag;
> > > > +	struct xfs_sb		*sbp = &mp->m_sb;
> > > > +	xfs_agnumber_t		agno;
> > > > +
> > > > +	if (!xfs_has_inodegc_work(mp))
> > > > +		return 0;
> > > > +
> > > > +	if (XFS_IS_REALTIME_INODE(ip)) {
> > > > +		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
> > > > +			return 0;
> > > > +		goto free_space;
> > > > +	}
> > > > +
> > > > +	for_each_perag(mp, agno, pag) {
> > > > +		if (pag->pagf_freeblks >= wanted) {
> > > > +			xfs_perag_put(pag);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > 
> > > ... really hurty things (e.g. on high AG count fs) on every fallocate()
> > > call, and we have a simple modification to the tests that allow them
> > > to work as they want to on both old and new kernels....
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low
  2021-06-08 15:32       ` Darrick J. Wong
  2021-06-08 16:06         ` Brian Foster
@ 2021-06-08 21:55         ` Dave Chinner
  2021-06-09  0:25           ` Darrick J. Wong
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-06-08 21:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, hch

On Tue, Jun 08, 2021 at 08:32:04AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 08, 2021 at 07:48:05AM -0400, Brian Foster wrote:
> > users/workloads that might operate under these conditions? I guess
> > historically we've always recommended to not consistently operate in
> > <20% free space conditions, so to some degree there is an expectation
> > for less than optimal behavior if one decides to constantly bash an fs
> > into ENOSPC. Then again with large enough files, will/can we put the
> > filesystem into that state ourselves without any indication to the user?
> > 
> > I kind of wonder if unless/until there's some kind of efficient feedback
> > between allocation and "pending" free space, whether deferred
> > inactivation should be an optimization tied to some kind of heuristic
> > that balances the amount of currently available free space against
> > pending free space (but I've not combed through the code enough to grok
> > whether this already does something like that).
> 
> Ooh!  You mentioned "efficient feedback", and one sprung immediately to
> mind -- if the AG is near full (or above 80% full, or whatever) we
> schedule the per-AG inodegc worker immediately instead of delaying it.

That's what the lowspace thresholds in speculative preallocation are
for...

20% of a 1TB AG is an awful lot of freespace still remaining, and
if someone is asking for a 200GB fallocate(), they are always going
to get some fragmentation on a used, 80% full filesystem regardless
of deferred inode inactivation.

IMO, if you're going to do this, use the same thresholds we already
use to limit preallocation near global ENOSPC and graduate it to be
more severe the closer we get to global ENOSPC...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low
  2021-06-08 21:55         ` Dave Chinner
@ 2021-06-09  0:25           ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-09  0:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs, hch

On Wed, Jun 09, 2021 at 07:55:42AM +1000, Dave Chinner wrote:
> On Tue, Jun 08, 2021 at 08:32:04AM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 08, 2021 at 07:48:05AM -0400, Brian Foster wrote:
> > > users/workloads that might operate under these conditions? I guess
> > > historically we've always recommended to not consistently operate in
> > > <20% free space conditions, so to some degree there is an expectation
> > > for less than optimal behavior if one decides to constantly bash an fs
> > > into ENOSPC. Then again with large enough files, will/can we put the
> > > filesystem into that state ourselves without any indication to the user?
> > > 
> > > I kind of wonder if unless/until there's some kind of efficient feedback
> > > between allocation and "pending" free space, whether deferred
> > > inactivation should be an optimization tied to some kind of heuristic
> > > that balances the amount of currently available free space against
> > > pending free space (but I've not combed through the code enough to grok
> > > whether this already does something like that).
> > 
> > Ooh!  You mentioned "efficient feedback", and one sprung immediately to
> > mind -- if the AG is near full (or above 80% full, or whatever) we
> > schedule the per-AG inodegc worker immediately instead of delaying it.
> 
> That's what the lowspace thresholds in speculative preallocation are
> for...
> 
> 20% of a 1TB AG is an awful lot of freespace still remaining, and
> if someone is asking for a 200GB fallocate(), they are always going
> to get some fragmentation on a used, 80% full filesystem regardless
> of deferred inode inactivation.
> 
> IMO, if you're going to do this, use the same thresholds we already
> use to limit preallocation near global ENOSPC and graduate it to be
> more severe the closer we get to global ENOSPC...

Ok.  I'll just crib the same 5/4/3/2/1% thresholds like prealloc, then.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/9] xfs: deferred inode inactivation
  2021-06-08  4:40     ` Darrick J. Wong
@ 2021-06-09  1:01       ` Dave Chinner
  2021-06-09  1:28         ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-06-09  1:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Mon, Jun 07, 2021 at 09:40:51PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 08, 2021 at 10:57:53AM +1000, Dave Chinner wrote:
> > On Mon, Jun 07, 2021 at 03:25:04PM -0700, Darrick J. Wong wrote:
> > > +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> > > +static void
> > > +xfs_inodegc_queue(
> > > +	struct xfs_mount        *mp)
> > > +{
> > > +	if (!xfs_inodegc_running(mp))
> > > +		return;
> > > +
> > > +	rcu_read_lock();
> > > +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
> > > +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> > > +	rcu_read_unlock();
> > > +}
> > 
> > I have no idea why we are checking if the gc is running here. All
> > our other background stuff runs and re-queues until it is directly
> > stopped or there's nothing left in the tree. Hence I'm a bit
> > clueless right now about what this semantic is for...
> 
> The opflag is supposed to control whether inactivation actually runs.
> As you might guess from _running calls being scattered everywhere, it's
> pretty ugly code.  All this crap exists because there's no easy solution
> to shutting down background threads after we commit to freezing the fs
> but before an fs freeze hits SB_FREEZE_FS and we can't allocate new
> transactions.
> 
> Fixing inactivation to use NO_WRITECOUNT means auditing every function
> call that xfs_inactive makes to look for an xfs_trans_alloc* call and
> probably modifying all of them to be able to switch between regular and
> NOWRITECOUNT mode.  I tried that, it's ugly.
> 
> Another solution is to add ->freeze_super and ->thaw_super functions
> to prevent a FITHAW caller from racing with a FIFREEZE caller and
> accidentally rearming the inodegc while a freeze starts.

This seems like the right way to proceed.

Of course, all the freeze/thaw exclusion is in freeze_super and
thaw_super via the umount lock, so to do this we need to factor
freeze_super() so that we can take the active references to the
superblock ourselves, then turn off inodegc, then run the generic
freeze_super() code...

The unfreeze side where we'd turn the inode gc back on is already
inside the protected region (via ->unfreeze_fs callout in
thaw_super_locked()) so we don't need to do anything special there.

[I'll reorder bits to address all the other freeze stuff now so it's
not all mixed up with the stat/df stuff]

> > > +/* Stop all queued inactivation work. */
> > > +void
> > > +xfs_inodegc_stop(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > > +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> > > +}
> > 
> > what's to stop racing invocations of stop/start? Perhaps:
> > 
> > 	if (test_and_clear_bit())
> > 		cancel_delayed_work_sync(&mp->m_inodegc_work);
> 
> That horrible hack below.
>
> > > +
> > > +/* Schedule deferred inode inactivation work. */
> > > +void
> > > +xfs_inodegc_start(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > > +	xfs_inodegc_queue(mp);
> > > +}
> > 
> > 	if (test_and_set_bit())
> > 		xfs_inodegc_queue(mp);
> > 
> > So that the running state will remain in sync with the actual queue
> > operation? Though I'm still not sure why we need the running bit...
> 
> (see ugly sync_fs SB_FREEZE_PAGEFAULTS hack)

I'm not sure how that addresses any sort of concurrent set/clear
that could occur as it doesn't guarantee that the running state
matches the opflag bit state...

> > Ok, "opflags" are undocumented as to how they work, what their
> > consistency model is, etc. I understand you want an atomic flag to
> > indicate that something is running, and mp->m_flags is not that
> > (despite being used that way historically). 
> > 
> > I dislike the "_BIT" annotations for a variable that is only to be
> > used as an index bit field. Or maybe it's a flag field and you
> > haven't defined any bitwise flags for it because you're not using it
> > that way yet.
> > 
> > So, is m_opflags an indexed bit field or a normal flag field like
> > m_flags?
> 
> It's an indexed bit field, which is why I named it _BIT.  I'll try to
> add more documentation around what this thing is and how the flags work:
> 
> struct xfs_mount {
> 	...
> 	/*
> 	 * This atomic bitset controls flags that alter the behavior of
> 	 * the filesystem.  Use only the atomic bit helper functions
> 	 * here; see XFS_OPFLAG_* for information about the actual
> 	 * flags.
> 	 */
> 	unsigned long		m_opflags;
> 	...
> };
> 
> /*
>  * Operation flags -- each entry here is a bit index into m_opflags and
>  * is not itself a flag value.
>  */
> 
> /* Are we allowed to run the inode inactivation worker? */
> #define XFS_OPFLAG_INODEGC_RUNNING_BIT	(0)

This doesn't really address my comments - there's still the _BIT
annotation mixed with "flags" variables. Other examples of this are
that "operational flags" or state variables are updated via
set/clear/test/etc bit op wrappers. An example of this is page and
bufferhead state bits and variables...

I mentioned on #xfs an older patchset I had that cleaned up a lot of
this cruft in the xfs_mount flags fields by separating feature flags
from state flags. That can be found here:

https://lore.kernel.org/linux-xfs/20180820044851.414-1-david@fromorbit.com/

I think if we are going to introduce dynamic mount state flags, we
need to move towards that sort of separation. So leave this patch
set as it is now with the opflags, and I'll update my flag vs state
rework patchset and merge this new code into it...

That all said, I still don't really see a need for a state bit here
if we can stop the inode gc before we start the freeze process as
via a xfs_fs_freeze_super() method.

(and that's freeze done...)

> > > @@ -947,6 +963,7 @@ xfs_mountfs(
> > >  	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
> > >  	 * quota inodes.
> > >  	 */
> > > +	xfs_inodegc_flush(mp);
> > >  	xfs_unmount_flush_inodes(mp);
> > 
> > Why isn't xfs_inodegc_flush() part of xfs_unmount_flush_inodes()?
> > Because, really, xfs_unmount_flush_inodes() depends on all the
> > inodes first being inactivated so that all transactions on inodes
> > are complete....
> 
> The teardown sequence is not the same between a regular unmount and an
> aborted mount...
> 
> > >   out_log_dealloc:
> > >  	xfs_log_mount_cancel(mp);
> > > @@ -983,6 +1000,12 @@ xfs_unmountfs(
> > >  	uint64_t		resblks;
> > >  	int			error;
> > >  
> > > +	/*
> > > +	 * Flush all the queued inode inactivation work to disk before tearing
> > > +	 * down rt metadata and quotas.
> > > +	 */
> > > +	xfs_inodegc_flush(mp);
> > > +
> > >  	xfs_blockgc_stop(mp);
> > >  	xfs_fs_unreserve_ag_blocks(mp);
> > >  	xfs_qm_unmount_quotas(mp);
> > 
> > FWIW, there's inconsistency in the order of operations between
> > failure handling in xfs_mountfs() w.r.t. inode flushing and quotas
> > vs what xfs_unmountfs() is now doing....
> 
> ...because during regular unmountfs, we want to inactivate inodes while
> we still have a per-ag reservation protecting finobt expansions.  During
> an aborted mount, we don't necessarily have the reservation set up but
> we have to clean everything out, so the inodegc flush comes much later.
> 
> It's convoluted, but do you want me to clean /that/ up too?  That's a
> pretty heavy lift; I already tried to fix those two paths, ran out of
> brain cells, and gave up.

No, I was just noting that they are different and there was no clear
explaination of why. A comment explaining the difference is really
all I am looking for here...

(and now df vs unlink....)

> > > @@ -80,4 +80,37 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
> > >  void xfs_blockgc_stop(struct xfs_mount *mp);
> > >  void xfs_blockgc_start(struct xfs_mount *mp);
> > >  
> > > +void xfs_inodegc_worker(struct work_struct *work);
> > > +void xfs_inodegc_flush(struct xfs_mount *mp);
> > > +void xfs_inodegc_stop(struct xfs_mount *mp);
> > > +void xfs_inodegc_start(struct xfs_mount *mp);
> > > +int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
> > > +
> > > +/*
> > > + * Process all pending inode inactivations immediately (sort of) so that a
> > > + * resource usage report will be mostly accurate with regards to files that
> > > + * have been unlinked recently.
> > > + *
> > > + * It isn't practical to maintain a count of the resources used by unlinked
> > > + * inodes to adjust the values reported by this function.  Resources that are
> > > + * shared (e.g. reflink) when an inode is queued for inactivation cannot be
> > > + * counted towards the adjustment, and cross referencing data extents with the
> > > + * refcount btree is the only way to decide if a resource is shared.  Worse,
> > > + * unsharing of any data blocks in the system requires either a second
> > > + * consultation with the refcount btree, or training users to deal with the
> > > + * free space counts possibly fluctuating upwards as inactivations occur.
> > > + *
> > > + * Hence we guard the inactivation flush with a ratelimiter so that the counts
> > > + * are not way out of whack while ignoring workloads that hammer us with statfs
> > > + * calls.  Once per clock tick seems frequent enough to avoid complaints about
> > > + * inaccurate counts.
> > > + */
> > > +static inline void
> > > +xfs_inodegc_summary_flush(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	if (__ratelimit(&mp->m_inodegc_ratelimit))
> > > +		xfs_inodegc_flush(mp);
> > > +}
> > 
> > ONce per clock tick is still quite fast - once a millisecond on a
> > 1000Hz kernel. I'd prefer that we use a known timer base for this
> > sort of thing, not something that changes with kernel config. Every
> > 100ms, perhaps?
> 
> I tried a number of ratelimit values here: 1ms, 4ms, 10ms, 100ms, 500ms,
> 2s, and 15s.  fstests and most everything else seemed to act the same up
> to 10ms.  At 100ms, the tests that delete something and immediately run
> df will start to fail, and above that all hell breaks loose because many
> tests (from which I extrapolate most programmers) expect that statfs
> won't run until unlink has deleted everything.

So the main problem I have with this is that it it blocks the caller
until inactivation is done. For something that is supposed to be
fast and non-blocking, this is a bad thing to do.

The quota usage (called on every get/get_next syscall) is really the
only one that should be called with any frequency - if anyone is
calling statfs so fast that we have to rate limit gc flushes, then
they aren't getting any useful information in the delta between
calls a millisecond apart.

Hence I suspect that flushes and/or rate limited flushes are not
necessary at all here. Why not just deal with it like we do the
inode flush at ENOSPC (i.e. xfs_flush_inodes())? i.e. we try to
flush the work first, and if that returns true we waited on a flush
already in progress and we don't need to do our own? Indeed, why
aren't all the inodegc flushes done this way?

For the quota case, I think doing a flush on the first get call
would be sufficient - doing one for every "next" call doesn't make
much sense, because we've already done a flush at the start of the
dquot get walk. IOWs, we've done the work necessary for a point in
time snapshot of the quota state that is largely consistent across
all the quotas at the time the walk started. Hence I don't think we
need to keep flushing over and over again....

For fs_counts, it is non-blocking, best effort only.  The percpu
counters are read, not summed, so they are inaccurate to begin with.
Hence there's not need to flush inactivated inodes there becuse the
counters are not guaranteed accurate. If we do need a flush, then
just do it unconditionally because anyone calling this with
extremely high frequency really isn't going to get anything useful
from it.

For statfs, we actually sum the percpu counters, so we should just
flush the inodegc before this. If someone is calling statfs with
high enough frequency that rate limiting inodegc flushes is actually
needed, then they've already got substantial problems on large CPU
count machines..

Hence I think we should just have flushes where they are needed, and
change the flush to block and return if a flush is already in
progress rather than doing an entire new flush itself. That
effectively rate limits the flushing, but doesn't prevent a flush
when none has been done in a while due to ratelimit state...

> > I suspect that's really going to hurt stat performance. I guess
> > benchmarks are in order...
> 
> Ok, so here's the question I have: Does POSIX (or any other standard we
> care to fit) actually define any behavior between the following
> sequence:
> 
> unlink("/foo");	/* no sync calls of any kind */
> statfs("/");

None that I know of. the man page for statfs even says:

"buf is a pointer to a statfs structure defined approximately as
follows"

so even the man page is extremely cagey about what is actually
returned in the statfs buffer. Free space counters not being totally
accurate at any specific point in time fits with the "approximately
defined" behaviour description in the man page. As long as we do, in
the near term, correct account for deferred operations, then we're
all good.

> As I've mentioned in the past, the only reason we need these inodegc
> flushes for summary reporting is because users expect that if they
> delete an unshared 10GB file, they can immediately df and see that the
> inode count went down by one and free space went up by 10GB.
> 
> I /guess/ we could modify every single fstest to sync before checking
> summary counts instead of doing this, but I bet there will be some users
> who will be surprised that xfs now has *trfs df logic.

If fstests needs accurate counters, it should use 'df --sync' and we
should make sure that the syncfs implementation triggers a flush of
inactive inodes before it returns. We don't have to guarantee that
the inactivation is on stable storage, but it would make sense that
we trigger background ops to get the free space accounting near to
accurate...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/9] xfs: deferred inode inactivation
  2021-06-09  1:01       ` Dave Chinner
@ 2021-06-09  1:28         ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-06-09  1:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Wed, Jun 09, 2021 at 11:01:09AM +1000, Dave Chinner wrote:
> On Mon, Jun 07, 2021 at 09:40:51PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 08, 2021 at 10:57:53AM +1000, Dave Chinner wrote:
> > > On Mon, Jun 07, 2021 at 03:25:04PM -0700, Darrick J. Wong wrote:
> > > > +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> > > > +static void
> > > > +xfs_inodegc_queue(
> > > > +	struct xfs_mount        *mp)
> > > > +{
> > > > +	if (!xfs_inodegc_running(mp))
> > > > +		return;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
> > > > +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> > > > +	rcu_read_unlock();
> > > > +}
> > > 
> > > I have no idea why we are checking if the gc is running here. All
> > > our other background stuff runs and re-queues until it is directly
> > > stopped or there's nothing left in the tree. Hence I'm a bit
> > > clueless right now about what this semantic is for...
> > 
> > The opflag is supposed to control whether inactivation actually runs.
> > As you might guess from _running calls being scattered everywhere, it's
> > pretty ugly code.  All this crap exists because there's no easy solution
> > to shutting down background threads after we commit to freezing the fs
> > but before an fs freeze hits SB_FREEZE_FS and we can't allocate new
> > transactions.
> > 
> > Fixing inactivation to use NO_WRITECOUNT means auditing every function
> > call that xfs_inactive makes to look for an xfs_trans_alloc* call and
> > probably modifying all of them to be able to switch between regular and
> > NOWRITECOUNT mode.  I tried that, it's ugly.
> > 
> > Another solution is to add ->freeze_super and ->thaw_super functions
> > to prevent a FITHAW caller from racing with a FIFREEZE caller and
> > accidentally rearming the inodegc while a freeze starts.
> 
> This seems like the right way to proceed.
> 
> Of course, all the freeze/thaw exclusion is in freeze_super and
> thaw_super via the umount lock, so to do this we need to factor
> freeze_super() so that we can take the active references to the
> superblock ourselves, then turn off inodegc, then run the generic
> freeze_super() code...
> 
> The unfreeze side where we'd turn the inode gc back on is already
> inside the protected region (via ->unfreeze_fs callout in
> thaw_super_locked()) so we don't need to do anything special there.

<nod> Not sure I want to go factoring vfs functions if I can avoid it,
though I guess there's always the copy-pasta approach :P

But for now, I /think/ I figured out a way to make it work and avoid all
that (see below).

> [I'll reorder bits to address all the other freeze stuff now so it's
> not all mixed up with the stat/df stuff]
> 
> > > > +/* Stop all queued inactivation work. */
> > > > +void
> > > > +xfs_inodegc_stop(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > > > +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> > > > +}
> > > 
> > > what's to stop racing invocations of stop/start? Perhaps:
> > > 
> > > 	if (test_and_clear_bit())
> > > 		cancel_delayed_work_sync(&mp->m_inodegc_work);
> > 
> > That horrible hack below.
> >
> > > > +
> > > > +/* Schedule deferred inode inactivation work. */
> > > > +void
> > > > +xfs_inodegc_start(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > > > +	xfs_inodegc_queue(mp);
> > > > +}
> > > 
> > > 	if (test_and_set_bit())
> > > 		xfs_inodegc_queue(mp);
> > > 
> > > So that the running state will remain in sync with the actual queue
> > > operation? Though I'm still not sure why we need the running bit...
> > 
> > (see ugly sync_fs SB_FREEZE_PAGEFAULTS hack)
> 
> I'm not sure how that addresses any sort of concurrent set/clear
> that could occur as it doesn't guarantee that the running state
> matches the opflag bit state...
> 
> > > Ok, "opflags" are undocumented as to how they work, what their
> > > consistency model is, etc. I understand you want an atomic flag to
> > > indicate that something is running, and mp->m_flags is not that
> > > (despite being used that way historically). 
> > > 
> > > I dislike the "_BIT" annotations for a variable that is only to be
> > > used as an index bit field. Or maybe it's a flag field and you
> > > haven't defined any bitwise flags for it because you're not using it
> > > that way yet.
> > > 
> > > So, is m_opflags an indexed bit field or a normal flag field like
> > > m_flags?
> > 
> > It's an indexed bit field, which is why I named it _BIT.  I'll try to
> > add more documentation around what this thing is and how the flags work:
> > 
> > struct xfs_mount {
> > 	...
> > 	/*
> > 	 * This atomic bitset controls flags that alter the behavior of
> > 	 * the filesystem.  Use only the atomic bit helper functions
> > 	 * here; see XFS_OPFLAG_* for information about the actual
> > 	 * flags.
> > 	 */
> > 	unsigned long		m_opflags;
> > 	...
> > };
> > 
> > /*
> >  * Operation flags -- each entry here is a bit index into m_opflags and
> >  * is not itself a flag value.
> >  */
> > 
> > /* Are we allowed to run the inode inactivation worker? */
> > #define XFS_OPFLAG_INODEGC_RUNNING_BIT	(0)
> 
> This doesn't really address my comments - there's still the _BIT
> annotation mixed with "flags" variables. Other examples of this are
> that "operational flags" or state variables are updated via
> set/clear/test/etc bit op wrappers. An example of this is page and
> bufferhead state bits and variables...

Urk, it was late last night and I forgot to update the reply after I
changed that to an enum.

> I mentioned on #xfs an older patchset I had that cleaned up a lot of
> this cruft in the xfs_mount flags fields by separating feature flags
> from state flags. That can be found here:
> 
> https://lore.kernel.org/linux-xfs/20180820044851.414-1-david@fromorbit.com/
> 
> I think if we are going to introduce dynamic mount state flags, we
> need to move towards that sort of separation. So leave this patch
> set as it is now with the opflags, and I'll update my flag vs state
> rework patchset and merge this new code into it...
> 
> That all said, I still don't really see a need for a state bit here
> if we can stop the inode gc before we start the freeze process as
> via a xfs_fs_freeze_super() method.
> 
> (and that's freeze done...)
> 
> > > > @@ -947,6 +963,7 @@ xfs_mountfs(
> > > >  	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
> > > >  	 * quota inodes.
> > > >  	 */
> > > > +	xfs_inodegc_flush(mp);
> > > >  	xfs_unmount_flush_inodes(mp);
> > > 
> > > Why isn't xfs_inodegc_flush() part of xfs_unmount_flush_inodes()?
> > > Because, really, xfs_unmount_flush_inodes() depends on all the
> > > inodes first being inactivated so that all transactions on inodes
> > > are complete....
> > 
> > The teardown sequence is not the same between a regular unmount and an
> > aborted mount...
> > 
> > > >   out_log_dealloc:
> > > >  	xfs_log_mount_cancel(mp);
> > > > @@ -983,6 +1000,12 @@ xfs_unmountfs(
> > > >  	uint64_t		resblks;
> > > >  	int			error;
> > > >  
> > > > +	/*
> > > > +	 * Flush all the queued inode inactivation work to disk before tearing
> > > > +	 * down rt metadata and quotas.
> > > > +	 */
> > > > +	xfs_inodegc_flush(mp);
> > > > +
> > > >  	xfs_blockgc_stop(mp);
> > > >  	xfs_fs_unreserve_ag_blocks(mp);
> > > >  	xfs_qm_unmount_quotas(mp);
> > > 
> > > FWIW, there's inconsistency in the order of operations between
> > > failure handling in xfs_mountfs() w.r.t. inode flushing and quotas
> > > vs what xfs_unmountfs() is now doing....
> > 
> > ...because during regular unmountfs, we want to inactivate inodes while
> > we still have a per-ag reservation protecting finobt expansions.  During
> > an aborted mount, we don't necessarily have the reservation set up but
> > we have to clean everything out, so the inodegc flush comes much later.
> > 
> > It's convoluted, but do you want me to clean /that/ up too?  That's a
> > pretty heavy lift; I already tried to fix those two paths, ran out of
> > brain cells, and gave up.
> 
> No, I was just noting that they are different and there was no clear
> explaination of why. A comment explaining the difference is really
> all I am looking for here...

Ok.  I'll document why xfs_unmountfs calls inodegc_flush explicitly:

	/*
	 * Perform all on-disk metadata updates required to inactivate
	 * inodes.  Freeing inodes can cause shape changes to the
	 * finobt, so take care of this before we lose the per-AG space
	 * reservations.
	 */
	xfs_inodegc_flush(mp);

It shouldn't be necessary to call it again from xfs_unmount_flush_inodes
since the only inodes that should be in memory at that point are the
quota, realtime, and root directory inodes, none of which ever go
through inactivation (quota/rt are metadata, and the root dir should
never get deleted).

However, it's a bit murkier for the failed mountfs case -- I /think/ I
put the flush in there to make sure that we always sent all inodes to
reclaim no matter what weird things happened during mount.  In theory
it's always a NOP since log recovery inactivates all the inodes it
touches (if necessary) and the only other inodes that mount touches are
the quota/rt/rootdir inodes.

> (and now df vs unlink....)
> 
> > > > @@ -80,4 +80,37 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
> > > >  void xfs_blockgc_stop(struct xfs_mount *mp);
> > > >  void xfs_blockgc_start(struct xfs_mount *mp);
> > > >  
> > > > +void xfs_inodegc_worker(struct work_struct *work);
> > > > +void xfs_inodegc_flush(struct xfs_mount *mp);
> > > > +void xfs_inodegc_stop(struct xfs_mount *mp);
> > > > +void xfs_inodegc_start(struct xfs_mount *mp);
> > > > +int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
> > > > +
> > > > +/*
> > > > + * Process all pending inode inactivations immediately (sort of) so that a
> > > > + * resource usage report will be mostly accurate with regards to files that
> > > > + * have been unlinked recently.
> > > > + *
> > > > + * It isn't practical to maintain a count of the resources used by unlinked
> > > > + * inodes to adjust the values reported by this function.  Resources that are
> > > > + * shared (e.g. reflink) when an inode is queued for inactivation cannot be
> > > > + * counted towards the adjustment, and cross referencing data extents with the
> > > > + * refcount btree is the only way to decide if a resource is shared.  Worse,
> > > > + * unsharing of any data blocks in the system requires either a second
> > > > + * consultation with the refcount btree, or training users to deal with the
> > > > + * free space counts possibly fluctuating upwards as inactivations occur.
> > > > + *
> > > > + * Hence we guard the inactivation flush with a ratelimiter so that the counts
> > > > + * are not way out of whack while ignoring workloads that hammer us with statfs
> > > > + * calls.  Once per clock tick seems frequent enough to avoid complaints about
> > > > + * inaccurate counts.
> > > > + */
> > > > +static inline void
> > > > +xfs_inodegc_summary_flush(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	if (__ratelimit(&mp->m_inodegc_ratelimit))
> > > > +		xfs_inodegc_flush(mp);
> > > > +}
> > > 
> > > ONce per clock tick is still quite fast - once a millisecond on a
> > > 1000Hz kernel. I'd prefer that we use a known timer base for this
> > > sort of thing, not something that changes with kernel config. Every
> > > 100ms, perhaps?
> > 
> > I tried a number of ratelimit values here: 1ms, 4ms, 10ms, 100ms, 500ms,
> > 2s, and 15s.  fstests and most everything else seemed to act the same up
> > to 10ms.  At 100ms, the tests that delete something and immediately run
> > df will start to fail, and above that all hell breaks loose because many
> > tests (from which I extrapolate most programmers) expect that statfs
> > won't run until unlink has deleted everything.
> 
> So the main problem I have with this is that it it blocks the caller
> until inactivation is done. For something that is supposed to be
> fast and non-blocking, this is a bad thing to do.
> 
> The quota usage (called on every get/get_next syscall) is really the
> only one that should be called with any frequency - if anyone is
> calling statfs so fast that we have to rate limit gc flushes, then
> they aren't getting any useful information in the delta between
> calls a millisecond apart.
> 
> Hence I suspect that flushes and/or rate limited flushes are not
> necessary at all here. Why not just deal with it like we do the
> inode flush at ENOSPC (i.e. xfs_flush_inodes())? i.e. we try to
> flush the work first, and if that returns true we waited on a flush
> already in progress and we don't need to do our own? Indeed, why
> aren't all the inodegc flushes done this way?
> 
> For the quota case, I think doing a flush on the first get call
> would be sufficient - doing one for every "next" call doesn't make
> much sense, because we've already done a flush at the start of the
> dquot get walk. IOWs, we've done the work necessary for a point in
> time snapshot of the quota state that is largely consistent across
> all the quotas at the time the walk started. Hence I don't think we
> need to keep flushing over and over again....

<nod> The quota case might get put back, then.

> For fs_counts, it is non-blocking, best effort only.  The percpu
> counters are read, not summed, so they are inaccurate to begin with.
> Hence there's not need to flush inactivated inodes there becuse the
> counters are not guaranteed accurate. If we do need a flush, then
> just do it unconditionally because anyone calling this with
> extremely high frequency really isn't going to get anything useful
> from it.
> 
> For statfs, we actually sum the percpu counters, so we should just
> flush the inodegc before this. If someone is calling statfs with
> high enough frequency that rate limiting inodegc flushes is actually
> needed, then they've already got substantial problems on large CPU
> count machines..
> 
> Hence I think we should just have flushes where they are needed, and
> change the flush to block and return if a flush is already in
> progress rather than doing an entire new flush itself. That
> effectively rate limits the flushing, but doesn't prevent a flush
> when none has been done in a while due to ratelimit state...

For now, I'm taking your suggestion to kick off the inactivation work at
the end of xfs_fs_syncfs, since regular syncfs is known to involve disk
access already, and the people who "unlink(); syncfs(); statfs();" like
they're supposed to will still get the results they expect.

I also made it so that if fdblocks is below the low space thresholds
then we'll delay the inode/blockgc workers less and less, and removed
xfs_inodegc_summary_flush completely.

I /think/ that flushing inodegc during a syncfs also solves the problem
of coordinating inodegc flush and stop with the rest of freeze.  I might
still need the opflag to prevent requeuing if a destroy_inode races with
a readonly remount, but I need to recheck that a little more carefully
tomorrow when I'm not tired.

But, at worst, if that doesn't work, I can open-code freeze_super to
call the parts of XFS that we really want.

> 
> > > I suspect that's really going to hurt stat performance. I guess
> > > benchmarks are in order...
> > 
> > Ok, so here's the question I have: Does POSIX (or any other standard we
> > care to fit) actually define any behavior between the following
> > sequence:
> > 
> > unlink("/foo");	/* no sync calls of any kind */
> > statfs("/");
> 
> None that I know of. the man page for statfs even says:
> 
> "buf is a pointer to a statfs structure defined approximately as
> follows"
> 
> so even the man page is extremely cagey about what is actually
> returned in the statfs buffer. Free space counters not being totally
> accurate at any specific point in time fits with the "approximately
> defined" behaviour description in the man page. As long as we do, in
> the near term, correct account for deferred operations, then we're
> all good.
> 
> > As I've mentioned in the past, the only reason we need these inodegc
> > flushes for summary reporting is because users expect that if they
> > delete an unshared 10GB file, they can immediately df and see that the
> > inode count went down by one and free space went up by 10GB.
> > 
> > I /guess/ we could modify every single fstest to sync before checking
> > summary counts instead of doing this, but I bet there will be some users
> > who will be surprised that xfs now has *trfs df logic.
> 
> If fstests needs accurate counters, it should use 'df --sync' and we
> should make sure that the syncfs implementation triggers a flush of
> inactive inodes before it returns. We don't have to guarantee that
> the inactivation is on stable storage, but it would make sense that
> we trigger background ops to get the free space accounting near to
> accurate...

<nod> Well let's see how it does overnight...

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/9] xfs: refactor the inode recycling code
  2021-03-26  0:21 [PATCHSET v5 0/9] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-03-26  0:21 ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2021-03-26  0:21 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, hch

From: Darrick J. Wong <djwong@kernel.org>

Hoist the code in xfs_iget_cache_hit that restores the VFS inode state
to an xfs_inode that was previously vfs-destroyed.  The next patch will
add a new set of state flags, so we need the helper to avoid
duplication.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |  144 +++++++++++++++++++++++++++++----------------------
 1 file changed, 83 insertions(+), 61 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4c124bc98f39..79f61a7f40b2 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -286,19 +286,19 @@ xfs_inew_wait(
  * need to retain across reinitialisation, and rewrite them into the VFS inode
  * after reinitialisation even if it fails.
  */
-static int
+static inline int
 xfs_reinit_inode(
 	struct xfs_mount	*mp,
 	struct inode		*inode)
 {
-	int		error;
-	uint32_t	nlink = inode->i_nlink;
-	uint32_t	generation = inode->i_generation;
-	uint64_t	version = inode_peek_iversion(inode);
-	umode_t		mode = inode->i_mode;
-	dev_t		dev = inode->i_rdev;
-	kuid_t		uid = inode->i_uid;
-	kgid_t		gid = inode->i_gid;
+	int			error;
+	uint32_t		nlink = inode->i_nlink;
+	uint32_t		generation = inode->i_generation;
+	uint64_t		version = inode_peek_iversion(inode);
+	umode_t			mode = inode->i_mode;
+	dev_t			dev = inode->i_rdev;
+	kuid_t			uid = inode->i_uid;
+	kgid_t			gid = inode->i_gid;
 
 	error = inode_init_always(mp->m_super, inode);
 
@@ -312,6 +312,73 @@ xfs_reinit_inode(
 	return error;
 }
 
+/*
+ * Carefully nudge an inode whose VFS state has been torn down back into a
+ * usable state.
+ */
+static int
+xfs_iget_recycle(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct inode		*inode = VFS_I(ip);
+	int			error;
+
+	/*
+	 * We need to make it look like the inode is being reclaimed to prevent
+	 * the actual reclaim workers from stomping over us while we recycle
+	 * the inode.  We can't clear the radix tree tag yet as it requires
+	 * pag_ici_lock to be held exclusive.
+	 */
+	ip->i_flags |= XFS_IRECLAIM;
+
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
+
+	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
+	error = xfs_reinit_inode(mp, inode);
+	if (error) {
+		bool		wake;
+
+		/*
+		 * Re-initializing the inode failed, and we are in deep
+		 * trouble.  Try to re-add it to the inactive list.
+		 */
+		rcu_read_lock();
+		spin_lock(&ip->i_flags_lock);
+		wake = !!__xfs_iflags_test(ip, XFS_INEW);
+		ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
+		if (wake)
+			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
+		spin_unlock(&ip->i_flags_lock);
+		rcu_read_unlock();
+		return error;
+	}
+
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	/*
+	 * Clear the per-lifetime state in the inode as we are now effectively
+	 * a new inode and need to return to the initial state before reuse
+	 * occurs.
+	 */
+	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
+	ip->i_flags |= XFS_INEW;
+	xfs_perag_clear_ici_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
+			XFS_ICI_RECLAIM_TAG);
+	inode->i_state = I_NEW;
+
+	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
+	init_rwsem(&inode->i_rwsem);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+
+	return 0;
+}
+
 /*
  * If we are allocating a new inode, then check what was returned is
  * actually a free, empty inode. If we are not allocating an inode,
@@ -386,7 +453,7 @@ xfs_iget_cache_hit(
 	/*
 	 * If we are racing with another cache hit that is currently
 	 * instantiating this inode or currently recycling it out of
-	 * reclaimabe state, wait for the initialisation to complete
+	 * reclaimable state, wait for the initialisation to complete
 	 * before continuing.
 	 *
 	 * XXX(hch): eventually we should do something equivalent to
@@ -408,11 +475,11 @@ xfs_iget_cache_hit(
 	if (error)
 		goto out_error;
 
-	/*
-	 * If IRECLAIMABLE is set, we've torn down the VFS inode already.
-	 * Need to carefully get it back into useable state.
-	 */
 	if (ip->i_flags & XFS_IRECLAIMABLE) {
+		/*
+		 * If IRECLAIMABLE is set, we've torn down the VFS inode
+		 * already, and must carefully restore it to usable state.
+		 */
 		trace_xfs_iget_reclaim(ip);
 
 		if (flags & XFS_IGET_INCORE) {
@@ -420,55 +487,11 @@ xfs_iget_cache_hit(
 			goto out_error;
 		}
 
-		/*
-		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
-		 * from stomping over us while we recycle the inode.  We can't
-		 * clear the radix tree reclaimable tag yet as it requires
-		 * pag_ici_lock to be held exclusive.
-		 */
-		ip->i_flags |= XFS_IRECLAIM;
-
-		spin_unlock(&ip->i_flags_lock);
-		rcu_read_unlock();
-
-		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-		error = xfs_reinit_inode(mp, inode);
+		error = xfs_iget_recycle(pag, ip);
 		if (error) {
-			bool wake;
-			/*
-			 * Re-initializing the inode failed, and we are in deep
-			 * trouble.  Try to re-add it to the reclaim list.
-			 */
-			rcu_read_lock();
-			spin_lock(&ip->i_flags_lock);
-			wake = !!__xfs_iflags_test(ip, XFS_INEW);
-			ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
-			if (wake)
-				wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
-			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
 			trace_xfs_iget_reclaim_fail(ip);
-			goto out_error;
+			return error;
 		}
-
-		spin_lock(&pag->pag_ici_lock);
-		spin_lock(&ip->i_flags_lock);
-
-		/*
-		 * Clear the per-lifetime state in the inode as we are now
-		 * effectively a new inode and need to return to the initial
-		 * state before reuse occurs.
-		 */
-		ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
-		ip->i_flags |= XFS_INEW;
-		xfs_perag_clear_ici_tag(pag,
-				XFS_INO_TO_AGINO(pag->pag_mount, ino),
-				XFS_ICI_RECLAIM_TAG);
-		inode->i_state = I_NEW;
-		ip->i_sick = 0;
-		ip->i_checked = 0;
-
-		spin_unlock(&ip->i_flags_lock);
-		spin_unlock(&pag->pag_ici_lock);
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -498,7 +521,6 @@ xfs_iget_cache_hit(
 	return error;
 }
 
-
 static int
 xfs_iget_cache_miss(
 	struct xfs_mount	*mp,


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-06-09  1:28 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-07 22:24 [PATCHSET v6 0/9] xfs: deferred inode inactivation Darrick J. Wong
2021-06-07 22:24 ` [PATCH 1/9] xfs: refactor the inode recycling code Darrick J. Wong
2021-06-07 22:59   ` Dave Chinner
2021-06-08  0:14     ` Darrick J. Wong
2021-06-07 22:25 ` [PATCH 2/9] xfs: deferred inode inactivation Darrick J. Wong
2021-06-08  0:57   ` Dave Chinner
2021-06-08  4:40     ` Darrick J. Wong
2021-06-09  1:01       ` Dave Chinner
2021-06-09  1:28         ` Darrick J. Wong
2021-06-07 22:25 ` [PATCH 3/9] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
2021-06-08  1:09   ` Dave Chinner
2021-06-08  2:02     ` Darrick J. Wong
2021-06-07 22:25 ` [PATCH 4/9] xfs: force inode inactivation and retry fs writes when there isn't space Darrick J. Wong
2021-06-07 22:25 ` [PATCH 5/9] xfs: force inode garbage collection before fallocate when space is low Darrick J. Wong
2021-06-08  1:26   ` Dave Chinner
2021-06-08 11:48     ` Brian Foster
2021-06-08 15:32       ` Darrick J. Wong
2021-06-08 16:06         ` Brian Foster
2021-06-08 21:55         ` Dave Chinner
2021-06-09  0:25           ` Darrick J. Wong
2021-06-07 22:25 ` [PATCH 6/9] xfs: parallelize inode inactivation Darrick J. Wong
2021-06-07 22:25 ` [PATCH 7/9] xfs: create a polled function to force " Darrick J. Wong
2021-06-07 22:25 ` [PATCH 8/9] xfs: don't run speculative preallocation gc when fs is frozen Darrick J. Wong
2021-06-07 22:25 ` [PATCH 9/9] xfs: avoid buffer deadlocks when walking fs inodes Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2021-03-26  0:21 [PATCHSET v5 0/9] xfs: deferred inode inactivation Darrick J. Wong
2021-03-26  0:21 ` [PATCH 1/9] xfs: refactor the inode recycling code Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.