* [PATCHSET v3 00/11] xfs: deferred inode inactivation
@ 2021-03-11  3:05 Darrick J. Wong
  2021-03-11  3:05 ` [PATCH 01/11] xfs: prevent metadata files from being inactivated Darrick J. Wong
                   ` (10 more replies)
  0 siblings, 11 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:05 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This patch series implements deferred inode inactivation.  Inactivation
is what happens when an open file loses its last incore reference: if
the file has speculative preallocations, they must be freed, and if the
file is unlinked, all forks must be truncated, and the inode marked
freed in the inode chunk and the inode btrees.

Currently, all of this activity is performed in frontend threads when
the last in-memory reference is lost and/or the vfs decides to drop the
inode.  Three complaints stem from this behavior: first, that the time
to unlink (in the worst case) depends on both the complexity of the
directory and the number of extents in that file; second, that
deleting a directory tree is inefficient and seeky because we free
the inodes in readdir order, not disk order; and third, that the upcoming
online repair feature needs to be able to xfs_irele while scanning a
filesystem in transaction context.  It cannot perform inode inactivation
in this context because xfs does not support nested transactions.

The implementation will be familiar to those who have studied how XFS
scans for reclaimable in-core inodes -- we create a couple more inode
state flags to mark an inode as needing inactivation and being in the
middle of inactivation.  When inodes need inactivation, we set
NEED_INACTIVE in iflags, set the INACTIVE radix tree tag, and schedule a
deferred work item.  The deferred worker runs in an unbounded workqueue,
scanning the inode radix tree for tagged inodes to inactivate, and
performing all the on-disk metadata updates.  Once the inode has been
inactivated, it is left in the reclaim state and the background reclaim
worker (or direct reclaim) will get to it eventually.

Doing the inactivations from kernel threads solves the first problem by
constraining the amount of work done by the unlink() call to removing
the directory entry.  It solves the third problem by moving inactivation
to a separate process.  Because the inactivations are done in order of
inode number, we solve the second problem by performing updates in (we
hope) disk order.  This also decreases the amount of time it takes to
let go of an inode cluster if we're deleting entire directory trees.
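
To make the ordering argument concrete, here is a minimal userspace
sketch.  It is not the kernel implementation: struct toy_pending,
toy_defer_inactivation() and toy_run_worker() are hypothetical names
invented for this illustration, and a sorted array stands in for the
tagged radix tree walk, which is what would yield inode number order in
the real series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_PENDING	64

/* Hypothetical stand-in for the set of inodes tagged for inactivation. */
struct toy_pending {
	uint64_t	inos[TOY_MAX_PENDING];
	size_t		count;
};

/* Frontend side: unlink() only records the inode number and returns. */
static int
toy_defer_inactivation(
	struct toy_pending	*p,
	uint64_t		ino)
{
	assert(p != NULL);
	if (p->count >= TOY_MAX_PENDING)
		return -1;
	p->inos[p->count++] = ino;
	return 0;
}

/* qsort comparator: ascending inode number. */
static int
toy_cmp_ino(
	const void	*a,
	const void	*b)
{
	uint64_t	x = *(const uint64_t *)a;
	uint64_t	y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Worker side: visit the pending inodes in ascending inode number order
 * and copy the visit order into @out.  Returns the number processed.
 */
static size_t
toy_run_worker(
	struct toy_pending	*p,
	uint64_t		*out)
{
	size_t			i, n;

	assert(p != NULL && out != NULL);
	n = p->count;
	qsort(p->inos, n, sizeof(p->inos[0]), toy_cmp_ino);
	for (i = 0; i < n; i++)
		out[i] = p->inos[i];
	p->count = 0;
	return n;
}
```

If the frontend defers inodes 131, 67 and 99 (readdir order), the
worker in this sketch visits them as 67, 99, 131.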

There are three big warts I can think of in this series: first, because
the actual freeing of nlink==0 inodes is now done in the background,
the system will be busy making metadata updates for some time after
the unlink() call returns.  This temporarily reduces
available iops.  Second, in order to retain the behavior that deleting
100TB of unshared data should result in a free space gain of 100TB, the
statvfs and quota reporting ioctls wait for inactivation to finish,
which increases the long tail latency of those calls.  This behavior is,
unfortunately, key to not introducing regressions in fstests.  The third
problem is that the deferrals keep memory usage higher for longer,
reduce opportunities to throttle the frontend when metadata load is
heavy, and the unbounded workqueues can create transaction storms.

The first patch prohibits automatic inactivation of metadata files.
This has been the source of subtle fs corruption problems in the past,
either due to growfs bugs or nlink incorrectly being set to zero.

The next four patches in the set perform prep work, refactoring
predicates and changing dquot behavior slightly to handle what comes
next.

The four patches after that shift the inactivation call paths over to
the background workqueue, and fix a few places where it was found to be
advantageous to force frontend threads to push and wait for inactivation
before making allocation decisions.

The final two patches improve the performance of inactivation by
enabling parallelization of the work and playing more nicely with vfs
callers who hold locks.

v1-v2: NYE patchbombs
v3: rebase against 5.12-rc2 for submission.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.13
---
 Documentation/admin-guide/xfs.rst |   14 +
 fs/xfs/libxfs/xfs_iext_tree.c     |    2 
 fs/xfs/scrub/common.c             |    2 
 fs/xfs/xfs_bmap_util.c            |  173 +++++++++----
 fs/xfs/xfs_bmap_util.h            |    1 
 fs/xfs/xfs_fsops.c                |    9 +
 fs/xfs/xfs_globals.c              |    3 
 fs/xfs/xfs_icache.c               |  512 ++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h               |   11 +
 fs/xfs/xfs_inode.c                |  112 ++++++++
 fs/xfs/xfs_inode.h                |   24 ++
 fs/xfs/xfs_linux.h                |    1 
 fs/xfs/xfs_log_recover.c          |    7 +
 fs/xfs/xfs_mount.c                |   16 +
 fs/xfs/xfs_mount.h                |   13 +
 fs/xfs/xfs_qm.c                   |   29 ++
 fs/xfs/xfs_qm.h                   |   17 +
 fs/xfs/xfs_qm_syscalls.c          |   20 +
 fs/xfs/xfs_super.c                |   61 ++++
 fs/xfs/xfs_sysctl.c               |    9 +
 fs/xfs/xfs_sysctl.h               |    1 
 fs/xfs/xfs_trace.h                |   15 +
 fs/xfs/xfs_xattr.c                |    2 
 23 files changed, 960 insertions(+), 94 deletions(-)



* [PATCH 01/11] xfs: prevent metadata files from being inactivated
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-03-11  3:05 ` Darrick J. Wong
  2021-03-11 13:05   ` Christoph Hellwig
  2021-03-22 23:13   ` Dave Chinner
  2021-03-11  3:05 ` [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks Darrick J. Wong
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:05 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Files containing metadata (quota records, rt bitmap and summary info)
are fully managed by the filesystem, which means that all resource
cleanup must be explicit, not automatic.  This means that they should
never be subjected to automatic post-eof truncation, nor should they be
freed automatically even if the link count drops to zero.

In other words, xfs_inactive() should leave these files alone.  Add the
necessary predicate functions to make this happen.  This adds a second
layer of prevention for the kind of fs corruption that was fixed by
commit f4c32e87de7d.  If we ever decide to support removing metadata
files, we should make all those metadata updates explicit.

Rearrange the order of #includes to fix compiler errors, since
xfs_mount.h is supposed to be included before xfs_inode.h.

Followup-to: f4c32e87de7d ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_iext_tree.c |    2 +-
 fs/xfs/xfs_inode.c            |    4 ++++
 fs/xfs/xfs_inode.h            |    8 ++++++++
 fs/xfs/xfs_xattr.c            |    2 ++
 4 files changed, 15 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index b4164256993d..773cf4349428 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -8,9 +8,9 @@
 #include "xfs_format.h"
 #include "xfs_bit.h"
 #include "xfs_log_format.h"
-#include "xfs_inode.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_inode.h"
 #include "xfs_trace.h"
 
 /*
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f93370bd7b1e..12c79962f8c3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1697,6 +1697,10 @@ xfs_inactive(
 	if (mp->m_flags & XFS_MOUNT_RDONLY)
 		return;
 
+	/* Metadata inodes require explicit resource cleanup. */
+	if (xfs_is_metadata_inode(ip))
+		return;
+
 	/* Try to clean out the cow blocks if there are any. */
 	if (xfs_inode_has_cow_data(ip))
 		xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 88ee4c3930ae..c2c26f8f4a81 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -185,6 +185,14 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
 	return ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
 }
 
+static inline bool xfs_is_metadata_inode(struct xfs_inode *ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	return ip == mp->m_rbmip || ip == mp->m_rsumip ||
+		xfs_is_quota_inode(&mp->m_sb, ip->i_ino);
+}
+
 /*
  * Check if an inode has any data in the COW fork.  This might be often false
  * even for inodes with the reflink flag when there is no pending COW operation.
diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
index 12be32f66dc1..0d050f8829ef 100644
--- a/fs/xfs/xfs_xattr.c
+++ b/fs/xfs/xfs_xattr.c
@@ -9,6 +9,8 @@
 #include "xfs_format.h"
 #include "xfs_log_format.h"
 #include "xfs_da_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_attr.h"
 #include "xfs_acl.h"



* [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
  2021-03-11  3:05 ` [PATCH 01/11] xfs: prevent metadata files from being inactivated Darrick J. Wong
@ 2021-03-11  3:05 ` Darrick J. Wong
  2021-03-11 13:09   ` Christoph Hellwig
  2021-03-15 18:46   ` Christoph Hellwig
  2021-03-11  3:05 ` [PATCH 03/11] xfs: don't reclaim dquots with incore reservations Darrick J. Wong
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:05 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Refactor the part of _free_eofblocks that decides if it's really going
to truncate post-EOF blocks into a separate helper function.  The
upcoming deferred inode inactivation patch requires us to be able to
decide this prior to actual inactivation.  No functionality changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  129 ++++++++++++++++++++++++++++--------------------
 fs/xfs/xfs_bmap_util.h |    1 
 2 files changed, 76 insertions(+), 54 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e7d68318e6a5..21aa38183ae9 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -628,27 +628,23 @@ xfs_can_free_eofblocks(struct xfs_inode *ip, bool force)
 }
 
 /*
- * This is called to free any blocks beyond eof. The caller must hold
- * IOLOCK_EXCL unless we are in the inode reclaim path and have the only
- * reference to the inode.
+ * Decide if this inode has post-EOF blocks.  The caller is responsible
+ * for knowing / caring about the PREALLOC/APPEND flags.
  */
 int
-xfs_free_eofblocks(
-	struct xfs_inode	*ip)
+xfs_has_eofblocks(
+	struct xfs_inode	*ip,
+	bool			*has)
 {
-	struct xfs_trans	*tp;
-	int			error;
+	struct xfs_bmbt_irec	imap;
+	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		end_fsb;
 	xfs_fileoff_t		last_fsb;
 	xfs_filblks_t		map_len;
 	int			nimaps;
-	struct xfs_bmbt_irec	imap;
-	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
-	/*
-	 * Figure out if there are any blocks beyond the end
-	 * of the file.  If not, then there is nothing to do.
-	 */
+	*has = false;
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
 	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
 	if (last_fsb <= end_fsb)
@@ -660,55 +656,80 @@ xfs_free_eofblocks(
 	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
+	if (error || nimaps == 0)
+		return error;
+
+	*has = imap.br_startblock != HOLESTARTBLOCK || ip->i_delayed_blks;
+	return 0;
+}
+
+/*
+ * This is called to free any blocks beyond eof. The caller must hold
+ * IOLOCK_EXCL unless we are in the inode reclaim path and have the only
+ * reference to the inode.
+ */
+int
+xfs_free_eofblocks(
+	struct xfs_inode	*ip)
+{
+	struct xfs_trans	*tp;
+	struct xfs_mount	*mp = ip->i_mount;
+	bool			has;
+	int			error;
+
 	/*
 	 * If there are blocks after the end of file, truncate the file to its
 	 * current size to free them up.
 	 */
-	if (!error && (nimaps != 0) &&
-	    (imap.br_startblock != HOLESTARTBLOCK ||
-	     ip->i_delayed_blks)) {
-		/*
-		 * Attach the dquots to the inode up front.
-		 */
-		error = xfs_qm_dqattach(ip);
-		if (error)
-			return error;
+	error = xfs_has_eofblocks(ip, &has);
+	if (error || !has)
+		return error;
 
-		/* wait on dio to ensure i_size has settled */
-		inode_dio_wait(VFS_I(ip));
+	/*
+	 * Attach the dquots to the inode up front.
+	 */
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return error;
 
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0,
-				&tp);
-		if (error) {
-			ASSERT(XFS_FORCED_SHUTDOWN(mp));
-			return error;
-		}
+	/* wait on dio to ensure i_size has settled */
+	inode_dio_wait(VFS_I(ip));
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		xfs_trans_ijoin(tp, ip, 0);
-
-		/*
-		 * Do not update the on-disk file size.  If we update the
-		 * on-disk file size and then the system crashes before the
-		 * contents of the file are flushed to disk then the files
-		 * may be full of holes (ie NULL files bug).
-		 */
-		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
-					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
-		if (error) {
-			/*
-			 * If we get an error at this point we simply don't
-			 * bother truncating the file.
-			 */
-			xfs_trans_cancel(tp);
-		} else {
-			error = xfs_trans_commit(tp);
-			if (!error)
-				xfs_inode_clear_eofblocks_tag(ip);
-		}
-
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error) {
+		ASSERT(XFS_FORCED_SHUTDOWN(mp));
+		return error;
 	}
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/*
+	 * Do not update the on-disk file size.  If we update the
+	 * on-disk file size and then the system crashes before the
+	 * contents of the file are flushed to disk then the files
+	 * may be full of holes (ie NULL files bug).
+	 */
+	error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
+				XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
+	if (error)
+		goto err_cancel;
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_unlock;
+
+	xfs_inode_clear_eofblocks_tag(ip);
+	goto out_unlock;
+
+err_cancel:
+	/*
+	 * If we get an error at this point we simply don't
+	 * bother truncating the file.
+	 */
+	xfs_trans_cancel(tp);
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 9f993168b55b..af07a4a20d7c 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -63,6 +63,7 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
 
 /* EOF block manipulation functions */
+int	xfs_has_eofblocks(struct xfs_inode *ip, bool *has);
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
 int	xfs_free_eofblocks(struct xfs_inode *ip);
 



* [PATCH 03/11] xfs: don't reclaim dquots with incore reservations
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
  2021-03-11  3:05 ` [PATCH 01/11] xfs: prevent metadata files from being inactivated Darrick J. Wong
  2021-03-11  3:05 ` [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks Darrick J. Wong
@ 2021-03-11  3:05 ` Darrick J. Wong
  2021-03-15 18:29   ` Christoph Hellwig
  2021-03-22 23:31   ` Dave Chinner
  2021-03-11  3:06 ` [PATCH 04/11] xfs: decide if inode needs inactivation Darrick J. Wong
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:05 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If a dquot has an incore reservation that exceeds the ondisk count, it
by definition has active incore state and must not be reclaimed.  Up to
this point every inode with an incore dquot reservation has always
retained a reference to the dquot so it was never possible for
xfs_qm_dquot_isolate to be called on a dquot with active state and zero
refcount, but this will soon change.

Deferred inode inactivation is about to reorganize how inodes are
inactivated by shunting all that work to a background workqueue.  In
order to avoid deadlocks with the quotaoff inode scan and reduce overall
memory requirements (since inodes can spend a lot of time waiting for
inactivation), inactive inodes will drop their dquot references while
they're waiting to be inactivated.

However, inactive inodes can have delalloc extents in the data fork or
any extents in the CoW fork.  Either of these contributes to the dquot's
incore reservation being larger than the resource count (i.e. they're
the reason the dquot still has active incore state), so we cannot allow
the dquot to be reclaimed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_qm.c |   29 ++++++++++++++++++++++++-----
 fs/xfs/xfs_qm.h |   17 +++++++++++++++++
 2 files changed, 41 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index bfa4164990b1..b3ce04dec181 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -166,9 +166,14 @@ xfs_qm_dqpurge(
 
 	/*
 	 * We move dquots to the freelist as soon as their reference count
-	 * hits zero, so it really should be on the freelist here.
+	 * hits zero, so it really should be on the freelist here.  If we're
+	 * running quotaoff, it's possible that we're purging a zero-refcount
+	 * dquot with active incore reservation because there are inodes
+	 * awaiting inactivation.  Dquots in this state will not be on the LRU
+	 * but it's quotaoff, so we don't care.
 	 */
-	ASSERT(!list_empty(&dqp->q_lru));
+	ASSERT(!(mp->m_qflags & xfs_quota_active_flag(xfs_dquot_type(dqp))) ||
+	       !list_empty(&dqp->q_lru));
 	list_lru_del(&qi->qi_lru, &dqp->q_lru);
 	XFS_STATS_DEC(mp, xs_qm_dquot_unused);
 
@@ -411,6 +416,15 @@ struct xfs_qm_isolate {
 	struct list_head	dispose;
 };
 
+static inline bool
+xfs_dquot_has_incore_resv(
+	struct xfs_dquot	*dqp)
+{
+	return  dqp->q_blk.reserved > dqp->q_blk.count ||
+		dqp->q_ino.reserved > dqp->q_ino.count ||
+		dqp->q_rtb.reserved > dqp->q_rtb.count;
+}
+
 static enum lru_status
 xfs_qm_dquot_isolate(
 	struct list_head	*item,
@@ -427,10 +441,15 @@ xfs_qm_dquot_isolate(
 		goto out_miss_busy;
 
 	/*
-	 * This dquot has acquired a reference in the meantime remove it from
-	 * the freelist and try again.
+	 * Either this dquot has incore reservations or it has acquired a
+	 * reference.  Remove it from the freelist and try again.
+	 *
+	 * Inodes tagged for inactivation drop their dquot references to avoid
+	 * deadlocks with quotaoff.  If these inodes have delalloc reservations
+	 * in the data fork or any extents in the CoW fork, these contribute
+	 * to the dquot's incore block reservation exceeding the count.
 	 */
-	if (dqp->q_nrefs) {
+	if (xfs_dquot_has_incore_resv(dqp) || dqp->q_nrefs) {
 		xfs_dqunlock(dqp);
 		XFS_STATS_INC(dqp->q_mount, xs_qm_dqwants);
 
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index e3dabab44097..78f90935e91e 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -105,6 +105,23 @@ xfs_quota_inode(struct xfs_mount *mp, xfs_dqtype_t type)
 	return NULL;
 }
 
+static inline unsigned int
+xfs_quota_active_flag(
+	xfs_dqtype_t		type)
+{
+	switch (type) {
+	case XFS_DQTYPE_USER:
+		return XFS_UQUOTA_ACTIVE;
+	case XFS_DQTYPE_GROUP:
+		return XFS_GQUOTA_ACTIVE;
+	case XFS_DQTYPE_PROJ:
+		return XFS_PQUOTA_ACTIVE;
+	default:
+		ASSERT(0);
+	}
+	return 0;
+}
+
 extern void	xfs_trans_mod_dquot(struct xfs_trans *tp, struct xfs_dquot *dqp,
 				    uint field, int64_t delta);
 extern void	xfs_trans_dqjoin(struct xfs_trans *, struct xfs_dquot *);



* [PATCH 04/11] xfs: decide if inode needs inactivation
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-03-11  3:05 ` [PATCH 03/11] xfs: don't reclaim dquots with incore reservations Darrick J. Wong
@ 2021-03-11  3:06 ` Darrick J. Wong
  2021-03-15 18:47   ` Christoph Hellwig
  2021-03-11  3:06 ` [PATCH 05/11] xfs: rename the blockgc workqueue Darrick J. Wong
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a predicate function to decide if an inode needs (deferred)
inactivation.  Any file that has been unlinked or has speculative
preallocations either for post-EOF writes or for CoW qualifies.
This function will also be used by the upcoming deferred inactivation
patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_inode.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h |    2 ++
 2 files changed, 65 insertions(+)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 12c79962f8c3..65897cb0cf2a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1665,6 +1665,69 @@ xfs_inactive_ifree(
 	return 0;
 }
 
+/*
+ * Returns true if we need to update the on-disk metadata before we can free
+ * the memory used by this inode.  Updates include freeing post-eof
+ * preallocations; freeing COW staging extents; and marking the inode free in
+ * the inobt if it is on the unlinked list.
+ */
+bool
+xfs_inode_needs_inactivation(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*cow_ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+
+	/*
+	 * If the inode is already free, then there can be nothing
+	 * to clean up here.
+	 */
+	if (VFS_I(ip)->i_mode == 0)
+		return false;
+
+	/* If this is a read-only mount, don't do this (would generate I/O) */
+	if (mp->m_flags & XFS_MOUNT_RDONLY)
+		return false;
+
+	/* Metadata inodes require explicit resource cleanup. */
+	if (xfs_is_metadata_inode(ip))
+		return false;
+
+	/* CoW staging extents must be cancelled during inactivation. */
+	if (cow_ifp && cow_ifp->if_bytes > 0)
+		return true;
+
+	if (VFS_I(ip)->i_nlink != 0) {
+		int	error;
+		bool	has;
+
+		/*
+		 * force is true because we are evicting an inode from the
+		 * cache. Post-eof blocks must be freed, lest we end up with
+		 * broken free space accounting.
+		 *
+		 * Note: don't bother with iolock here since lockdep complains
+		 * about acquiring it in reclaim context. We have the only
+		 * reference to the inode at this point anyways.
+		 *
+		 * If the predicate errors out, send the inode through
+		 * inactivation anyway, because that's what we did before.
+		 * The inactivation worker will ignore an inode that doesn't
+		 * actually need it.
+		 */
+		if (!xfs_can_free_eofblocks(ip, true))
+			return false;
+		error = xfs_has_eofblocks(ip, &has);
+		return error != 0 || has;
+	}
+
+	/*
+	 * Link count dropped to zero, which means we have to mark the inode
+	 * free on disk and remove it from the AGI unlinked list.
+	 */
+	return true;
+}
+
 /*
  * xfs_inactive
  *
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c2c26f8f4a81..3fe8c8afbc72 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -480,6 +480,8 @@ extern struct kmem_zone	*xfs_inode_zone;
 /* The default CoW extent size hint. */
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
+bool xfs_inode_needs_inactivation(struct xfs_inode *ip);
+
 int xfs_iunlink_init(struct xfs_perag *pag);
 void xfs_iunlink_destroy(struct xfs_perag *pag);
 



* [PATCH 05/11] xfs: rename the blockgc workqueue
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-03-11  3:06 ` [PATCH 04/11] xfs: decide if inode needs inactivation Darrick J. Wong
@ 2021-03-11  3:06 ` Darrick J. Wong
  2021-03-15 18:49   ` Christoph Hellwig
  2021-03-11  3:06 ` [PATCH 06/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Since we're about to start using the blockgc workqueue to dispose of
inactivated inodes, strip the "block" prefix from the name; now it's
merely the general garbage collection (gc) workqueue.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/admin-guide/xfs.rst |    2 +-
 fs/xfs/xfs_icache.c               |    2 +-
 fs/xfs/xfs_mount.h                |    2 +-
 fs/xfs/xfs_super.c                |    8 ++++----
 4 files changed, 7 insertions(+), 7 deletions(-)


diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index 5422407a96d7..8de008c0c5ad 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -522,7 +522,7 @@ and the short name of the data device.  They all can be found in:
 ================  ===========
   xfs_iwalk-$pid  Inode scans of the entire filesystem. Currently limited to
                   mount time quotacheck.
-  xfs-blockgc     Background garbage collection of disk space that have been
+  xfs-gc          Background garbage collection of disk space that have been
                   speculatively allocated beyond EOF or for staging copy on
                   write operations.
 ================  ===========
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 1d7720a0c068..e6a62f765422 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1335,7 +1335,7 @@ xfs_blockgc_queue(
 {
 	rcu_read_lock();
 	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG))
-		queue_delayed_work(pag->pag_mount->m_blockgc_workqueue,
+		queue_delayed_work(pag->pag_mount->m_gc_workqueue,
 				   &pag->pag_blockgc_work,
 				   msecs_to_jiffies(xfs_blockgc_secs * 1000));
 	rcu_read_unlock();
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 659ad95fe3e0..81829d19596e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -93,7 +93,7 @@ typedef struct xfs_mount {
 	struct workqueue_struct	*m_unwritten_workqueue;
 	struct workqueue_struct	*m_cil_workqueue;
 	struct workqueue_struct	*m_reclaim_workqueue;
-	struct workqueue_struct *m_blockgc_workqueue;
+	struct workqueue_struct *m_gc_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
 
 	int			m_bsize;	/* fs logical block size */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e5e0713bebcd..e774358383d6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -519,10 +519,10 @@ xfs_init_mount_workqueues(
 	if (!mp->m_reclaim_workqueue)
 		goto out_destroy_cil;
 
-	mp->m_blockgc_workqueue = alloc_workqueue("xfs-blockgc/%s",
+	mp->m_gc_workqueue = alloc_workqueue("xfs-gc/%s",
 			WQ_SYSFS | WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM,
 			0, mp->m_super->s_id);
-	if (!mp->m_blockgc_workqueue)
+	if (!mp->m_gc_workqueue)
 		goto out_destroy_reclaim;
 
 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
@@ -533,7 +533,7 @@ xfs_init_mount_workqueues(
 	return 0;
 
 out_destroy_eofb:
-	destroy_workqueue(mp->m_blockgc_workqueue);
+	destroy_workqueue(mp->m_gc_workqueue);
 out_destroy_reclaim:
 	destroy_workqueue(mp->m_reclaim_workqueue);
 out_destroy_cil:
@@ -551,7 +551,7 @@ xfs_destroy_mount_workqueues(
 	struct xfs_mount	*mp)
 {
 	destroy_workqueue(mp->m_sync_workqueue);
-	destroy_workqueue(mp->m_blockgc_workqueue);
+	destroy_workqueue(mp->m_gc_workqueue);
 	destroy_workqueue(mp->m_reclaim_workqueue);
 	destroy_workqueue(mp->m_cil_workqueue);
 	destroy_workqueue(mp->m_unwritten_workqueue);



* [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-03-11  3:06 ` [PATCH 05/11] xfs: rename the blockgc workqueue Darrick J. Wong
@ 2021-03-11  3:06 ` Darrick J. Wong
  2021-03-16  7:27   ` Christoph Hellwig
  2021-03-23  1:44   ` Dave Chinner
  2021-03-11  3:06 ` [PATCH 07/11] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
defer the inactivation phase to a separate workqueue.  With this we
avoid blocking memory reclaim on filesystem metadata updates that are
necessary to free an in-core inode, such as post-eof block freeing, COW
staging extent freeing, and truncating and freeing unlinked inodes.  Now
that work is deferred to a workqueue where we can do the freeing in
batches.

We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
The first flag helps our worker find inodes needing inactivation, and
the second flag marks inodes that are in the process of being
inactivated.  A concurrent xfs_iget on the inode can still resurrect the
inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
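
The interaction between the two flags can be modelled in userspace
roughly as follows.  This is only a sketch: the TOY_* flags and toy_*
functions are hypothetical, locking is omitted entirely, and whether
the worker keeps NEEDS_INACTIVE set while INACTIVATING is set is an
assumption of the sketch rather than something stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_NEEDS_INACTIVE	(1u << 0)
#define TOY_INACTIVATING	(1u << 1)

struct toy_inode {
	unsigned int	flags;
};

/* Last reference dropped: queue the inode for background inactivation. */
static void
toy_mark_needs_inactive(
	struct toy_inode	*ip)
{
	assert(!(ip->flags & TOY_INACTIVATING));
	ip->flags |= TOY_NEEDS_INACTIVE;
}

/* Worker: claim a queued inode.  Returns true if the claim succeeded. */
static bool
toy_worker_claim(
	struct toy_inode	*ip)
{
	if (!(ip->flags & TOY_NEEDS_INACTIVE) ||
	    (ip->flags & TOY_INACTIVATING))
		return false;
	ip->flags |= TOY_INACTIVATING;
	return true;
}

/* Worker: inactivation is complete, so drop both flags. */
static void
toy_worker_finish(
	struct toy_inode	*ip)
{
	assert(ip->flags & TOY_INACTIVATING);
	ip->flags &= ~(TOY_NEEDS_INACTIVE | TOY_INACTIVATING);
}

/*
 * Lookup: resurrect a queued inode by clearing NEEDS_INACTIVE, or bail
 * out with -EAGAIN if the worker is already inactivating it.
 */
static int
toy_iget(
	struct toy_inode	*ip)
{
	if (ip->flags & TOY_INACTIVATING)
		return -EAGAIN;
	ip->flags &= ~TOY_NEEDS_INACTIVE;
	return 0;
}
```

In this model a lookup that arrives before the worker claims the inode
wins and the worker later finds nothing to do; a lookup that arrives
after the claim has to back off and retry.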

Unfortunately, deferring the inactivation has one huge downside --
eventual consistency.  Since all the freeing is deferred to a worker
thread, one can rm a file but the space doesn't come back immediately.
This can cause some odd side effects with quota accounting and statfs,
so we also force inactivation scans in order to maintain the existing
behaviors, at least outwardly.

For this patch we'll set the delay to zero to mimic the old timing as
much as possible; in the next patch we'll play with different delay
settings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/admin-guide/xfs.rst |    3 
 fs/xfs/scrub/common.c             |    2 
 fs/xfs/xfs_fsops.c                |    9 +
 fs/xfs/xfs_icache.c               |  436 ++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h               |    9 +
 fs/xfs/xfs_inode.c                |   45 +++-
 fs/xfs/xfs_inode.h                |   14 +
 fs/xfs/xfs_log_recover.c          |    7 +
 fs/xfs/xfs_mount.c                |   13 +
 fs/xfs/xfs_mount.h                |    4 
 fs/xfs/xfs_qm_syscalls.c          |   20 ++
 fs/xfs/xfs_super.c                |   53 ++++
 fs/xfs/xfs_trace.h                |   15 +
 13 files changed, 604 insertions(+), 26 deletions(-)


diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index 8de008c0c5ad..f9b109bfc6a6 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -524,7 +524,8 @@ and the short name of the data device.  They all can be found in:
                   mount time quotacheck.
   xfs-gc          Background garbage collection of disk space that have been
                   speculatively allocated beyond EOF or for staging copy on
-                  write operations.
+                  write operations; and files that are no longer linked into
+                  the directory tree.
 ================  ===========
 
 For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index da60e7d1f895..8bc824515e0b 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -886,6 +886,7 @@ xchk_stop_reaping(
 {
 	sc->flags |= XCHK_REAPING_DISABLED;
 	xfs_blockgc_stop(sc->mp);
+	xfs_inodegc_stop(sc->mp);
 }
 
 /* Restart background reaping of resources. */
@@ -893,6 +894,7 @@ void
 xchk_start_reaping(
 	struct xfs_scrub	*sc)
 {
+	xfs_inodegc_start(sc->mp);
 	xfs_blockgc_start(sc->mp);
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index a2a407039227..3a3baf56198b 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -19,6 +19,8 @@
 #include "xfs_log.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
 
 /*
  * growfs operations
@@ -290,6 +292,13 @@ xfs_fs_counts(
 	xfs_mount_t		*mp,
 	xfs_fsop_counts_t	*cnt)
 {
+	/*
+	 * Process all the queued file and speculative preallocation cleanup so
+	 * that the counter values we report here do not incorporate any
+	 * resources that were previously deleted.
+	 */
+	xfs_inodegc_force(mp);
+
 	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
 	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
 	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e6a62f765422..1b7652af5ee5 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -195,6 +195,18 @@ xfs_perag_clear_reclaim_tag(
 	trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
 }
 
+static void
+__xfs_inode_set_reclaim_tag(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
+			   XFS_ICI_RECLAIM_TAG);
+	xfs_perag_set_reclaim_tag(pag);
+	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+}
 
 /*
  * We set the inode flag atomically with the radix tree tag.
@@ -212,10 +224,7 @@ xfs_inode_set_reclaim_tag(
 	spin_lock(&pag->pag_ici_lock);
 	spin_lock(&ip->i_flags_lock);
 
-	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			   XFS_ICI_RECLAIM_TAG);
-	xfs_perag_set_reclaim_tag(pag);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+	__xfs_inode_set_reclaim_tag(pag, ip);
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -233,6 +242,94 @@ xfs_inode_clear_reclaim_tag(
 	xfs_perag_clear_reclaim_tag(pag);
 }
 
+/* Queue a new inode gc pass if there are inodes needing inactivation. */
+static void
+xfs_inodegc_queue(
+	struct xfs_mount        *mp)
+{
+	rcu_read_lock();
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG))
+		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
+				2 * HZ);
+	rcu_read_unlock();
+}
+
+/* Remember that an AG has one more inode to inactivate. */
+static void
+xfs_perag_set_inactive_tag(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+
+	lockdep_assert_held(&pag->pag_ici_lock);
+	if (pag->pag_ici_inactive++)
+		return;
+
+	/* propagate the inactive tag up into the perag radix tree */
+	spin_lock(&mp->m_perag_lock);
+	radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno,
+			   XFS_ICI_INACTIVE_TAG);
+	spin_unlock(&mp->m_perag_lock);
+
+	/* schedule periodic background inode inactivation */
+	xfs_inodegc_queue(mp);
+
+	trace_xfs_perag_set_inactive(mp, pag->pag_agno, -1, _RET_IP_);
+}
+
+/* Set this inode's inactive tag and set the per-AG tag. */
+void
+xfs_inode_set_inactive_tag(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
+				   XFS_ICI_INACTIVE_TAG);
+	xfs_perag_set_inactive_tag(pag);
+	__xfs_iflags_set(ip, XFS_NEED_INACTIVE);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
+}
+
+/* Remember that an AG has one less inode to inactivate. */
+static void
+xfs_perag_clear_inactive_tag(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+
+	lockdep_assert_held(&pag->pag_ici_lock);
+	if (--pag->pag_ici_inactive)
+		return;
+
+	/* clear the inactive tag from the perag radix tree */
+	spin_lock(&mp->m_perag_lock);
+	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno,
+			     XFS_ICI_INACTIVE_TAG);
+	spin_unlock(&mp->m_perag_lock);
+	trace_xfs_perag_clear_inactive(mp, pag->pag_agno, -1, _RET_IP_);
+}
+
+/* Clear this inode's inactive tag and try to clear the AG's. */
+STATIC void
+xfs_inode_clear_inactive_tag(
+	struct xfs_perag	*pag,
+	xfs_ino_t		ino)
+{
+	radix_tree_tag_clear(&pag->pag_ici_root,
+			     XFS_INO_TO_AGINO(pag->pag_mount, ino),
+			     XFS_ICI_INACTIVE_TAG);
+	xfs_perag_clear_inactive_tag(pag);
+}
+
 static void
 xfs_inew_wait(
 	struct xfs_inode	*ip)
@@ -298,6 +395,13 @@ xfs_iget_check_free_state(
 	struct xfs_inode	*ip,
 	int			flags)
 {
+	/*
+	 * Unlinked inodes awaiting inactivation must not be reused until we
+	 * have a chance to clear the on-disk metadata.
+	 */
+	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
+		return -ENOENT;
+
 	if (flags & XFS_IGET_CREATE) {
 		/* should be a free inode */
 		if (VFS_I(ip)->i_mode != 0) {
@@ -323,6 +427,67 @@ xfs_iget_check_free_state(
 	return 0;
 }
 
+/*
+ * We've torn down the VFS part of this NEED_INACTIVE inode, so we need to get
+ * it back into working state.
+ */
+static int
+xfs_iget_inactive(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct inode		*inode = VFS_I(ip);
+	int			error;
+
+	error = xfs_reinit_inode(mp, inode);
+	if (error) {
+		bool wake;
+		/*
+		 * Re-initializing the inode failed, and we are in deep
+		 * trouble.  Try to re-add it to the inactive list.
+		 */
+		rcu_read_lock();
+		spin_lock(&ip->i_flags_lock);
+		wake = !!__xfs_iflags_test(ip, XFS_INEW);
+		ip->i_flags &= ~(XFS_INEW | XFS_INACTIVATING);
+		if (wake)
+			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
+		ASSERT(ip->i_flags & XFS_NEED_INACTIVE);
+		trace_xfs_iget_inactive_fail(ip);
+		spin_unlock(&ip->i_flags_lock);
+		rcu_read_unlock();
+		return error;
+	}
+
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	/*
+	 * Clear the per-lifetime state in the inode as we are now effectively
+	 * a new inode and need to return to the initial state before reuse
+	 * occurs.
+	 */
+	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
+	ip->i_flags |= XFS_INEW;
+	xfs_inode_clear_inactive_tag(pag, ip->i_ino);
+	inode->i_state = I_NEW;
+	ip->i_sick = 0;
+	ip->i_checked = 0;
+
+	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
+	init_rwsem(&inode->i_rwsem);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+
+	/*
+	 * Reattach dquots since we might have removed them when we put this
+	 * inode on the inactivation list.
+	 */
+	return xfs_qm_dqattach(ip);
+}
+
 /*
  * Check the validity of the inode we just found it the cache
  */
@@ -357,14 +522,14 @@ xfs_iget_cache_hit(
 	/*
 	 * If we are racing with another cache hit that is currently
 	 * instantiating this inode or currently recycling it out of
-	 * reclaimabe state, wait for the initialisation to complete
+	 * reclaimable state, wait for the initialisation to complete
 	 * before continuing.
 	 *
 	 * XXX(hch): eventually we should do something equivalent to
 	 *	     wait_on_inode to wait for these flags to be cleared
 	 *	     instead of polling for it.
 	 */
-	if (ip->i_flags & (XFS_INEW|XFS_IRECLAIM)) {
+	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING)) {
 		trace_xfs_iget_skip(ip);
 		XFS_STATS_INC(mp, xs_ig_frecycle);
 		error = -EAGAIN;
@@ -438,6 +603,32 @@ xfs_iget_cache_hit(
 
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
+	} else if (ip->i_flags & XFS_NEED_INACTIVE) {
+		/*
+		 * If NEED_INACTIVE is set, we've torn down the VFS inode and
+		 * need to carefully get it back into useable state.
+		 */
+		trace_xfs_iget_inactive(ip);
+
+		if (flags & XFS_IGET_INCORE) {
+			error = -EAGAIN;
+			goto out_error;
+		}
+
+		/*
+		 * We need to set XFS_INACTIVATING to prevent
+		 * xfs_inactive_inode from stomping over us while we recycle
+		 * the inode.  We can't clear the radix tree inactive tag yet
+		 * as it requires pag_ici_lock to be held exclusive.
+		 */
+		ip->i_flags |= XFS_INACTIVATING;
+
+		spin_unlock(&ip->i_flags_lock);
+		rcu_read_unlock();
+
+		error = xfs_iget_inactive(pag, ip);
+		if (error)
+			return error;
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -713,6 +904,43 @@ xfs_icache_inode_is_allocated(
 	return 0;
 }
 
+/*
+ * Grab the inode for inactivation exclusively.
+ * Return true if we grabbed it.
+ */
+static bool
+xfs_inactive_grab(
+	struct xfs_inode	*ip)
+{
+	ASSERT(rcu_read_lock_held());
+
+	/* quick check for stale RCU freed inode */
+	if (!ip->i_ino)
+		return false;
+
+	/*
+	 * The i_flags_lock here protects a thread in xfs_iget from racing
+	 * with us starting inactivation of the inode.
+	 *
+	 * Due to RCU lookup, we may find inodes that have been freed, or
+	 * reallocated inodes that aren't candidates for inactivation at all,
+	 * so we must check that XFS_NEED_INACTIVE is set before proceeding.
+	 * If XFS_INACTIVATING is already set, someone else (iget or another
+	 * inactivation pass) has claimed this inode, so we ignore it.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	if (!(ip->i_flags & XFS_NEED_INACTIVE) ||
+	    (ip->i_flags & XFS_INACTIVATING)) {
+		/* not an inactivation candidate. */
+		spin_unlock(&ip->i_flags_lock);
+		return false;
+	}
+
+	ip->i_flags |= XFS_INACTIVATING;
+	spin_unlock(&ip->i_flags_lock);
+	return true;
+}
+
 /*
  * The inode lookup is done in batches to keep the amount of lock traffic and
  * radix tree lookups to a minimum. The batch size is a trade off between
@@ -736,6 +964,9 @@ xfs_inode_walk_ag_grab(
 
 	ASSERT(rcu_read_lock_held());
 
+	if (flags & XFS_INODE_WALK_INACTIVE)
+		return xfs_inactive_grab(ip);
+
 	/* Check for stale RCU freed inode */
 	spin_lock(&ip->i_flags_lock);
 	if (!ip->i_ino)
@@ -743,7 +974,8 @@ xfs_inode_walk_ag_grab(
 
 	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
 	if ((!newinos && __xfs_iflags_test(ip, XFS_INEW)) ||
-	    __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM))
+	    __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM |
+				  XFS_NEED_INACTIVE | XFS_INACTIVATING))
 		goto out_unlock_noent;
 	spin_unlock(&ip->i_flags_lock);
 
@@ -848,7 +1080,8 @@ xfs_inode_walk_ag(
 			    xfs_iflags_test(batch[i], XFS_INEW))
 				xfs_inew_wait(batch[i]);
 			error = execute(batch[i], args);
-			xfs_irele(batch[i]);
+			if (!(iter_flags & XFS_INODE_WALK_INACTIVE))
+				xfs_irele(batch[i]);
 			if (error == -EAGAIN) {
 				skipped++;
 				continue;
@@ -986,6 +1219,7 @@ xfs_reclaim_inode(
 
 	xfs_iflags_clear(ip, XFS_IFLUSHING);
 reclaim:
+	trace_xfs_inode_reclaiming(ip);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1705,3 +1939,189 @@ xfs_blockgc_free_quota(
 			xfs_inode_dquot(ip, XFS_DQTYPE_GROUP),
 			xfs_inode_dquot(ip, XFS_DQTYPE_PROJ), eof_flags);
 }
+
+/*
+ * Deferred Inode Inactivation
+ * ===========================
+ *
+ * Sometimes, inodes need to have work done on them once the last program has
+ * closed the file.  Typically this means cleaning out any leftover post-eof or
+ * CoW staging blocks for linked files.  For inodes that have been totally
+ * unlinked, this means unmapping data/attr/cow blocks, removing the inode
+ * from the unlinked buckets, and marking it free in the inobt and inode table.
+ *
+ * This process can generate many metadata updates, which shows up as close()
+ * and unlink() calls that take a long time.  We defer all that work to a
+ * per-AG workqueue which means that we can batch a lot of work and do it in
+ * inode order for better performance.  Furthermore, we can control the
+ * workqueue, which means that we can avoid doing inactivation work at a bad
+ * time, such as when the fs is frozen.
+ *
+ * Deferred inactivation introduces new inode flag states (NEED_INACTIVE and
+ * INACTIVATING) and adds a new INACTIVE radix tree tag for fast access.  We
+ * maintain separate perag counters for both types, and move counts as inodes
+ * wander the state machine, which now works as follows:
+ *
+ * If the inode needs inactivation, we:
+ *   - Set the NEED_INACTIVE inode flag
+ *   - Increment the per-AG inactive count
+ *   - Set the INACTIVE tag in the per-AG inode tree
+ *   - Set the INACTIVE tag in the per-fs AG tree
+ *   - Schedule background inode inactivation
+ *
+ * If the inode does not need inactivation, we:
+ *   - Set the RECLAIMABLE inode flag
+ *   - Increment the per-AG reclaim count
+ *   - Set the RECLAIM tag in the per-AG inode tree
+ *   - Set the RECLAIM tag in the per-fs AG tree
+ *   - Schedule background inode reclamation
+ *
+ * When it is time for background inode inactivation, we:
+ *   - Set the INACTIVATING inode flag
+ *   - Make all the on-disk updates
+ *   - Clear both INACTIVATING and NEED_INACTIVE inode flags
+ *   - Decrement the per-AG inactive count
+ *   - Clear the INACTIVE tag in the per-AG inode tree
+ *   - Clear the INACTIVE tag in the per-fs AG tree if that was the last one
+ *   - Kick the inode into reclamation per the previous paragraph.
+ *
+ * When it is time for background inode reclamation, we:
+ *   - Set the IRECLAIM inode flag
+ *   - Detach all the resources and remove the inode from the per-AG inode tree
+ *   - Clear both IRECLAIM and RECLAIMABLE inode flags
+ *   - Decrement the per-AG reclaim count
+ *   - Clear the RECLAIM tag from the per-AG inode tree
+ *   - Clear the RECLAIM tag from the per-fs AG tree if there are no more
+ *     inodes waiting for reclamation or inactivation
+ *
+ * Note that xfs_inodegc_queue and xfs_inactive_grab are further up in
+ * the source code so that we avoid static function declarations.
+ */
+
+/* Inactivate this inode. */
+STATIC int
+xfs_inactive_inode(
+	struct xfs_inode	*ip,
+	void			*args)
+{
+	struct xfs_eofblocks	*eofb = args;
+	struct xfs_perag	*pag;
+
+	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);
+
+	/*
+	 * Not a match for our passed in scan filter?  Put it back on the shelf
+	 * and move on.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	if (!xfs_inode_matches_eofb(ip, eofb)) {
+		ip->i_flags &= ~XFS_INACTIVATING;
+		spin_unlock(&ip->i_flags_lock);
+		return 0;
+	}
+	spin_unlock(&ip->i_flags_lock);
+
+	trace_xfs_inode_inactivating(ip);
+
+	xfs_inactive(ip);
+	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
+
+	/*
+	 * Clear the inactive state flags and schedule a reclaim run once
+	 * we're done with the inactivations.  We must ensure that the inode
+	 * smoothly transitions from inactivating to reclaimable so that iget
+	 * cannot see either data structure midway through the transition.
+	 */
+	pag = xfs_perag_get(ip->i_mount,
+			XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino));
+	spin_lock(&pag->pag_ici_lock);
+	spin_lock(&ip->i_flags_lock);
+
+	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
+	xfs_inode_clear_inactive_tag(pag, ip->i_ino);
+
+	__xfs_inode_set_reclaim_tag(pag, ip);
+
+	spin_unlock(&ip->i_flags_lock);
+	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
+
+	return 0;
+}
+
+/*
+ * Walk the AGs and inactivate the inodes in them. Even if the filesystem is
+ * corrupted, we still need to clear the NEED_INACTIVE iflag so that we can
+ * move on to reclaiming the inode.
+ */
+static int
+xfs_inodegc_free_space(
+	struct xfs_mount	*mp,
+	struct xfs_eofblocks	*eofb)
+{
+	return xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
+			xfs_inactive_inode, eofb, XFS_ICI_INACTIVE_TAG);
+}
+
+/* Try to get inode inactivation moving. */
+void
+xfs_inodegc_worker(
+	struct work_struct	*work)
+{
+	struct xfs_mount	*mp = container_of(to_delayed_work(work),
+					struct xfs_mount, m_inodegc_work);
+	int			error;
+
+	/*
+	 * We want to skip inode inactivation while the filesystem is frozen
+	 * because we don't want the inactivation thread to block while taking
+	 * sb_intwrite.  Therefore, we try to take sb_write for the duration
+	 * of the inactive scan -- a freeze attempt will block until we're
+	 * done here, and if the fs is past stage 1 freeze we'll bounce out
+	 * until things unfreeze.  If the fs goes down while frozen we'll
+	 * still have log recovery to clean up after us.
+	 */
+	if (!sb_start_write_trylock(mp->m_super))
+		return;
+
+	error = xfs_inodegc_free_space(mp, NULL);
+	if (error && error != -EAGAIN)
+		xfs_err(mp, "inode inactivation failed, error %d", error);
+
+	sb_end_write(mp->m_super);
+	xfs_inodegc_queue(mp);
+}
+
+/* Force all queued inode inactivation work to run immediately. */
+void
+xfs_inodegc_force(
+	struct xfs_mount	*mp)
+{
+	/*
+	 * In order to reset the delay timer to run immediately, we have to
+	 * cancel the work item and requeue it with a zero timer value.  We
+	 * don't care if the worker races with our requeue, because at worst
+	 * we iterate the radix tree and find no inodes to inactivate.
+	 */
+	if (!cancel_delayed_work(&mp->m_inodegc_work))
+		return;
+
+	queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
+	flush_delayed_work(&mp->m_inodegc_work);
+}
+
+/* Stop all queued inactivation work. */
+void
+xfs_inodegc_stop(
+	struct xfs_mount	*mp)
+{
+	cancel_delayed_work_sync(&mp->m_inodegc_work);
+}
+
+/* Schedule deferred inode inactivation work. */
+void
+xfs_inodegc_start(
+	struct xfs_mount	*mp)
+{
+	xfs_inodegc_queue(mp);
+}
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index d1fddb152420..c199b920722a 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -25,6 +25,8 @@ struct xfs_eofblocks {
 #define XFS_ICI_RECLAIM_TAG	0	/* inode is to be reclaimed */
 /* Inode has speculative preallocations (posteof or cow) to clean. */
 #define XFS_ICI_BLOCKGC_TAG	1
+/* Inode can be inactivated. */
+#define XFS_ICI_INACTIVE_TAG	2
 
 /*
  * Flags for xfs_iget()
@@ -38,6 +40,7 @@ struct xfs_eofblocks {
  * flags for AG inode iterator
  */
 #define XFS_INODE_WALK_INEW_WAIT	0x1	/* wait on new inodes */
+#define XFS_INODE_WALK_INACTIVE		0x2	/* inactivation loop */
 
 int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 	     uint flags, uint lock_flags, xfs_inode_t **ipp);
@@ -53,6 +56,7 @@ int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
+void xfs_inode_set_inactive_tag(struct xfs_inode *ip);
 
 int xfs_blockgc_free_dquots(struct xfs_mount *mp, struct xfs_dquot *udqp,
 		struct xfs_dquot *gdqp, struct xfs_dquot *pdqp,
@@ -78,4 +82,9 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
+void xfs_inodegc_worker(struct work_struct *work);
+void xfs_inodegc_force(struct xfs_mount *mp);
+void xfs_inodegc_stop(struct xfs_mount *mp);
+void xfs_inodegc_start(struct xfs_mount *mp);
+
 #endif
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 65897cb0cf2a..f20694f220c8 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1665,6 +1665,35 @@ xfs_inactive_ifree(
 	return 0;
 }
 
+/* Prepare inode for inactivation. */
+void
+xfs_inode_inactivation_prep(
+	struct xfs_inode	*ip)
+{
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
+		return;
+
+	/*
+	 * If this inode is unlinked (and now unreferenced) we need to dispose
+	 * of it in the on disk metadata.
+	 *
+	 * Change the generation so that the inode can't be opened by handle
+	 * now that the last external reference has dropped.  Bulkstat won't
+	 * return inodes with zero nlink so nobody will ever find this inode
+	 * again.  Then add this inode & blocks to the counts of things that
+	 * will be freed during the next inactivation run.
+	 */
+	if (VFS_I(ip)->i_nlink == 0)
+		VFS_I(ip)->i_generation = prandom_u32();
+
+	/*
+	 * Detach dquots just in case someone tries a quotaoff while the inode
+	 * is waiting on the inactive list.  We'll reattach them (if needed)
+	 * when inactivating the inode.
+	 */
+	xfs_qm_dqdetach(ip);
+}
+
 /*
  * Returns true if we need to update the on-disk metadata before we can free
  * the memory used by this inode.  Updates include freeing post-eof
@@ -1738,7 +1767,7 @@ xfs_inode_needs_inactivation(
  */
 void
 xfs_inactive(
-	xfs_inode_t	*ip)
+	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp;
 	int			error;
@@ -1764,6 +1793,16 @@ xfs_inactive(
 	if (xfs_is_metadata_inode(ip))
 		return;
 
+	/*
+	 * Re-attach dquots prior to freeing EOF blocks or CoW staging extents.
+	 * We dropped the dquot prior to inactivation (because quotaoff can't
+	 * resurrect inactive inodes to force-drop the dquot) so we /must/
+	 * do this before touching any block mappings.
+	 */
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return;
+
 	/* Try to clean out the cow blocks if there are any. */
 	if (xfs_inode_has_cow_data(ip))
 		xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true);
@@ -1789,10 +1828,6 @@ xfs_inactive(
 	     ip->i_df.if_nextents > 0 || ip->i_delayed_blks > 0))
 		truncate = 1;
 
-	error = xfs_qm_dqattach(ip);
-	if (error)
-		return;
-
 	if (S_ISLNK(VFS_I(ip)->i_mode))
 		error = xfs_inactive_symlink(ip);
 	else if (truncate)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3fe8c8afbc72..7aaff07d1210 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -222,6 +222,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define XFS_IRECLAIMABLE	(1 << 2) /* inode can be reclaimed */
 #define __XFS_INEW_BIT		3	 /* inode has just been allocated */
 #define XFS_INEW		(1 << __XFS_INEW_BIT)
+#define XFS_NEED_INACTIVE	(1 << 4) /* see XFS_INACTIVATING below */
 #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
 #define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
 #define XFS_IFLUSHING		(1 << 7) /* inode is being flushed */
@@ -236,6 +237,15 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
 #define XFS_IRECOVERY		(1 << 11)
 #define XFS_ICOWBLOCKS		(1 << 12)/* has the cowblocks tag set */
 
+/*
+ * If we need to update on-disk metadata before this IRECLAIMABLE inode can be
+ * freed, then NEED_INACTIVE will be set.  Once we start the updates, the
+ * INACTIVATING bit will be set to keep iget away from this inode.  After the
+ * inactivation completes, both flags will be cleared and the inode is a
+ * plain old IRECLAIMABLE inode.
+ */
+#define XFS_INACTIVATING	(1 << 13)
+
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
  * inode lookup. This prevents unintended behaviour on the new inode from
@@ -243,7 +253,8 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
-	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
+	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
+	 XFS_INACTIVATING)
 
 /*
  * Flags for inode locking.
@@ -481,6 +492,7 @@ extern struct kmem_zone	*xfs_inode_zone;
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
 bool xfs_inode_needs_inactivation(struct xfs_inode *ip);
+void xfs_inode_inactivation_prep(struct xfs_inode *ip);
 
 int xfs_iunlink_init(struct xfs_perag *pag);
 void xfs_iunlink_destroy(struct xfs_perag *pag);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 97f31308de03..b03b127e34cc 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2792,6 +2792,13 @@ xlog_recover_process_iunlinks(
 		}
 		xfs_buf_rele(agibp);
 	}
+
+	/*
+	 * Now that we've put all the iunlink inodes on the lru, let's make
+	 * sure that we perform all the on-disk metadata updates to actually
+	 * free those inodes.
+	 */
+	xfs_inodegc_force(mp);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 1c97b155a8ee..cd015e3d72fc 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -640,6 +640,10 @@ xfs_check_summary_counts(
  * so we need to unpin them, write them back and/or reclaim them before unmount
  * can proceed.
  *
+ * Start the process by pushing all inodes through the inactivation process
+ * so that all file updates to on-disk metadata can be flushed with the log.
+ * After the AIL push, all inodes should be ready for reclamation.
+ *
  * An inode cluster that has been freed can have its buffer still pinned in
  * memory because the transaction is still sitting in a iclog. The stale inodes
  * on that buffer will be pinned to the buffer until the transaction hits the
@@ -663,6 +667,7 @@ static void
 xfs_unmount_flush_inodes(
 	struct xfs_mount	*mp)
 {
+	xfs_inodegc_force(mp);
 	xfs_log_force(mp, XFS_LOG_SYNC);
 	xfs_extent_busy_wait_all(mp);
 	flush_workqueue(xfs_discard_wq);
@@ -670,6 +675,7 @@ xfs_unmount_flush_inodes(
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
 
 	xfs_ail_push_all_sync(mp->m_ail);
+	xfs_inodegc_stop(mp);
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
@@ -1095,6 +1101,13 @@ xfs_unmountfs(
 	uint64_t		resblks;
 	int			error;
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes.
+	 * Since this can involve finobt updates, do it now before we lose the
+	 * per-AG space reservations.
+	 */
+	xfs_inodegc_force(mp);
+
 	xfs_blockgc_stop(mp);
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 81829d19596e..ce00ad47b8ea 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -177,6 +177,7 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
+	struct delayed_work	m_inodegc_work; /* background inode inactive */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -349,7 +350,8 @@ typedef struct xfs_perag {
 
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
-	int		pag_ici_reclaimable;	/* reclaimable inodes */
+	unsigned int	pag_ici_reclaimable;	/* reclaimable inodes */
+	unsigned int	pag_ici_inactive;	/* inactive inodes */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index ca1b57d291dc..0f9a1450fe0e 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -104,6 +104,12 @@ xfs_qm_scall_quotaoff(
 	uint			inactivate_flags;
 	struct xfs_qoff_logitem	*qoffstart = NULL;
 
+	/*
+	 * Clean up the inactive list before we turn quota off, to reduce the
+	 * amount of quotaoff work we have to do with the mutex held.
+	 */
+	xfs_inodegc_force(mp);
+
 	/*
 	 * No file system can have quotas enabled on disk but not in core.
 	 * Note that quota utilities (like quotaoff) _expect_
@@ -697,6 +703,13 @@ xfs_qm_scall_getquota(
 	struct xfs_dquot	*dqp;
 	int			error;
 
+	/*
+	 * Process all the queued file and speculative preallocation cleanup so
+	 * that the counter values we report here do not incorporate any
+	 * resources that were previously deleted.
+	 */
+	xfs_inodegc_force(mp);
+
 	/*
 	 * Try to get the dquot. We don't want it allocated on disk, so don't
 	 * set doalloc. If it doesn't exist, we'll get ENOENT back.
@@ -735,6 +748,13 @@ xfs_qm_scall_getquota_next(
 	struct xfs_dquot	*dqp;
 	int			error;
 
+	/*
+	 * Process all the queued file and speculative preallocation cleanup so
+	 * that the counter values we report here do not incorporate any
+	 * resources that were previously deleted.
+	 */
+	xfs_inodegc_force(mp);
+
 	error = xfs_qm_dqget_next(mp, *id, type, &dqp);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e774358383d6..8d0142487fc7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -637,28 +637,34 @@ xfs_fs_destroy_inode(
 	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	bool			need_inactive;
 
 	trace_xfs_destroy_inode(ip);
 
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-	XFS_STATS_INC(ip->i_mount, vn_rele);
-	XFS_STATS_INC(ip->i_mount, vn_remove);
+	XFS_STATS_INC(mp, vn_rele);
+	XFS_STATS_INC(mp, vn_remove);
 
-	xfs_inactive(ip);
-
-	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
+	need_inactive = xfs_inode_needs_inactivation(ip);
+	if (need_inactive) {
+		trace_xfs_inode_set_need_inactive(ip);
+		xfs_inode_inactivation_prep(ip);
+	} else if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
 		xfs_check_delalloc(ip, XFS_COW_FORK);
 		ASSERT(0);
 	}
-
-	XFS_STATS_INC(ip->i_mount, vn_reclaim);
+	XFS_STATS_INC(mp, vn_reclaim);
+	trace_xfs_inode_set_reclaimable(ip);
 
 	/*
 	 * We should never get here with one of the reclaim flags already set.
 	 */
 	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
 	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_NEED_INACTIVE));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_INACTIVATING));
 
 	/*
 	 * We always use background reclaim here because even if the inode is
@@ -667,7 +673,10 @@ xfs_fs_destroy_inode(
 	 * reclaim path handles this more efficiently than we can here, so
 	 * simply let background reclaim tear down all inodes.
 	 */
-	xfs_inode_set_reclaim_tag(ip);
+	if (need_inactive)
+		xfs_inode_set_inactive_tag(ip);
+	else
+		xfs_inode_set_reclaim_tag(ip);
 }
 
 static void
@@ -797,6 +806,13 @@ xfs_fs_statfs(
 	xfs_extlen_t		lsize;
 	int64_t			ffree;
 
+	/*
+	 * Process all the queued file and speculative preallocation cleanup so
+	 * that the counter values we report here do not incorporate any
+	 * resources that were previously deleted.
+	 */
+	xfs_inodegc_force(mp);
+
 	statp->f_type = XFS_SUPER_MAGIC;
 	statp->f_namelen = MAXNAMELEN - 1;
 
@@ -911,6 +927,18 @@ xfs_fs_unfreeze(
 	return 0;
 }
 
+/*
+ * Before we get to stage 1 of a freeze, force all the inactivation work so
+ * that there's less work to do if we crash during the freeze.
+ */
+STATIC int
+xfs_fs_freeze_super(
+	struct super_block	*sb)
+{
+	xfs_inodegc_force(XFS_M(sb));
+	return freeze_super(sb);
+}
+
 /*
  * This function fills in xfs_mount_t fields based on mount args.
  * Note: the superblock _has_ now been read in.
@@ -1089,6 +1117,7 @@ static const struct super_operations xfs_super_operations = {
 	.show_options		= xfs_fs_show_options,
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
+	.freeze_super		= xfs_fs_freeze_super,
 };
 
 static int
@@ -1720,6 +1749,13 @@ xfs_remount_ro(
 		return error;
 	}
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes.
+	 * Since this can involve finobt updates, do it now before we lose the
+	 * per-AG space reservations.
+	 */
+	xfs_inodegc_force(mp);
+
 	/* Free the per-AG metadata reservation pool. */
 	error = xfs_fs_unreserve_ag_blocks(mp);
 	if (error) {
@@ -1843,6 +1879,7 @@ static int xfs_init_fs_context(
 	mutex_init(&mp->m_growlock);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
+	INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	/*
 	 * We don't create the finobt per-ag space reservation until after log
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e74bbb648f83..9193cfbb02ef 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -157,6 +157,8 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_set_reclaim);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_reclaim);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_blockgc);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_blockgc);
+DEFINE_PERAG_REF_EVENT(xfs_perag_set_inactive);
+DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inactive);
 
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
@@ -617,14 +619,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
+		__field(unsigned long, iflags)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
+		__entry->iflags = ip->i_flags;
 	),
-	TP_printk("dev %d:%d ino 0x%llx",
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino)
+		  __entry->ino,
+		  __entry->iflags)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -634,6 +639,8 @@ DEFINE_EVENT(xfs_inode_class, name, \
 DEFINE_INODE_EVENT(xfs_iget_skip);
 DEFINE_INODE_EVENT(xfs_iget_reclaim);
 DEFINE_INODE_EVENT(xfs_iget_reclaim_fail);
+DEFINE_INODE_EVENT(xfs_iget_inactive);
+DEFINE_INODE_EVENT(xfs_iget_inactive_fail);
 DEFINE_INODE_EVENT(xfs_iget_hit);
 DEFINE_INODE_EVENT(xfs_iget_miss);
 
@@ -668,6 +675,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
+DEFINE_INODE_EVENT(xfs_inode_reclaiming);
+DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
+DEFINE_INODE_EVENT(xfs_inode_inactivating);
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the



* [PATCH 07/11] xfs: expose sysfs knob to control inode inactivation delay
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-03-11  3:06 ` [PATCH 06/11] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-03-11  3:06 ` Darrick J. Wong
  2021-03-11  3:06 ` [PATCH 08/11] xfs: force inode inactivation and retry fs writes when there isn't space Darrick J. Wong
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow administrators to control how long we defer inode inactivation.
By default we'll set the delay to 2 seconds, as an arbitrary choice
between allowing for some batching of a deltree operation, and not
letting too many inodes pile up in memory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
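A rough userspace sketch of the unit handling described above, for
readers who want to check the arithmetic.  The struct and function names
are hypothetical and exist only for this illustration; the min/default/max
values are the ones in the xfs_globals.c hunk below, and the "* 10" is the
centisecond to millisecond conversion that the xfs_icache.c hunk feeds to
msecs_to_jiffies.

```c
/* Hypothetical userspace mirror of a {min, val, max} sysctl triple. */
struct gc_delay_param {
	int min;	/* centiseconds */
	int val;	/* centiseconds */
	int max;	/* centiseconds */
};

/* Values copied from the xfs_globals.c hunk: 1, 2*100, 3600*100. */
static const struct gc_delay_param inodegc_default = {
	1, 2 * 100, 3600 * 100
};

/* Return 1 if a proposed setting passes a min/max range check. */
static int gc_delay_valid(const struct gc_delay_param *p, int centisecs)
{
	return centisecs >= p->min && centisecs <= p->max;
}

/* Convert centiseconds to milliseconds (1 cs = 10 ms). */
static unsigned int gc_delay_msecs(int centisecs)
{
	return (unsigned int)centisecs * 10u;
}
```

With the default of 200 centiseconds this yields 2000 ms, and the
largest accepted setting (360000) corresponds to one hour.
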
 Documentation/admin-guide/xfs.rst |    9 +++++++++
 fs/xfs/xfs_globals.c              |    3 +++
 fs/xfs/xfs_icache.c               |    2 +-
 fs/xfs/xfs_linux.h                |    1 +
 fs/xfs/xfs_sysctl.c               |    9 +++++++++
 fs/xfs/xfs_sysctl.h               |    1 +
 6 files changed, 24 insertions(+), 1 deletion(-)


diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index f9b109bfc6a6..608d0ba7a86e 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -277,6 +277,15 @@ The following sysctls are available for the XFS filesystem:
 	references and returns timed-out AGs back to the free stream
 	pool.
 
+  fs.xfs.inode_gc_delay
+	(Units: centiseconds   Min: 1  Default: 200  Max: 360000)
+	The amount of time to delay garbage collection of inodes that
+	have been closed or have been unlinked from the directory tree.
+	Garbage collection here means clearing speculative preallocations
+	from linked files and freeing unlinked inodes.  A higher value
+	here enables more batching at a cost of delayed reclamation of
+	incore inodes.
+
   fs.xfs.speculative_prealloc_lifetime
 	(Units: seconds   Min: 1  Default: 300  Max: 86400)
 	The interval at which the background scanning for inodes
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index f62fa652c2fd..2945c2c54cf0 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -28,6 +28,9 @@ xfs_param_t xfs_params = {
 	.rotorstep	= {	1,		1,		255	},
 	.inherit_nodfrg	= {	0,		1,		1	},
 	.fstrm_timer	= {	1,		30*100,		3600*100},
+	.inodegc_timer	= {	1,		2*100,		3600*100},
+
+	/* Values below here are measured in seconds */
 	.blockgc_timer	= {	1,		300,		3600*24},
 };
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 1b7652af5ee5..6081bba3c6ce 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -250,7 +250,7 @@ xfs_inodegc_queue(
 	rcu_read_lock();
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG))
 		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
-				2 * HZ);
+				msecs_to_jiffies(xfs_inodegc_centisecs * 10));
 	rcu_read_unlock();
 }
 
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index af6be9b9ccdf..b4c5a2c71f43 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -99,6 +99,7 @@ typedef __u32			xfs_nlink_t;
 #define xfs_inherit_nodefrag	xfs_params.inherit_nodfrg.val
 #define xfs_fstrm_centisecs	xfs_params.fstrm_timer.val
 #define xfs_blockgc_secs	xfs_params.blockgc_timer.val
+#define xfs_inodegc_centisecs	xfs_params.inodegc_timer.val
 
 #define current_cpu()		(raw_smp_processor_id())
 #define current_set_flags_nested(sp, f)		\
diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c
index 546a6cd96729..878f31d3a587 100644
--- a/fs/xfs/xfs_sysctl.c
+++ b/fs/xfs/xfs_sysctl.c
@@ -176,6 +176,15 @@ static struct ctl_table xfs_table[] = {
 		.extra1		= &xfs_params.fstrm_timer.min,
 		.extra2		= &xfs_params.fstrm_timer.max,
 	},
+	{
+		.procname	= "inode_gc_delay",
+		.data		= &xfs_params.inodegc_timer.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &xfs_params.inodegc_timer.min,
+		.extra2		= &xfs_params.inodegc_timer.max
+	},
 	{
 		.procname	= "speculative_prealloc_lifetime",
 		.data		= &xfs_params.blockgc_timer.val,
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index 7692e76ead33..a045c33c3d30 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -36,6 +36,7 @@ typedef struct xfs_param {
 	xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */
 	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
 	xfs_sysctl_val_t blockgc_timer;	/* Interval between blockgc scans */
+	xfs_sysctl_val_t inodegc_timer;	/* Inode inactivation scan interval */
 } xfs_param_t;
 
 /*



* [PATCH 08/11] xfs: force inode inactivation and retry fs writes when there isn't space
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (6 preceding siblings ...)
  2021-03-11  3:06 ` [PATCH 07/11] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
@ 2021-03-11  3:06 ` Darrick J. Wong
  2021-03-15 18:54   ` Christoph Hellwig
  2021-03-11  3:06 ` [PATCH 09/11] xfs: force inode garbage collection before fallocate when space is low Darrick J. Wong
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Any time we try to modify a file's contents and it fails due to ENOSPC
or EDQUOT, force inode inactivation work to try to free space.  We're
going to use the xfs_inodegc_free_space function externally in the next
patch, so add it to xfs_icache.h now to reduce churn.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
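To illustrate the intended flow, here is a toy userspace model of
"write fails with ENOSPC/EDQUOT, force garbage collection, retry".
Everything named fake_* is hypothetical and is not kernel code; only the
ordering (block gc scan first, then the inode gc scan, stopping at the
first error) is taken from the xfs_blockgc_free_space hunk below.

```c
#include <errno.h>

/* Hypothetical in-memory stand-in for a filesystem; not an XFS struct. */
struct fake_fs {
	int free_blocks;	/* blocks available to writers */
	int inactive_blocks;	/* blocks held by inodes awaiting inactivation */
	int inodegc_scans;	/* number of forced inode gc passes */
};

/* Stand-in for the blockgc scan: nothing to trim in this toy model. */
static int fake_blockgc(struct fake_fs *fs)
{
	(void)fs;
	return 0;
}

/* Stand-in for the inodegc scan: release everything held by dead inodes. */
static int fake_inodegc(struct fake_fs *fs)
{
	fs->free_blocks += fs->inactive_blocks;
	fs->inactive_blocks = 0;
	fs->inodegc_scans++;
	return 0;
}

/* Same shape as the patched function: blockgc first, then inodegc. */
static int fake_free_space(struct fake_fs *fs)
{
	int error = fake_blockgc(fs);

	if (error)
		return error;
	return fake_inodegc(fs);
}

/* Toy one-block write that fails with -ENOSPC when nothing is free. */
static int fake_write(struct fake_fs *fs)
{
	if (fs->free_blocks < 1)
		return -ENOSPC;
	fs->free_blocks--;
	return 0;
}

/* Retry once after ENOSPC/EDQUOT, forcing garbage collection in between. */
static int fake_write_retry(struct fake_fs *fs)
{
	int error = fake_write(fs);

	if (error != -ENOSPC && error != -EDQUOT)
		return error;
	error = fake_free_space(fs);
	if (error)
		return error;
	return fake_write(fs);
}
```

In this model a filesystem with no free blocks but four blocks pinned
by inactive inodes satisfies the write on the second attempt.
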
 fs/xfs/xfs_icache.c |   10 ++++++++--
 fs/xfs/xfs_icache.h |    1 +
 2 files changed, 9 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 6081bba3c6ce..594d340bbe37 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1868,10 +1868,16 @@ xfs_blockgc_free_space(
 	struct xfs_mount	*mp,
 	struct xfs_eofblocks	*eofb)
 {
+	int			error;
+
 	trace_xfs_blockgc_free_space(mp, eofb, _RET_IP_);
 
-	return xfs_inode_walk(mp, 0, xfs_blockgc_scan_inode, eofb,
+	error =  xfs_inode_walk(mp, 0, xfs_blockgc_scan_inode, eofb,
 			XFS_ICI_BLOCKGC_TAG);
+	if (error)
+		return error;
+
+	return xfs_inodegc_free_space(mp, eofb);
 }
 
 /*
@@ -2054,7 +2060,7 @@ xfs_inactive_inode(
  * corrupted, we still need to clear the INACTIVE iflag so that we can move
  * on to reclaiming the inode.
  */
-static int
+int
 xfs_inodegc_free_space(
 	struct xfs_mount	*mp,
 	struct xfs_eofblocks	*eofb)
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index c199b920722a..9d5a1f4c0369 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -86,5 +86,6 @@ void xfs_inodegc_worker(struct work_struct *work);
 void xfs_inodegc_force(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
+int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_eofblocks *eofb);
 
 #endif



* [PATCH 09/11] xfs: force inode garbage collection before fallocate when space is low
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (7 preceding siblings ...)
  2021-03-11  3:06 ` [PATCH 08/11] xfs: force inode inactivation and retry fs writes when there isn't space Darrick J. Wong
@ 2021-03-11  3:06 ` Darrick J. Wong
  2021-03-11  3:06 ` [PATCH 10/11] xfs: parallelize inode inactivation Darrick J. Wong
  2021-03-11  3:06 ` [PATCH 11/11] xfs: create a polled function to force " Darrick J. Wong
  10 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Generally speaking, when a user calls fallocate, they're looking to
preallocate space in a file in the largest contiguous chunks possible.
If free space is low, it's possible that the free space will look
unnecessarily fragmented because there are unlinked inodes that are
holding on to space that we could allocate.  When this happens,
fallocate makes suboptimal allocation decisions for the sake of deleted
files, which doesn't make much sense, so scan the filesystem for dead
items to delete to try to avoid this.

Note that there are a handful of fstests that fill a filesystem, delete
just enough files to allow a single large allocation, and check that
fallocate actually gets the allocation.  These tests regress because the
test runs fallocate before the inode gc has a chance to run, so add this
behavior to maintain as much of the old behavior as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
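A simplified userspace sketch of the heuristic, covering only the
non-realtime branch: force inode gc only if no single AG has enough free
blocks for the whole request.  The function name and the plain array of
per-AG free counts are assumptions made for the example; the real code
below reads pagf_freeblks from each perag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if no single allocation group has at least want_fsb free
 * blocks, i.e. the request could not be satisfied from one AG.
 */
static bool should_force_inodegc(const uint32_t *ag_freeblks,
				 size_t agcount, uint64_t want_fsb)
{
	for (size_t i = 0; i < agcount; i++) {
		if (ag_freeblks[i] >= want_fsb)
			return false;
	}
	return true;
}
```

Note that the check is per AG, not filesystem-wide: with AGs holding
10, 50 and 20 free blocks, a 60 block request triggers the scan even
though 80 blocks are free in total.
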
 fs/xfs/xfs_bmap_util.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 21aa38183ae9..6d2fece45bdc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -28,6 +28,7 @@
 #include "xfs_icache.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_sb.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -733,6 +734,44 @@ xfs_free_eofblocks(
 	return error;
 }
 
+/*
+ * If we suspect that the target device is full enough that it won't be able
+ * to satisfy the entire request, try a non-sync inode inactivation scan to
+ * free up space.  While it's perfectly fine to fill a preallocation request
+ * with a bunch of short extents, we'd prefer to do the inactivation work now
+ * to combat long term fragmentation in new file data.  This is purely for
+ * optimization, so we don't take any blocking locks and we only look for space
+ * that is already on the reclaim list (i.e. we don't zap speculative
+ * preallocations).
+ */
+static int
+xfs_alloc_reclaim_inactive_space(
+	struct xfs_mount	*mp,
+	bool			is_rt,
+	xfs_filblks_t		allocatesize_fsb)
+{
+	struct xfs_perag	*pag;
+	struct xfs_sb		*sbp = &mp->m_sb;
+	xfs_extlen_t		free;
+	xfs_agnumber_t		agno;
+
+	if (is_rt) {
+		if (sbp->sb_frextents * sbp->sb_rextsize >= allocatesize_fsb)
+			return 0;
+	} else {
+		for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+			pag = xfs_perag_get(mp, agno);
+			free = pag->pagf_freeblks;
+			xfs_perag_put(pag);
+
+			if (free >= allocatesize_fsb)
+				return 0;
+		}
+	}
+
+	return xfs_inodegc_free_space(mp, NULL);
+}
+
 int
 xfs_alloc_file_space(
 	struct xfs_inode	*ip,
@@ -817,6 +856,11 @@ xfs_alloc_file_space(
 			rblocks = 0;
 		}
 
+		error = xfs_alloc_reclaim_inactive_space(mp, rt,
+				allocatesize_fsb);
+		if (error)
+			break;
+
 		/*
 		 * Allocate and setup the transaction.
 		 */



* [PATCH 10/11] xfs: parallelize inode inactivation
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (8 preceding siblings ...)
  2021-03-11  3:06 ` [PATCH 09/11] xfs: force inode garbage collection before fallocate when space is low Darrick J. Wong
@ 2021-03-11  3:06 ` Darrick J. Wong
  2021-03-15 18:55   ` Christoph Hellwig
  2021-03-23 22:21   ` Dave Chinner
  2021-03-11  3:06 ` [PATCH 11/11] xfs: create a polled function to force " Darrick J. Wong
  10 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Split the inode inactivation work into per-AG work items so that we can
take advantage of parallelization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
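As a sketch of the per-AG force logic in this patch: each AG is asked to
requeue its own work item for immediate execution, and the caller only
needs to flush if at least one AG actually had work pending.  The
fake_ag structure and both function names are hypothetical; the toy
model replaces the cancel-and-requeue of a delayed work item with two
booleans.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-AG state standing in for a delayed work item. */
struct fake_ag {
	bool work_pending;	/* a delayed gc pass is queued */
	bool run_now;		/* the pass was requeued with zero delay */
};

/* Requeue one AG's work for immediate execution if it was pending. */
static bool force_ag(struct fake_ag *ag)
{
	if (!ag->work_pending)
		return false;
	ag->run_now = true;
	return true;
}

/* Return true if any AG was requeued, i.e. a flush is worthwhile. */
static bool force_all(struct fake_ag *ags, size_t n)
{
	bool queued = false;

	for (size_t i = 0; i < n; i++)
		queued |= force_ag(&ags[i]);
	return queued;
}
```
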
 fs/xfs/xfs_icache.c |   62 ++++++++++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_mount.c  |    3 ++
 fs/xfs/xfs_mount.h  |    4 ++-
 fs/xfs/xfs_super.c  |    1 -
 4 files changed, 52 insertions(+), 18 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 594d340bbe37..d5f580b92e48 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -245,11 +245,13 @@ xfs_inode_clear_reclaim_tag(
 /* Queue a new inode gc pass if there are inodes needing inactivation. */
 static void
 xfs_inodegc_queue(
-	struct xfs_mount        *mp)
+	struct xfs_perag	*pag)
 {
+	struct xfs_mount	*mp = pag->pag_mount;
+
 	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG))
-		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
+	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INACTIVE_TAG))
+		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work,
 				msecs_to_jiffies(xfs_inodegc_centisecs * 10));
 	rcu_read_unlock();
 }
@@ -272,7 +274,7 @@ xfs_perag_set_inactive_tag(
 	spin_unlock(&mp->m_perag_lock);
 
 	/* schedule periodic background inode inactivation */
-	xfs_inodegc_queue(mp);
+	xfs_inodegc_queue(pag);
 
 	trace_xfs_perag_set_inactive(mp, pag->pag_agno, -1, _RET_IP_);
 }
@@ -2074,8 +2076,9 @@ void
 xfs_inodegc_worker(
 	struct work_struct	*work)
 {
-	struct xfs_mount	*mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_inodegc_work);
+	struct xfs_perag	*pag = container_of(to_delayed_work(work),
+					struct xfs_perag, pag_inodegc_work);
+	struct xfs_mount	*mp = pag->pag_mount;
 	int			error;
 
 	/*
@@ -2095,25 +2098,44 @@ xfs_inodegc_worker(
 		xfs_err(mp, "inode inactivation failed, error %d", error);
 
 	sb_end_write(mp->m_super);
-	xfs_inodegc_queue(mp);
+	xfs_inodegc_queue(pag);
 }
 
-/* Force all queued inode inactivation work to run immediately. */
-void
-xfs_inodegc_force(
-	struct xfs_mount	*mp)
+/* Garbage collect all inactive inodes in an AG immediately. */
+static inline bool
+xfs_inodegc_force_pag(
+	struct xfs_perag	*pag)
 {
+	struct xfs_mount	*mp = pag->pag_mount;
+
 	/*
 	 * In order to reset the delay timer to run immediately, we have to
 	 * cancel the work item and requeue it with a zero timer value.  We
 	 * don't care if the worker races with our requeue, because at worst
 	 * we iterate the radix tree and find no inodes to inactivate.
 	 */
-	if (!cancel_delayed_work(&mp->m_inodegc_work))
+	if (!cancel_delayed_work(&pag->pag_inodegc_work))
+		return false;
+
+	queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, 0);
+	return true;
+}
+
+/* Force all queued inode inactivation work to run immediately. */
+void
+xfs_inodegc_force(
+	struct xfs_mount	*mp)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	bool			queued = false;
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
+		queued |= xfs_inodegc_force_pag(pag);
+	if (!queued)
 		return;
 
-	queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
-	flush_delayed_work(&mp->m_inodegc_work);
+	flush_workqueue(mp->m_gc_workqueue);
 }
 
 /* Stop all queued inactivation work. */
@@ -2121,7 +2143,11 @@ void
 xfs_inodegc_stop(
 	struct xfs_mount	*mp)
 {
-	cancel_delayed_work_sync(&mp->m_inodegc_work);
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
+		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 }
 
 /* Schedule deferred inode inactivation work. */
@@ -2129,5 +2155,9 @@ void
 xfs_inodegc_start(
 	struct xfs_mount	*mp)
 {
-	xfs_inodegc_queue(mp);
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
+		xfs_inodegc_queue(pag);
 }
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index cd015e3d72fc..a5963061485c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -127,6 +127,7 @@ __xfs_free_perag(
 	struct xfs_perag *pag = container_of(head, struct xfs_perag, rcu_head);
 
 	ASSERT(!delayed_work_pending(&pag->pag_blockgc_work));
+	ASSERT(!delayed_work_pending(&pag->pag_inodegc_work));
 	ASSERT(atomic_read(&pag->pag_ref) == 0);
 	kmem_free(pag);
 }
@@ -148,6 +149,7 @@ xfs_free_perag(
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
+		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
@@ -204,6 +206,7 @@ xfs_initialize_perag(
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
 		INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker);
+		INIT_DELAYED_WORK(&pag->pag_inodegc_work, xfs_inodegc_worker);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 
 		error = xfs_buf_hash_init(pag);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index ce00ad47b8ea..835c07d00cd7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -177,7 +177,6 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
-	struct delayed_work	m_inodegc_work; /* background inode inactive */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -370,6 +369,9 @@ typedef struct xfs_perag {
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;
 
+	/* background inode inactivation */
+	struct delayed_work	pag_inodegc_work;
+
 	/* reference count */
 	uint8_t			pagf_refcount_level;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8d0142487fc7..566e5657c1b0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1879,7 +1879,6 @@ static int xfs_init_fs_context(
 	mutex_init(&mp->m_growlock);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
-	INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	/*
 	 * We don't create the finobt per-ag space reservation until after log



* [PATCH 11/11] xfs: create a polled function to force inode inactivation
  2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
                   ` (9 preceding siblings ...)
  2021-03-11  3:06 ` [PATCH 10/11] xfs: parallelize inode inactivation Darrick J. Wong
@ 2021-03-11  3:06 ` Darrick J. Wong
  2021-03-23 22:31   ` Dave Chinner
  10 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-11  3:06 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a polled version of xfs_inodegc_force so that we can force
inactivation while holding a lock (usually the umount lock) without
tripping over the softlockup timer.  This is for callers that hold vfs
locks while calling inactivation, which currently are unmount, iunlink
processing during mount, and rw->ro remount.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
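The polling loop can be modelled in userspace roughly as follows.  This
is a deterministic toy, not the kernel implementation: wait_one_interval
is a hypothetical stand-in for one wait_event_timeout call, and it
pretends that a worker retires exactly one inode per interval.  The
counter stands in for touch_softlockup_watchdog.

```c
#include <stdbool.h>

/* Hypothetical state: inodes still tagged, and watchdog pokes so far. */
struct poll_state {
	int pending;
	int watchdog_touches;
};

/*
 * Stand-in for one timed wait.  Pretend a worker finishes one inode
 * during the interval, then report whether nothing is left.
 */
static bool wait_one_interval(struct poll_state *st)
{
	if (st->pending > 0)
		st->pending--;
	return st->pending == 0;
}

/* Poll until all work is done, touching the watchdog on each timeout. */
static void force_poll(struct poll_state *st)
{
	if (st->pending == 0)
		return;
	while (!wait_one_interval(st))
		st->watchdog_touches++;
}
```

Starting with three pending inodes, the loop times out twice (touching
the watchdog each time) and exits on the third interval.
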
 fs/xfs/xfs_icache.c |   38 +++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h |    1 +
 fs/xfs/xfs_mount.c  |    2 +-
 fs/xfs/xfs_mount.h  |    5 +++++
 fs/xfs/xfs_super.c  |    3 ++-
 5 files changed, 46 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d5f580b92e48..9db2beb4e732 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -25,6 +25,7 @@
 #include "xfs_ialloc.h"
 
 #include <linux/iversion.h>
+#include <linux/nmi.h>
 
 /*
  * Allocate and initialise an xfs_inode.
@@ -2067,8 +2068,12 @@ xfs_inodegc_free_space(
 	struct xfs_mount	*mp,
 	struct xfs_eofblocks	*eofb)
 {
-	return xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
+	int			error;
+
+	error = xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
 			xfs_inactive_inode, eofb, XFS_ICI_INACTIVE_TAG);
+	wake_up(&mp->m_inactive_wait);
+	return error;
 }
 
 /* Try to get inode inactivation moving. */
@@ -2138,6 +2143,37 @@ xfs_inodegc_force(
 	flush_workqueue(mp->m_gc_workqueue);
 }
 
+/*
+ * Force all inode inactivation work to run immediately, and poll until the
+ * work is complete.  Callers should only use this function if they must
+ * inactivate inodes while holding VFS locks, and must be prepared to prevent
+ * or to wait for inodes that are queued for inactivation while this runs.
+ */
+void
+xfs_inodegc_force_poll(
+	struct xfs_mount	*mp)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	bool			queued = false;
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
+		queued |= xfs_inodegc_force_pag(pag);
+	if (!queued)
+		return;
+
+	/*
+	 * Touch the softlockup watchdog every 1/10th of a second while there
+	 * are still inactivation-tagged inodes in the filesystem.
+	 */
+	while (!wait_event_timeout(mp->m_inactive_wait,
+				   !radix_tree_tagged(&mp->m_perag_tree,
+						      XFS_ICI_INACTIVE_TAG),
+				   HZ / 10)) {
+		touch_softlockup_watchdog();
+	}
+}
+
 /* Stop all queued inactivation work. */
 void
 xfs_inodegc_stop(
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 9d5a1f4c0369..80a79bace641 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -84,6 +84,7 @@ void xfs_blockgc_start(struct xfs_mount *mp);
 
 void xfs_inodegc_worker(struct work_struct *work);
 void xfs_inodegc_force(struct xfs_mount *mp);
+void xfs_inodegc_force_poll(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
 int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_eofblocks *eofb);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index a5963061485c..1012b1b361ba 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1109,7 +1109,7 @@ xfs_unmountfs(
 	 * Since this can involve finobt updates, do it now before we lose the
 	 * per-AG space reservations.
 	 */
-	xfs_inodegc_force(mp);
+	xfs_inodegc_force_poll(mp);
 
 	xfs_blockgc_stop(mp);
 	xfs_fs_unreserve_ag_blocks(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 835c07d00cd7..23d9888d2b82 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -213,6 +213,11 @@ typedef struct xfs_mount {
 	unsigned int		*m_errortag;
 	struct xfs_kobj		m_errortag_kobj;
 #endif
+	/*
+	 * Use this to wait for the inode inactivation workqueue to finish
+	 * inactivating all the inodes.
+	 */
+	struct wait_queue_head	m_inactive_wait;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 566e5657c1b0..8329a3efced7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1754,7 +1754,7 @@ xfs_remount_ro(
 	 * Since this can involve finobt updates, do it now before we lose the
 	 * per-AG space reservations.
 	 */
-	xfs_inodegc_force(mp);
+	xfs_inodegc_force_poll(mp);
 
 	/* Free the per-AG metadata reservation pool. */
 	error = xfs_fs_unreserve_ag_blocks(mp);
@@ -1880,6 +1880,7 @@ static int xfs_init_fs_context(
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
+	init_waitqueue_head(&mp->m_inactive_wait);
 	/*
 	 * We don't create the finobt per-ag space reservation until after log
 	 * recovery, so we must set this to true so that an ifree transaction



* Re: [PATCH 01/11] xfs: prevent metadata files from being inactivated
  2021-03-11  3:05 ` [PATCH 01/11] xfs: prevent metadata files from being inactivated Darrick J. Wong
@ 2021-03-11 13:05   ` Christoph Hellwig
  2021-03-22 23:13   ` Dave Chinner
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-11 13:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

Although at some point we really need to sort out the header mess..


* Re: [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks
  2021-03-11  3:05 ` [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks Darrick J. Wong
@ 2021-03-11 13:09   ` Christoph Hellwig
  2021-03-15 18:46   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-11 13:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:05:51PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Refactor the part of _free_eofblocks that decides if it's really going
> to truncate post-EOF blocks into a separate helper function.  The
> upcoming deferred inode inactivation patch requires us to be able to
> decide this prior to actual inactivation.  No functionality changes.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 03/11] xfs: don't reclaim dquots with incore reservations
  2021-03-11  3:05 ` [PATCH 03/11] xfs: don't reclaim dquots with incore reservations Darrick J. Wong
@ 2021-03-15 18:29   ` Christoph Hellwig
  2021-03-22 23:31   ` Dave Chinner
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-15 18:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks
  2021-03-11  3:05 ` [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks Darrick J. Wong
  2021-03-11 13:09   ` Christoph Hellwig
@ 2021-03-15 18:46   ` Christoph Hellwig
  2021-03-18  4:33     ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-15 18:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Going further through the series actually made me go back to this one,
so a few more comments:

>  /*
> + * Decide if this inode have post-EOF blocks.  The caller is responsible
> + * for knowing / caring about the PREALLOC/APPEND flags.

Please spell out the XFS_DIFLAG_ here, as this really confused me.  In
fact even with that it still confuses me, as "caller is responsible"
here really means: only call this if you previously called
xfs_can_free_eofblocks and it returned true.

Which brings me to the structure of this:  I think without much pain
we can ensure xfs_can_free_eofblocks is always called with the iolock,
in which case we really should merge xfs_can_free_eofblocks and this
new helper to avoid the rather confusing fact that we have two similarly
named helpers doing similar but not the same thing.

>  int
> +xfs_has_eofblocks(
> +	struct xfs_inode	*ip,
> +	bool			*has)

I also think the calling convention can be simplified here.  If an
error occurs we obviously do not want to free the eofblocks.  So
instead of returning two values we can just return a single bool.
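
A minimal userspace sketch of the two calling conventions, to make the
suggestion concrete.  struct fake_inode and both function names are
hypothetical and are not the helpers from the patch; the point is only
that a lookup error collapses to "false" in the single-bool form.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an inode whose extent lookup may fail. */
struct fake_inode {
	int lookup_error;	/* 0, or a negative errno from the lookup */
	bool past_eof;		/* blocks exist beyond EOF */
};

/* Two-value convention: error code plus a bool out-parameter. */
static int has_eofblocks_err(const struct fake_inode *ip, bool *has)
{
	*has = false;
	if (ip->lookup_error)
		return ip->lookup_error;
	*has = ip->past_eof;
	return 0;
}

/* Single-bool convention: any error means "nothing to free". */
static bool has_eofblocks(const struct fake_inode *ip)
{
	bool has;

	if (has_eofblocks_err(ip, &has))
		return false;
	return has;
}
```
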


* Re: [PATCH 04/11] xfs: decide if inode needs inactivation
  2021-03-11  3:06 ` [PATCH 04/11] xfs: decide if inode needs inactivation Darrick J. Wong
@ 2021-03-15 18:47   ` Christoph Hellwig
  2021-03-15 19:06     ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-15 18:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:06:02PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a predicate function to decide if an inode needs (deferred)
> inactivation.  Any file that has been unlinked or has speculative
> preallocations either for post-EOF writes or for CoW qualifies.
> This function will also be used by the upcoming deferred inactivation
> patch.

The helper looks good, but I'd just merge it into patch 6, without
that it isn't very helpful.


* Re: [PATCH 05/11] xfs: rename the blockgc workqueue
  2021-03-11  3:06 ` [PATCH 05/11] xfs: rename the blockgc workqueue Darrick J. Wong
@ 2021-03-15 18:49   ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-15 18:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:06:08PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Since we're about to start using the blockgc workqueue to dispose of
> inactivated inodes, strip the "block" prefix from the name; now it's
> merely the general garbage collection (gc) workqueue.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 08/11] xfs: force inode inactivation and retry fs writes when there isn't space
  2021-03-11  3:06 ` [PATCH 08/11] xfs: force inode inactivation and retry fs writes when there isn't space Darrick J. Wong
@ 2021-03-15 18:54   ` Christoph Hellwig
  2021-03-15 19:06     ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-15 18:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:06:25PM -0800, Darrick J. Wong wrote:
> +	error =  xfs_inode_walk(mp, 0, xfs_blockgc_scan_inode, eofb,
>  			XFS_ICI_BLOCKGC_TAG);

Nit: strange double whitespace here.


* Re: [PATCH 10/11] xfs: parallelize inode inactivation
  2021-03-11  3:06 ` [PATCH 10/11] xfs: parallelize inode inactivation Darrick J. Wong
@ 2021-03-15 18:55   ` Christoph Hellwig
  2021-03-15 19:03     ` Darrick J. Wong
  2021-03-23 22:21   ` Dave Chinner
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-15 18:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:06:36PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Split the inode inactivation work into per-AG work items so that we can
> take advantage of parallelization.

Any reason this isn't just done from the beginning?  As-is it just
seems to create a fair amount of churn.


* Re: [PATCH 10/11] xfs: parallelize inode inactivation
  2021-03-15 18:55   ` Christoph Hellwig
@ 2021-03-15 19:03     ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-15 19:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Mar 15, 2021 at 06:55:51PM +0000, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 07:06:36PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Split the inode inactivation work into per-AG work items so that we can
> > take advantage of parallelization.
> 
> Any reason this isn't just done from the beginning?  As-is it just
> seems to create a fair amount of churn.

I felt like the first patch was already too long at 1100 lines.

I don't mind combining them, but with the usual proviso that I don't
want the whole series to stall on reviewers going back and forth on this
point without anyone offering an RVB.

--D

* Re: [PATCH 08/11] xfs: force inode inactivation and retry fs writes when there isn't space
  2021-03-15 18:54   ` Christoph Hellwig
@ 2021-03-15 19:06     ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-15 19:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Mar 15, 2021 at 06:54:53PM +0000, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 07:06:25PM -0800, Darrick J. Wong wrote:
> > +	error =  xfs_inode_walk(mp, 0, xfs_blockgc_scan_inode, eofb,
> >  			XFS_ICI_BLOCKGC_TAG);
> 
> Nit: strange double whitespace here.

Yeah, that'll go away in the next version.  As part of a new small
series to eliminate the indirect calls in xfs_inode_walk when possible,
I figured out that we could get rid of the flags and tag arguments.

--D

* Re: [PATCH 04/11] xfs: decide if inode needs inactivation
  2021-03-15 18:47   ` Christoph Hellwig
@ 2021-03-15 19:06     ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-15 19:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Mar 15, 2021 at 06:47:41PM +0000, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 07:06:02PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add a predicate function to decide if an inode needs (deferred)
> > inactivation.  Any file that has been unlinked or has speculative
> > preallocations either for post-EOF writes or for CoW qualifies.
> > This function will also be used by the upcoming deferred inactivation
> > patch.
> 
> The helper looks good, but I'd just merge it into patch 6; without
> that it isn't very helpful.

Done.

--D

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-11  3:06 ` [PATCH 06/11] xfs: deferred inode inactivation Darrick J. Wong
@ 2021-03-16  7:27   ` Christoph Hellwig
  2021-03-16 15:47     ` Darrick J. Wong
  2021-03-23  1:44   ` Dave Chinner
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-16  7:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Still digesting this.  What trips me off a bit is the huge amount of
duplication vs the inode reclaim mechanism.  Did you look into sharing
more code there and if yes what speaks against that?

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-16  7:27   ` Christoph Hellwig
@ 2021-03-16 15:47     ` Darrick J. Wong
  2021-03-17 15:21       ` Christoph Hellwig
  2021-03-22 23:37       ` Dave Chinner
  0 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-16 15:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Mar 16, 2021 at 07:27:10AM +0000, Christoph Hellwig wrote:
> Still digesting this.  What trips me off a bit is the huge amount of
> duplication vs the inode reclaim mechanism.  Did you look into sharing
> more code there and if yes what speaks against that?

TBH I didn't look /too/ hard because once upon a time[1] Dave was aiming
to replace the inode reclaim tagging and iteration with an lru list walk
so I decided not to entangle the two.

[1] https://lore.kernel.org/linux-xfs/20191009032124.10541-23-david@fromorbit.com/

--D

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-16 15:47     ` Darrick J. Wong
@ 2021-03-17 15:21       ` Christoph Hellwig
  2021-03-17 15:49         ` Darrick J. Wong
  2021-03-22 23:37       ` Dave Chinner
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-17 15:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Tue, Mar 16, 2021 at 08:47:29AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 16, 2021 at 07:27:10AM +0000, Christoph Hellwig wrote:
> > Still digesting this.  What trips me off a bit is the huge amount of
> > duplication vs the inode reclaim mechanism.  Did you look into sharing
> > more code there and if yes what speaks against that?
> 
> TBH I didn't look /too/ hard because once upon a time[1] Dave was aiming
> to replace the inode reclaim tagging and iteration with an lru list walk
> so I decided not to entangle the two.
> 
> [1] https://lore.kernel.org/linux-xfs/20191009032124.10541-23-david@fromorbit.com/

Well, it isn't just the radix tree tagging, but mostly the
infrastructure in iget that seems to duplicate a lot of very delicate
code.

For the actual inactivation run:  why don't we queue up the inodes
for inactivation directly, that is, use the work_struct in the
inode to queue the inode directly to the workqueue and let the
workqueue manage the details?  That also means we can piggy back on
flush_work and flush_workqueue to force one or more entries out.

Again I'm not saying I know this is better, but this is something that
comes to my mind when reading the code.

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-17 15:21       ` Christoph Hellwig
@ 2021-03-17 15:49         ` Darrick J. Wong
  2021-03-22 23:46           ` Dave Chinner
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-17 15:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Mar 17, 2021 at 03:21:25PM +0000, Christoph Hellwig wrote:
> On Tue, Mar 16, 2021 at 08:47:29AM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 16, 2021 at 07:27:10AM +0000, Christoph Hellwig wrote:
> > > Still digesting this.  What trips me off a bit is the huge amount of
> > > duplication vs the inode reclaim mechanism.  Did you look into sharing
> > > more code there and if yes what speaks against that?
> > 
> > TBH I didn't look /too/ hard because once upon a time[1] Dave was aiming
> > to replace the inode reclaim tagging and iteration with an lru list walk
> > so I decided not to entangle the two.
> > 
> > [1] https://lore.kernel.org/linux-xfs/20191009032124.10541-23-david@fromorbit.com/
> 
> Well, it isn't just the radix tree tagging, but mostly the
> infrastructure in iget that seems to duplicate a lot of very delicate
> code.
> 
> For the actual inactivation run:  why don't we queue up the inodes
> for inactivation directly, that is, use the work_struct in the
> inode to queue the inode directly to the workqueue and let the
> workqueue manage the details?  That also means we can piggy back on
> flush_work and flush_workqueue to force one or more entries out.
> 
> Again I'm not saying I know this is better, but this is something that
> comes to my mind when reading the code.

Hmm.  You mean reuse i_ioend_work (which maybe we should just rename to
i_work) and queueing the inodes directly into the workqueue?  I suppose
that would mean we don't even need the radix tree tag + inode walk...

I hadn't thought about reusing i_ioend_work, since this patchset
predates the writeback ioend chaining.  The biggest downside that I can
think of doing it that way is that right after a rm -rf, the unbound gc
workqueue will start hundreds of kworkers to deal with the sudden burst
of queued work, but all those workers will end up fighting each other
for (a) log grant space, and after that (b) the AGI buffer locks, and
meanwhile everything else on the frontend stalls on the log.

The other side benefit I can think of w.r.t. keeping the inactivation
work as a per-AG item is that (at least among AGs) we can walk the
inodes in disk order, which probably results in less seeking (note: I
haven't studied this) and might allow us to free inode cluster buffers
sooner in the rm -rf case.
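
To illustrate the disk-order point, here is a minimal userspace sketch, not
code from the patchset.  It assumes (as in XFS) that the AG number lives in
the high bits of the inode number, so sorting a readdir-ordered batch by
inode number both groups it by AG and puts each AG's inodes in disk order.
DEMO_AGINO_BITS and the demo_* names are made up for the example; the real
shift depends on the filesystem geometry.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: number of low bits holding the AG-relative inode number. */
#define DEMO_AGINO_BITS	32

/* Extract the AG number from an inode number (illustrative layout only). */
uint32_t
demo_ino_to_agno(uint64_t ino)
{
	return (uint32_t)(ino >> DEMO_AGINO_BITS);
}

/* qsort comparator: ascending inode number, without overflow. */
static int
demo_cmp_ino(const void *a, const void *b)
{
	uint64_t	x = *(const uint64_t *)a;
	uint64_t	y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Turn a batch collected in arbitrary (e.g. readdir) order into one that
 * is grouped by AG and ascending within each AG.
 */
void
demo_sort_disk_order(uint64_t *inos, size_t nr)
{
	qsort(inos, nr, sizeof(*inos), demo_cmp_ino);
}
```

A per-AG worker would then only ever see one contiguous run of that sorted
batch.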

<shrug> Thoughts?

--D

* Re: [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks
  2021-03-15 18:46   ` Christoph Hellwig
@ 2021-03-18  4:33     ` Darrick J. Wong
  2021-03-19  1:48       ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-18  4:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Mar 15, 2021 at 06:46:15PM +0000, Christoph Hellwig wrote:
> Going further through the series actually made me go back to this one,
> so a few more comments:
> 
> >  /*
> > + * Decide if this inode have post-EOF blocks.  The caller is responsible
> > + * for knowing / caring about the PREALLOC/APPEND flags.
> 
> Please spell out the XFS_DIFLAG_ here, as this really confused me.  In
> fact even with that it still confuses me, as "caller is responsible"
> here really means: only call this if you previously called
> xfs_can_free_eofblocks and it returned true.

Sorry about that; I'll spell them out in the future.

> Which brings me to the structure of this:  I think without much pain
> we can ensure xfs_can_free_eofblocks is always called with the iolock,
> in which case we really should merge xfs_can_free_eofblocks and this
> new helper to avoid the rather confusing fact that we have two similarly
> named helpers doing similar but not the same thing.

I'll have a look into that tomorrow morning. :)

> >  int
> > +xfs_has_eofblocks(
> > +	struct xfs_inode	*ip,
> > +	bool			*has)
> 
> I also think the calling convention can be simplified here.  If an
> error occurs we obviously do not want to free the eofblocks.  So
> instead of returning two values we can just return a single bool.

Yeah, this area needs some simplification.  Will do.

--D

* Re: [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks
  2021-03-18  4:33     ` Darrick J. Wong
@ 2021-03-19  1:48       ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-19  1:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Mar 17, 2021 at 09:33:29PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 15, 2021 at 06:46:15PM +0000, Christoph Hellwig wrote:
> > Going further through the series actually made me go back to this one,
> > so a few more comments:
> > 
> > >  /*
> > > + * Decide if this inode have post-EOF blocks.  The caller is responsible
> > > + * for knowing / caring about the PREALLOC/APPEND flags.
> > 
> > Please spell out the XFS_DIFLAG_ here, as this really confused me.  In
> > fact even with that it still confuses me, as "caller is responsible"
> > here really means: only call this if you previously called
> > xfs_can_free_eofblocks and it returned true.
> 
> Sorry about that; I'll spell them out in the future.
> 
> > Which brings me to the structure of this:  I think without much pain
> > we can ensure xfs_can_free_eofblocks is always called with the iolock,
> > in which case we really should merge xfs_can_free_eofblocks and this
> > new helper to avoid the rather confusing fact that we have two similarly
> > named helpers doing similar but not the same thing.
> 
> I'll have a look into that tomorrow morning. :)

The only change that was necessary was moving the can_free_eofblocks
call in the blockgc code until after we've taken the IOLOCK.

> > >  int
> > > +xfs_has_eofblocks(
> > > +	struct xfs_inode	*ip,
> > > +	bool			*has)
> > 
> > I also think the calling convention can be simplified here.  If an
> > error occurs we obviously do not want to free the eofblocks.  So
> > instead of returning two values we can just return a single bool.
> 
> Yeah, this area needs some simplification.  Will do.

I moved all the stuff in this function upwards into
xfs_can_free_eofblocks and it seems to work ok.
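
A rough userspace sketch of the two calling conventions under discussion,
for comparison.  Everything here is hypothetical (demo_inode, the simulated
lookup, the field names); it is not the code from the patch, only the shape
of "int plus bool out-parameter" versus "single bool where an error means
don't free".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode; not the real struct xfs_inode. */
struct demo_inode {
	long long	size_blocks;	/* blocks needed to hold the file size */
	long long	mapped_blocks;	/* one past the last mapped block */
	bool		lookup_fails;	/* simulate an extent lookup error */
};

/* Simulated extent lookup: 0 on success, negative errno-style on failure. */
int
demo_lookup_post_eof(const struct demo_inode *ip, bool *found)
{
	if (ip->lookup_fails)
		return -5;	/* stands in for -EIO */
	*found = ip->mapped_blocks > ip->size_blocks;
	return 0;
}

/* Old shape: the caller has to check both the return value and *has. */
int
demo_has_eofblocks(const struct demo_inode *ip, bool *has)
{
	return demo_lookup_post_eof(ip, has);
}

/* Simplified shape: one bool; a lookup error is treated as "nothing to free". */
bool
demo_can_free_eofblocks(const struct demo_inode *ip)
{
	bool	found = false;

	assert(ip != NULL);
	if (demo_lookup_post_eof(ip, &found))
		return false;
	return found;
}
```

With the single-bool form, callers collapse to a plain
"if (demo_can_free_eofblocks(ip))" with no separate error path.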

--D

> 
> --D

* Re: [PATCH 01/11] xfs: prevent metadata files from being inactivated
  2021-03-11  3:05 ` [PATCH 01/11] xfs: prevent metadata files from being inactivated Darrick J. Wong
  2021-03-11 13:05   ` Christoph Hellwig
@ 2021-03-22 23:13   ` Dave Chinner
  1 sibling, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2021-03-22 23:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:05:46PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Files containing metadata (quota records, rt bitmap and summary info)
> are fully managed by the filesystem, which means that all resource
> cleanup must be explicit, not automatic.  This means that they should
> never be subjected to automatic post-eof truncation, nor should they be
> freed automatically even if the link count drops to zero.
> 
> In other words, xfs_inactive() should leave these files alone.  Add the
> necessary predicate functions to make this happen.  This adds a second
> layer of prevention for the kinds of fs corruption that was fixed by
> commit f4c32e87de7d.  If we ever decide to support removing metadata
> files, we should make all those metadata updates explicit.
> 
> Rearrange the order of #includes to fix compiler errors, since
> xfs_mount.h is supposed to be included before xfs_inode.h
> 
> Followup-to: f4c32e87de7d ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 03/11] xfs: don't reclaim dquots with incore reservations
  2021-03-11  3:05 ` [PATCH 03/11] xfs: don't reclaim dquots with incore reservations Darrick J. Wong
  2021-03-15 18:29   ` Christoph Hellwig
@ 2021-03-22 23:31   ` Dave Chinner
  2021-03-23  0:01     ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2021-03-22 23:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:05:57PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> If a dquot has an incore reservation that exceeds the ondisk count, it
> by definition has active incore state and must not be reclaimed.  Up to
> this point every inode with an incore dquot reservation has always
> retained a reference to the dquot so it was never possible for
> xfs_qm_dquot_isolate to be called on a dquot with active state and zero
> refcount, but this will soon change.
> 
> Deferred inode inactivation is about to reorganize how inodes are
> inactivated by shunting all that work to a background workqueue.  In
> order to avoid deadlocks with the quotaoff inode scan and reduce overall
> memory requirements (since inodes can spend a lot of time waiting for
> inactivation), inactive inodes will drop their dquot references while
> they're waiting to be inactivated.
> 
> However, inactive inodes can have delalloc extents in the data fork or
> any extents in the CoW fork.  Either of these contribute to the dquot's
> incore reservation being larger than the resource count (i.e. they're
> the reason the dquot still has active incore state), so we cannot allow
> the dquot to be reclaimed.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
.....
>  static enum lru_status
>  xfs_qm_dquot_isolate(
>  	struct list_head	*item,
> @@ -427,10 +441,15 @@ xfs_qm_dquot_isolate(
>  		goto out_miss_busy;
>  
>  	/*
> -	 * This dquot has acquired a reference in the meantime remove it from
> -	 * the freelist and try again.
> +	 * Either this dquot has incore reservations or it has acquired a
> +	 * reference.  Remove it from the freelist and try again.
> +	 *
> +	 * Inodes tagged for inactivation drop their dquot references to avoid
> +	 * deadlocks with quotaoff.  If these inodes have delalloc reservations
> +	 * in the data fork or any extents in the CoW fork, these contribute
> +	 * to the dquot's incore block reservation exceeding the count.
>  	 */
> -	if (dqp->q_nrefs) {
> +	if (xfs_dquot_has_incore_resv(dqp) || dqp->q_nrefs) {
>  		xfs_dqunlock(dqp);
>  		XFS_STATS_INC(dqp->q_mount, xs_qm_dqwants);
>  

This means we can have dquots with no references that aren't on
the free list and aren't actually referenced by any inode, either.

So if we now shut down the filesystem, what frees these dquots?
Are we relying on xfs_qm_dqpurge_all() to find all these dquots
and xfs_qm_dqpurge() guaranteeing that they are always cleaned
and freed?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-16 15:47     ` Darrick J. Wong
  2021-03-17 15:21       ` Christoph Hellwig
@ 2021-03-22 23:37       ` Dave Chinner
  2021-03-23  0:24         ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2021-03-22 23:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Tue, Mar 16, 2021 at 08:47:29AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 16, 2021 at 07:27:10AM +0000, Christoph Hellwig wrote:
> > Still digesting this.  What trips me off a bit is the huge amount of
> > duplication vs the inode reclaim mechanism.  Did you look into sharing
> > more code there and if yes what speaks against that?
> 
> TBH I didn't look /too/ hard because once upon a time[1] Dave was aiming
> to replace the inode reclaim tagging and iteration with an lru list walk
> so I decided not to entangle the two.
> 
> [1] https://lore.kernel.org/linux-xfs/20191009032124.10541-23-david@fromorbit.com/

I prototyped that and discarded it - it made inode reclaim much,
much slower because it introduced delays (lock contention) adding
new inodes to the reclaim list while a reclaim isolation walk was in
progress.

The radix tree based mechanism we have right now is very efficient
as only the inodes being marked for reclaim take the radix tree
lock and hence there is minimal contention for it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-17 15:49         ` Darrick J. Wong
@ 2021-03-22 23:46           ` Dave Chinner
  0 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2021-03-22 23:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Wed, Mar 17, 2021 at 08:49:04AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 17, 2021 at 03:21:25PM +0000, Christoph Hellwig wrote:
> > On Tue, Mar 16, 2021 at 08:47:29AM -0700, Darrick J. Wong wrote:
> > > On Tue, Mar 16, 2021 at 07:27:10AM +0000, Christoph Hellwig wrote:
> > > > Still digesting this.  What trips me off a bit is the huge amount of
> > > > duplication vs the inode reclaim mechanism.  Did you look into sharing
> > > > more code there and if yes what speaks against that?
> > > 
> > > TBH I didn't look /too/ hard because once upon a time[1] Dave was aiming
> > > to replace the inode reclaim tagging and iteration with an lru list walk
> > > so I decided not to entangle the two.
> > > 
> > > [1] https://lore.kernel.org/linux-xfs/20191009032124.10541-23-david@fromorbit.com/
> > 
> > Well, it isn't just the radix tree tagging, but mostly the
> > infrastructure in iget that seems to duplicate a lot of very delicate
> > code.
> > 
> > For the actual inactivation run:  why don't we queue up the inodes
> > for inactivation directly, that is, use the work_struct in the
> > inode to queue the inode directly to the workqueue and let the
> > workqueue manage the details?  That also means we can piggy back on
> > flush_work and flush_workqueue to force one or more entries out.
> > 
> > Again I'm not saying I know this is better, but this is something that
> > comes to my mind when reading the code.
> 
> Hmm.  You mean reuse i_ioend_work (which maybe we should just rename to
> i_work) and queueing the inodes directly into the workqueue?  I suppose
> that would mean we don't even need the radix tree tag + inode walk...
> 
> I hadn't thought about reusing i_ioend_work, since this patchset
> predates the writeback ioend chaining.  The biggest downside that I can
> think of doing it that way is that right after a rm -rf, the unbound gc
> workqueue will start hundreds of kworkers to deal with the sudden burst
> of queued work, but all those workers will end up fighting each other
> for (a) log grant space, and after that (b) the AGI buffer locks, and
> meanwhile everything else on the frontend stalls on the log.

yeah, this is not a good idea. The deferred inactivation needs to
limit concurrency to a single work per AG at most because otherwise
it will just consume all the reservation space serialising on the
AGI locks. Even so, it can still starve the front end when they
compete for AGI and AGF locks. Hence the background deferral is
going to have to be very careful about how it obtains and blocks on
locks....
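
As a rough illustration of "a single work per AG at most", here is a small
userspace sketch.  It mimics the way the kernel's queue_work() returns false
when the work item is already queued, so one work item per AG naturally
caps concurrency at one worker per AG.  The struct and function names are
invented for the example, and a C11 atomic_flag stands in for the
workqueue's pending bit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-AG state; not the real struct xfs_perag. */
struct demo_perag {
	atomic_flag	work_pending;	/* set while this AG's work is queued */
};

/*
 * Ask for inactivation work on this AG.  Returns true if this call queued
 * the work, false if a work item for the AG was already pending.
 */
bool
demo_inodegc_queue_ag(struct demo_perag *pag)
{
	return !atomic_flag_test_and_set(&pag->work_pending);
}

/*
 * Called by the worker when it starts running, so that requests arriving
 * during the run can queue one more pass.
 */
void
demo_inodegc_work_start(struct demo_perag *pag)
{
	atomic_flag_clear(&pag->work_pending);
}
```

However many inodes are tagged in a burst, only the first request per AG
results in a queued item.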

(I haven't got that far into the patchset yet)

> The other side benefit I can think of w.r.t. keeping the inactivation
> work as a per-AG item is that (at least among AGs) we can walk the
> inodes in disk order, which probably results in less seeking (note: I
> haven't studied this) and might allow us to free inode cluster buffers
> sooner in the rm -rf case.

That is very useful because it allows the CIL to cancel the space
used modifying the inodes and the cluster buffer during the unlink,
allowing it to aggregate many more unlinks into the same checkpoint
and avoid metadata writeback part way through unlink operations. i.e
it is very efficient in terms of journal space consumption and hence
journal IO bandwidth.  (This is how we get multiple hundreds of
thousands of items into a single 32MB journal checkpoint......)

Cheers,

Dave.


-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 03/11] xfs: don't reclaim dquots with incore reservations
  2021-03-22 23:31   ` Dave Chinner
@ 2021-03-23  0:01     ` Darrick J. Wong
  2021-03-23  1:48       ` Dave Chinner
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-23  0:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 23, 2021 at 10:31:39AM +1100, Dave Chinner wrote:
> On Wed, Mar 10, 2021 at 07:05:57PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > If a dquot has an incore reservation that exceeds the ondisk count, it
> > by definition has active incore state and must not be reclaimed.  Up to
> > this point every inode with an incore dquot reservation has always
> > retained a reference to the dquot so it was never possible for
> > xfs_qm_dquot_isolate to be called on a dquot with active state and zero
> > refcount, but this will soon change.
> > 
> > Deferred inode inactivation is about to reorganize how inodes are
> > inactivated by shunting all that work to a background workqueue.  In
> > order to avoid deadlocks with the quotaoff inode scan and reduce overall
> > memory requirements (since inodes can spend a lot of time waiting for
> > inactivation), inactive inodes will drop their dquot references while
> > they're waiting to be inactivated.
> > 
> > However, inactive inodes can have delalloc extents in the data fork or
> > any extents in the CoW fork.  Either of these contribute to the dquot's
> > incore reservation being larger than the resource count (i.e. they're
> > the reason the dquot still has active incore state), so we cannot allow
> > the dquot to be reclaimed.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> .....
> >  static enum lru_status
> >  xfs_qm_dquot_isolate(
> >  	struct list_head	*item,
> > @@ -427,10 +441,15 @@ xfs_qm_dquot_isolate(
> >  		goto out_miss_busy;
> >  
> >  	/*
> > -	 * This dquot has acquired a reference in the meantime remove it from
> > -	 * the freelist and try again.
> > +	 * Either this dquot has incore reservations or it has acquired a
> > +	 * reference.  Remove it from the freelist and try again.
> > +	 *
> > +	 * Inodes tagged for inactivation drop their dquot references to avoid
> > +	 * deadlocks with quotaoff.  If these inodes have delalloc reservations
> > +	 * in the data fork or any extents in the CoW fork, these contribute
> > +	 * to the dquot's incore block reservation exceeding the count.
> >  	 */
> > -	if (dqp->q_nrefs) {
> > +	if (xfs_dquot_has_incore_resv(dqp) || dqp->q_nrefs) {
> >  		xfs_dqunlock(dqp);
> >  		XFS_STATS_INC(dqp->q_mount, xs_qm_dqwants);
> >  
> 
> This means we can have dquots with no references that aren't on
> the free list and aren't actually referenced by any inode, either.
> 
> So if we now shut down the filesystem, what frees these dquots?
> Are we relying on xfs_qm_dqpurge_all() to find all these dquots
> and xfs_qm_dqpurge() guaranteeing that they are always cleaned
> and freed?

Yes.  Want me to add that to the comment?
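
For readers following along, a minimal userspace sketch of the check being
discussed.  The helper body is not shown in the quoted hunk, so this is an
assumption built only from the commit message ("an incore reservation that
exceeds the ondisk count"); the struct layout and demo_* names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-resource counters: incore reservation vs. ondisk count. */
struct demo_res {
	uint64_t	reserved;
	uint64_t	count;
};

/* Hypothetical stand-in for a dquot; not the real struct xfs_dquot. */
struct demo_dquot {
	struct demo_res	blk;	/* data blocks */
	struct demo_res	ino;	/* inodes */
	struct demo_res	rtb;	/* realtime blocks */
	unsigned int	nrefs;	/* active references */
};

/* True if any resource has more reserved incore than is counted ondisk. */
bool
demo_dquot_has_incore_resv(const struct demo_dquot *dqp)
{
	return dqp->blk.reserved > dqp->blk.count ||
	       dqp->ino.reserved > dqp->ino.count ||
	       dqp->rtb.reserved > dqp->rtb.count;
}

/* The isolate decision: reclaim only with no reservations and no refs. */
bool
demo_dquot_reclaimable(const struct demo_dquot *dqp)
{
	return !demo_dquot_has_incore_resv(dqp) && dqp->nrefs == 0;
}
```

A dquot whose only user is an inode waiting for inactivation with delalloc
blocks would have nrefs == 0 but blk.reserved > blk.count, and so would be
skipped by reclaim.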

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-22 23:37       ` Dave Chinner
@ 2021-03-23  0:24         ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-23  0:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs

On Tue, Mar 23, 2021 at 10:37:21AM +1100, Dave Chinner wrote:
> On Tue, Mar 16, 2021 at 08:47:29AM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 16, 2021 at 07:27:10AM +0000, Christoph Hellwig wrote:
> > > Still digesting this.  What trips me off a bit is the huge amount of
> > > duplication vs the inode reclaim mechanism.  Did you look into sharing
> > > more code there and if yes what speaks against that?
> > 
> > TBH I didn't look /too/ hard because once upon a time[1] Dave was aiming
> > to replace the inode reclaim tagging and iteration with an lru list walk
> > so I decided not to entangle the two.
> > 
> > [1] https://lore.kernel.org/linux-xfs/20191009032124.10541-23-david@fromorbit.com/
> 
> I prototyped that and discarded it - it made inode reclaim much,
> much slower because it introduced delays (lock contention) adding
> new inodes to the reclaim list while a reclaim isolation walk was in
> progress.
> 
> The radix tree based mechanism we have right now is very efficient
> as only the inodes being marked for reclaim take the radix tree
> lock and hence there is minimal contention for it...

Ahah, that's what happened to that patchset.  Well in that case, since
xfs_reclaim_inodes* is going to stick around, I think it makes more
sense to refactor xfs_inode_walk_ag to handle XFS_ICI_RECLAIM_TAG, and
then xfs_reclaim_inodes_ag can go away entirely.

That said, xfs_reclaim_inodes_ag does have some warts (like updating the
per-ag reclaim cursor and decrementing nr_to_scan) that would add
clutter.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-11  3:06 ` [PATCH 06/11] xfs: deferred inode inactivation Darrick J. Wong
  2021-03-16  7:27   ` Christoph Hellwig
@ 2021-03-23  1:44   ` Dave Chinner
  2021-03-23  4:00     ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2021-03-23  1:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:06:13PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> defer the inactivation phase to a separate workqueue.  With this we
> avoid blocking memory reclaim on filesystem metadata updates that are
> necessary to free an in-core inode, such as post-eof block freeing, COW
> staging extent freeing, and truncating and freeing unlinked inodes.  Now
> that work is deferred to a workqueue where we can do the freeing in
> batches.
> 
> We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
> The first flag helps our worker find inodes needing inactivation, and
> the second flag marks inodes that are in the process of being
> inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
> 
> Unfortunately, deferring the inactivation has one huge downside --
> eventual consistency.  Since all the freeing is deferred to a worker
> thread, one can rm a file but the space doesn't come back immediately.
> This can cause some odd side effects with quota accounting and statfs,
> so we also force inactivation scans in order to maintain the existing
> behaviors, at least outwardly.
> 
> For this patch we'll set the delay to zero to mimic the old timing as
> much as possible; in the next patch we'll play with different delay
> settings.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
....
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index a2a407039227..3a3baf56198b 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -19,6 +19,8 @@
>  #include "xfs_log.h"
>  #include "xfs_ag.h"
>  #include "xfs_ag_resv.h"
> +#include "xfs_inode.h"
> +#include "xfs_icache.h"
>  
>  /*
>   * growfs operations
> @@ -290,6 +292,13 @@ xfs_fs_counts(
>  	xfs_mount_t		*mp,
>  	xfs_fsop_counts_t	*cnt)
>  {
> +	/*
> +	 * Process all the queued file and speculative preallocation cleanup so
> +	 * that the counter values we report here do not incorporate any
> +	 * resources that were previously deleted.
> +	 */
> +	xfs_inodegc_force(mp);

xfs_fs_counts() is supposed to be a quick, non-blocking summary of
the state - it can never supply userspace with accurate values
because they are wrong even before the ioctl returns to userspace.
Hence we do not attempt to make them correct, just use a fast, point
in time sample of the current counter values.

So this seems like an unnecessarily heavyweight operation
to add to this function....

Also, I don't like the word "force" in functions like this: force it
to do what, exactly? If you want a queue flush, then
xfs_inodegc_flush() matches with how flush_workqueue() works...

>  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
>  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
>  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index e6a62f765422..1b7652af5ee5 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -195,6 +195,18 @@ xfs_perag_clear_reclaim_tag(
>  	trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
>  }
>  
> +static void
> +__xfs_inode_set_reclaim_tag(
> +	struct xfs_perag	*pag,
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +
> +	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
> +			   XFS_ICI_RECLAIM_TAG);
> +	xfs_perag_set_reclaim_tag(pag);
> +	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> +}
>  
>  /*
>   * We set the inode flag atomically with the radix tree tag.
> @@ -212,10 +224,7 @@ xfs_inode_set_reclaim_tag(
>  	spin_lock(&pag->pag_ici_lock);
>  	spin_lock(&ip->i_flags_lock);
>  
> -	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			   XFS_ICI_RECLAIM_TAG);
> -	xfs_perag_set_reclaim_tag(pag);
> -	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> +	__xfs_inode_set_reclaim_tag(pag, ip);
>  
>  	spin_unlock(&ip->i_flags_lock);
>  	spin_unlock(&pag->pag_ici_lock);

First thought: rename xfs_inode_set_reclaim_tag() to
xfs_inode_set_reclaim_tag_locked(), leave the guts as
xfs_inode_set_reclaim_tag().

> @@ -233,6 +242,94 @@ xfs_inode_clear_reclaim_tag(
>  	xfs_perag_clear_reclaim_tag(pag);
>  }
>  
> +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> +static void
> +xfs_inodegc_queue(
> +	struct xfs_mount        *mp)
> +{
> +	rcu_read_lock();
> +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG))
> +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
> +				2 * HZ);
> +	rcu_read_unlock();
> +}

Why two seconds and not something referenced against the inode
reclaim/sync period?

> +/* Remember that an AG has one more inode to inactivate. */
> +static void
> +xfs_perag_set_inactive_tag(
> +	struct xfs_perag	*pag)
> +{
> +	struct xfs_mount	*mp = pag->pag_mount;
> +
> +	lockdep_assert_held(&pag->pag_ici_lock);
> +	if (pag->pag_ici_inactive++)
> +		return;
> +
> +	/* propagate the inactive tag up into the perag radix tree */
> +	spin_lock(&mp->m_perag_lock);
> +	radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno,
> +			   XFS_ICI_INACTIVE_TAG);
> +	spin_unlock(&mp->m_perag_lock);
> +
> +	/* schedule periodic background inode inactivation */
> +	xfs_inodegc_queue(mp);
> +
> +	trace_xfs_perag_set_inactive(mp, pag->pag_agno, -1, _RET_IP_);
> +}
> +
> +/* Set this inode's inactive tag and set the per-AG tag. */
> +void
> +xfs_inode_set_inactive_tag(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_perag	*pag;
> +
> +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> +	spin_lock(&pag->pag_ici_lock);
> +	spin_lock(&ip->i_flags_lock);
> +
> +	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
> +				   XFS_ICI_INACTIVE_TAG);
> +	xfs_perag_set_inactive_tag(pag);
> +	__xfs_iflags_set(ip, XFS_NEED_INACTIVE);
> +
> +	spin_unlock(&ip->i_flags_lock);
> +	spin_unlock(&pag->pag_ici_lock);
> +	xfs_perag_put(pag);
> +}
> +
> +/* Remember that an AG has one less inode to inactivate. */
> +static void
> +xfs_perag_clear_inactive_tag(
> +	struct xfs_perag	*pag)
> +{
> +	struct xfs_mount	*mp = pag->pag_mount;
> +
> +	lockdep_assert_held(&pag->pag_ici_lock);
> +	if (--pag->pag_ici_inactive)
> +		return;
> +
> +	/* clear the inactive tag from the perag radix tree */
> +	spin_lock(&mp->m_perag_lock);
> +	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno,
> +			     XFS_ICI_INACTIVE_TAG);
> +	spin_unlock(&mp->m_perag_lock);
> +	trace_xfs_perag_clear_inactive(mp, pag->pag_agno, -1, _RET_IP_);
> +}
> +
> +/* Clear this inode's inactive tag and try to clear the AG's. */
> +STATIC void

static

> +xfs_inode_clear_inactive_tag(
> +	struct xfs_perag	*pag,
> +	xfs_ino_t		ino)
> +{
> +	radix_tree_tag_clear(&pag->pag_ici_root,
> +			     XFS_INO_TO_AGINO(pag->pag_mount, ino),
> +			     XFS_ICI_INACTIVE_TAG);
> +	xfs_perag_clear_inactive_tag(pag);
> +}

These are just straight copies of the reclaim tag code. Do you have
a plan for factoring these into a single implementation to clean
this up? Something like this:

static void
xfs_inode_clear_tag(
	struct xfs_perag	*pag,
	xfs_ino_t		ino,
	int			tag)
{
	struct xfs_mount	*mp = pag->pag_mount;

	lockdep_assert_held(&pag->pag_ici_lock);
	radix_tree_tag_clear(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ino),
				tag);
	switch (tag) {
	case XFS_ICI_INACTIVE_TAG:
		if (--pag->pag_ici_inactive)
			return;
		break;
	case XFS_ICI_RECLAIM_TAG:
		if (--pag->pag_ici_reclaim)
			return;
		break;
	default:
		ASSERT(0);
		return;
	}

	spin_lock(&mp->m_perag_lock);
	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno, tag);
	spin_unlock(&mp->m_perag_lock);
}

As a followup patch? The set tag case looks similarly easy to make
generic...
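
For illustration, here is a rough userspace model of what a generic set/clear
pair could look like. It is only a sketch: the struct and function names are
made up, the tag numbers are placeholders, the per-fs radix tree is replaced
by a 64-bit mask (so agno must be below 64), and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tag numbers; the real XFS_ICI_* values may differ. */
enum { ICI_RECLAIM_TAG = 0, ICI_INACTIVE_TAG = 1, ICI_NR_TAGS = 2 };

/* Toy stand-in for the mount: one bitmask of tagged AGs per tag. */
struct mount_model {
	uint64_t		perag_tags[ICI_NR_TAGS];
};

/* Toy stand-in for the perag: one tagged-inode counter per tag. */
struct perag_model {
	struct mount_model	*mount;
	unsigned int		agno;
	unsigned int		tagged[ICI_NR_TAGS];
};

/*
 * Count one more tagged inode in this AG.  Returns true only for the
 * first one, which is when the caller would schedule background work.
 */
static bool
perag_set_tag(
	struct perag_model	*pag,
	int			tag)
{
	assert(tag >= 0 && tag < ICI_NR_TAGS);
	if (pag->tagged[tag]++)
		return false;

	/* first tagged inode: propagate the tag up to the per-fs level */
	pag->mount->perag_tags[tag] |= UINT64_C(1) << pag->agno;
	return true;
}

/*
 * Count one less tagged inode in this AG.  Returns true only when the
 * last one went away and the per-fs tag was cleared.
 */
static bool
perag_clear_tag(
	struct perag_model	*pag,
	int			tag)
{
	assert(tag >= 0 && tag < ICI_NR_TAGS);
	assert(pag->tagged[tag] > 0);
	if (--pag->tagged[tag])
		return false;

	pag->mount->perag_tags[tag] &= ~(UINT64_C(1) << pag->agno);
	return true;
}
```

Indexing the counters by tag removes the switch statement entirely, at the
cost of turning pag_ici_reclaim and pag_ici_inactive into an array.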

> +
>  static void
>  xfs_inew_wait(
>  	struct xfs_inode	*ip)
> @@ -298,6 +395,13 @@ xfs_iget_check_free_state(
>  	struct xfs_inode	*ip,
>  	int			flags)
>  {
> +	/*
> +	 * Unlinked inodes awaiting inactivation must not be reused until we
> +	 * have a chance to clear the on-disk metadata.
> +	 */
> +	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
> +		return -ENOENT;
> +
>  	if (flags & XFS_IGET_CREATE) {
>  		/* should be a free inode */
>  		if (VFS_I(ip)->i_mode != 0) {

How do we get here with an XFS_NEED_INACTIVE inode?
xfs_iget_check_free_state() is only called from the cache miss path,
but we should never get here with a cached inode that is awaiting
inactivation...

> @@ -323,6 +427,67 @@ xfs_iget_check_free_state(
>  	return 0;
>  }
>  
> +/*
> + * We've torn down the VFS part of this NEED_INACTIVE inode, so we need to get
> + * it back into working state.
> + */
> +static int
> +xfs_iget_inactive(
> +	struct xfs_perag	*pag,
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct inode		*inode = VFS_I(ip);
> +	int			error;
> +
> +	error = xfs_reinit_inode(mp, inode);
> +	if (error) {
> +		bool wake;
> +		/*
> +		 * Re-initializing the inode failed, and we are in deep
> +		 * trouble.  Try to re-add it to the inactive list.
> +		 */
> +		rcu_read_lock();
> +		spin_lock(&ip->i_flags_lock);
> +		wake = !!__xfs_iflags_test(ip, XFS_INEW);
> +		ip->i_flags &= ~(XFS_INEW | XFS_INACTIVATING);
> +		if (wake)
> +			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
> +		ASSERT(ip->i_flags & XFS_NEED_INACTIVE);
> +		trace_xfs_iget_inactive_fail(ip);
> +		spin_unlock(&ip->i_flags_lock);
> +		rcu_read_unlock();
> +		return error;
> +	}
> +
> +	spin_lock(&pag->pag_ici_lock);
> +	spin_lock(&ip->i_flags_lock);
> +
> +	/*
> +	 * Clear the per-lifetime state in the inode as we are now effectively
> +	 * a new inode and need to return to the initial state before reuse
> +	 * occurs.
> +	 */
> +	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
> +	ip->i_flags |= XFS_INEW;
> +	xfs_inode_clear_inactive_tag(pag, ip->i_ino);
> +	inode->i_state = I_NEW;
> +	ip->i_sick = 0;
> +	ip->i_checked = 0;
> +
> +	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> +	init_rwsem(&inode->i_rwsem);
> +
> +	spin_unlock(&ip->i_flags_lock);
> +	spin_unlock(&pag->pag_ici_lock);
> +
> +	/*
> +	 * Reattach dquots since we might have removed them when we put this
> +	 * inode on the inactivation list.
> +	 */
> +	return xfs_qm_dqattach(ip);
> +}

Ah, we don't actually perform any of the inactivation stuff here, so
we could be returning an unlinked inode that hasn't had its data or
attribute forks truncated away at this point. That seems... wrong.

Also, this is largely a copy/paste of the XFS_IRECLAIMABLE reuse
code path...

.....

> @@ -713,6 +904,43 @@ xfs_icache_inode_is_allocated(
>  	return 0;
>  }
>  
> +/*
> + * Grab the inode for inactivation exclusively.
> + * Return true if we grabbed it.
> + */
> +static bool
> +xfs_inactive_grab(
> +	struct xfs_inode	*ip)
> +{
> +	ASSERT(rcu_read_lock_held());
> +
> +	/* quick check for stale RCU freed inode */
> +	if (!ip->i_ino)
> +		return false;
> +
> +	/*
> +	 * The radix tree lock here protects a thread in xfs_iget from racing
> +	 * with us starting reclaim on the inode.
> +	 *
> +	 * Due to RCU lookup, we may find inodes that have been freed and only
> +	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
> +	 * aren't candidates for reclaim at all, so we must check the
> +	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> +	 * Obviously if XFS_NEED_INACTIVE isn't set then we ignore this inode.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	if (!(ip->i_flags & XFS_NEED_INACTIVE) ||
> +	    (ip->i_flags & XFS_INACTIVATING)) {
> +		/* not a inactivation candidate. */
> +		spin_unlock(&ip->i_flags_lock);
> +		return false;
> +	}
> +
> +	ip->i_flags |= XFS_INACTIVATING;
> +	spin_unlock(&ip->i_flags_lock);
> +	return true;
> +}
> +
>  /*
>   * The inode lookup is done in batches to keep the amount of lock traffic and
>   * radix tree lookups to a minimum. The batch size is a trade off between
> @@ -736,6 +964,9 @@ xfs_inode_walk_ag_grab(
>  
>  	ASSERT(rcu_read_lock_held());
>  
> +	if (flags & XFS_INODE_WALK_INACTIVE)
> +		return xfs_inactive_grab(ip);
> +

Hmmm. This doesn't actually grab the inode. It's an unreferenced
inode walk, in a function that assumes that the grab() call returns
a referenced inode. Why isn't this using the inode reclaim walk
which is intended to walk unreferenced inodes?

>  	/* Check for stale RCU freed inode */
>  	spin_lock(&ip->i_flags_lock);
>  	if (!ip->i_ino)
> @@ -743,7 +974,8 @@ xfs_inode_walk_ag_grab(
>  
>  	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
>  	if ((!newinos && __xfs_iflags_test(ip, XFS_INEW)) ||
> -	    __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM))
> +	    __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM |
> +				  XFS_NEED_INACTIVE | XFS_INACTIVATING))

Comment needs updating. Also need a mask define here...
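
A minimal sketch of the kind of mask define meant here, as a standalone
userspace model. The M_* names and bit values are placeholders rather than
the real XFS_I* definitions, and walk_should_skip() is a hypothetical helper
that mirrors the test in the hunk above.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values; the real XFS inode flag bits differ. */
#define M_INEW			(1u << 0)
#define M_IRECLAIM		(1u << 1)
#define M_IRECLAIMABLE		(1u << 2)
#define M_NEED_INACTIVE		(1u << 3)
#define M_INACTIVATING		(1u << 4)

/*
 * Inodes in any of these states are owned by reclaim or by background
 * inactivation, so a referenced inode walk must leave them alone.
 */
#define M_WALK_SKIP_FLAGS \
	(M_IRECLAIMABLE | M_IRECLAIM | M_NEED_INACTIVE | M_INACTIVATING)

/* Return true if a referenced walk should skip an inode with these flags. */
static bool
walk_should_skip(
	unsigned int	flags,
	bool		newinos)
{
	/* new inodes are skipped unless the caller asked for them */
	if (!newinos && (flags & M_INEW))
		return true;
	return (flags & M_WALK_SKIP_FLAGS) != 0;
}
```

With a single mask, the comment above the test only has to describe the
mask once, and any new state gets added in one place.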

>  		goto out_unlock_noent;
>  	spin_unlock(&ip->i_flags_lock);
>  
> @@ -848,7 +1080,8 @@ xfs_inode_walk_ag(
>  			    xfs_iflags_test(batch[i], XFS_INEW))
>  				xfs_inew_wait(batch[i]);
>  			error = execute(batch[i], args);
> -			xfs_irele(batch[i]);
> +			if (!(iter_flags & XFS_INODE_WALK_INACTIVE))
> +				xfs_irele(batch[i]);
>  			if (error == -EAGAIN) {
>  				skipped++;
>  				continue;

Hmmmm.

> +
> +/*
> + * Deferred Inode Inactivation
> + * ===========================
> + *
> + * Sometimes, inodes need to have work done on them once the last program has
> + * closed the file.  Typically this means cleaning out any leftover post-eof or
> + * CoW staging blocks for linked files.  For inodes that have been totally
> + * unlinked, this means unmapping data/attr/cow blocks, removing the inode
> + * from the unlinked buckets, and marking it free in the inobt and inode table.
> + *
> + * This process can generate many metadata updates, which shows up as close()
> + * and unlink() calls that take a long time.  We defer all that work to a
> + * per-AG workqueue which means that we can batch a lot of work and do it in
> + * inode order for better performance.  Furthermore, we can control the
> + * workqueue, which means that we can avoid doing inactivation work at a bad
> + * time, such as when the fs is frozen.
> + *
> + * Deferred inactivation introduces new inode flag states (NEED_INACTIVE and
> + * INACTIVATING) and adds a new INACTIVE radix tree tag for fast access.  We
> + * maintain separate perag counters for both types, and move counts as inodes
> + * wander the state machine, which now works as follows:
> + *
> + * If the inode needs inactivation, we:
> + *   - Set the NEED_INACTIVE inode flag
> + *   - Increment the per-AG inactive count
> + *   - Set the INACTIVE tag in the per-AG inode tree
> + *   - Set the INACTIVE tag in the per-fs AG tree
> + *   - Schedule background inode inactivation
> + *
> + * If the inode does not need inactivation, we:
> + *   - Set the RECLAIMABLE inode flag
> + *   - Increment the per-AG reclaim count
> + *   - Set the RECLAIM tag in the per-AG inode tree
> + *   - Set the RECLAIM tag in the per-fs AG tree
> + *   - Schedule background inode reclamation
> + *
> + * When it is time for background inode inactivation, we:
> + *   - Set the INACTIVATING inode flag
> + *   - Make all the on-disk updates
> + *   - Clear both INACTIVATING and NEED_INACTIVE inode flags
> + *   - Decrement the per-AG inactive count
> + *   - Clear the INACTIVE tag in the per-AG inode tree
> + *   - Clear the INACTIVE tag in the per-fs AG tree if that was the last one
> + *   - Kick the inode into reclamation per the previous paragraph.

I suspect this needs to set the IRECLAIMABLE flag before it clears
the INACTIVE flags so that inode_ag_walk() doesn't find it in a
transient state. Hmmm - that may be why you factored the reclaim
flag setting functions?
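
To make the ordering concern concrete, here is a small userspace sketch. It
assumes an observer that can see the flag word between the two updates; the
M_* bit values and both function names are made up for the example.

```c
#include <assert.h>

/* Placeholder bit values; the real XFS inode flag bits differ. */
#define M_IRECLAIMABLE	(1u << 0)
#define M_NEED_INACTIVE	(1u << 1)
#define M_INACTIVATING	(1u << 2)
#define M_BUSY_FLAGS	(M_IRECLAIMABLE | M_NEED_INACTIVE | M_INACTIVATING)

/*
 * Set RECLAIMABLE first, then drop the inactivation flags.  *midpoint
 * records what an observer would see between the two updates.
 */
static unsigned int
set_then_clear(
	unsigned int	flags,
	unsigned int	*midpoint)
{
	flags |= M_IRECLAIMABLE;
	*midpoint = flags;
	flags &= ~(M_NEED_INACTIVE | M_INACTIVATING);
	return flags;
}

/* The reverse order: the inode briefly carries none of the busy flags. */
static unsigned int
clear_then_set(
	unsigned int	flags,
	unsigned int	*midpoint)
{
	flags &= ~(M_NEED_INACTIVE | M_INACTIVATING);
	*midpoint = flags;
	flags |= M_IRECLAIMABLE;
	return flags;
}
```

Both orders end in the same state; only the first keeps the inode covered
by M_BUSY_FLAGS at every step.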

> + *
> + * When it is time for background inode reclamation, we:
> + *   - Set the IRECLAIM inode flag
> + *   - Detach all the resources and remove the inode from the per-AG inode tree
> + *   - Clear both IRECLAIM and RECLAIMABLE inode flags
> + *   - Decrement the per-AG reclaim count
> + *   - Clear the RECLAIM tag from the per-AG inode tree
> + *   - Clear the RECLAIM tag from the per-fs AG tree if there are no more
> + *     inodes waiting for reclamation or inactivation
> + *
> + * Note that xfs_inodegc_queue and xfs_inactive_grab are further up in
> + * the source code so that we avoid static function declarations.
> + */
> +
> +/* Inactivate this inode. */
> +STATIC int

static

> +xfs_inactive_inode(
> +	struct xfs_inode	*ip,
> +	void			*args)
> +{
> +	struct xfs_eofblocks	*eofb = args;
> +	struct xfs_perag	*pag;
> +
> +	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);

What condition is this trying to catch? It's something to do with
freeze, but you haven't documented what happens to inodes with
pending inactivation when a freeze is started....

> +
> +	/*
> +	 * Not a match for our passed in scan filter?  Put it back on the shelf
> +	 * and move on.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	if (!xfs_inode_matches_eofb(ip, eofb)) {
> +		ip->i_flags &= ~XFS_INACTIVATING;
> +		spin_unlock(&ip->i_flags_lock);
> +		return 0;
> +	}
> +	spin_unlock(&ip->i_flags_lock);

IDGI. What do EOF blocks have to do with running inode inactivation
on this inode?

> +
> +	trace_xfs_inode_inactivating(ip);
> +
> +	xfs_inactive(ip);
> +	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> +
> +	/*
> +	 * Clear the inactive state flags and schedule a reclaim run once
> +	 * we're done with the inactivations.  We must ensure that the inode
> +	 * smoothly transitions from inactivating to reclaimable so that iget
> +	 * cannot see either data structure midway through the transition.
> +	 */
> +	pag = xfs_perag_get(ip->i_mount,
> +			XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino));
> +	spin_lock(&pag->pag_ici_lock);
> +	spin_lock(&ip->i_flags_lock);
> +
> +	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> +	xfs_inode_clear_inactive_tag(pag, ip->i_ino);
> +
> +	__xfs_inode_set_reclaim_tag(pag, ip);
> +
> +	spin_unlock(&ip->i_flags_lock);
> +	spin_unlock(&pag->pag_ici_lock);
> +	xfs_perag_put(pag);
> +
> +	return 0;
> +}

/me wonders if we really need a separate radix tree tag for
inactivation.

> +/*
> + * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
> + * corrupted, we still need to clear the INACTIVE iflag so that we can move
> + * on to reclaiming the inode.
> + */
> +static int
> +xfs_inodegc_free_space(
> +	struct xfs_mount	*mp,
> +	struct xfs_eofblocks	*eofb)
> +{
> +	return xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
> +			xfs_inactive_inode, eofb, XFS_ICI_INACTIVE_TAG);
> +}

This could call the unreferenced reclaim AG walker now that all the reclaim
throttling stuff has been removed from it...

> +/* Try to get inode inactivation moving. */
> +void
> +xfs_inodegc_worker(
> +	struct work_struct	*work)
> +{
> +	struct xfs_mount	*mp = container_of(to_delayed_work(work),
> +					struct xfs_mount, m_inodegc_work);
> +	int			error;
> +
> +	/*
> +	 * We want to skip inode inactivation while the filesystem is frozen
> +	 * because we don't want the inactivation thread to block while taking
> +	 * sb_intwrite.  Therefore, we try to take sb_write for the duration
> +	 * of the inactive scan -- a freeze attempt will block until we're
> +	 * done here, and if the fs is past stage 1 freeze we'll bounce out
> +	 * until things unfreeze.  If the fs goes down while frozen we'll
> +	 * still have log recovery to clean up after us.
> +	 */
> +	if (!sb_start_write_trylock(mp->m_super))
> +		return;
> +
> +	error = xfs_inodegc_free_space(mp, NULL);
> +	if (error && error != -EAGAIN)
> +		xfs_err(mp, "inode inactivation failed, error %d", error);
> +
> +	sb_end_write(mp->m_super);
> +	xfs_inodegc_queue(mp);

Ok....

The way we've done this with other workqueue based background work
is that the freeze flushes and stops the workqueue, then restarts it
once the filesystem is thawed. This removes any need for the
background work to run the freeze gauntlet....

> +}
> +
> +/* Force all queued inode inactivation work to run immediately. */
> +void
> +xfs_inodegc_force(
> +	struct xfs_mount	*mp)
> +{
> +	/*
> +	 * In order to reset the delay timer to run immediately, we have to
> +	 * cancel the work item and requeue it with a zero timer value.  We
> +	 * don't care if the worker races with our requeue, because at worst
> +	 * we iterate the radix tree and find no inodes to inactivate.
> +	 */
> +	if (!cancel_delayed_work(&mp->m_inodegc_work))
> +		return;

We do? I thought we could mod the timer. Yeah:

	mod_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);

will trigger the delayed work to run immediately...

> +
> +	queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> +	flush_delayed_work(&mp->m_inodegc_work);
> +}

Yeah, that's a flush operation, not a force :)

> +/* Stop all queued inactivation work. */
> +void
> +xfs_inodegc_stop(
> +	struct xfs_mount	*mp)
> +{
> +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> +}

Should this flush first? i.e. it will cancel pending work, but if
there is work running, it will wait for it to complete. Do we want
the queued work run before stopping, or just kill it dead?

> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 65897cb0cf2a..f20694f220c8 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1665,6 +1665,35 @@ xfs_inactive_ifree(
>  	return 0;
>  }
>  
> +/* Prepare inode for inactivation. */
> +void
> +xfs_inode_inactivation_prep(
> +	struct xfs_inode	*ip)
> +{
> +	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> +		return;
> +
> +	/*
> +	 * If this inode is unlinked (and now unreferenced) we need to dispose
> +	 * of it in the on disk metadata.
> +	 *
> +	 * Change the generation so that the inode can't be opened by handle
> +	 * now that the last external references has dropped.  Bulkstat won't
> +	 * return inodes with zero nlink so nobody will ever find this inode
> +	 * again.  Then add this inode & blocks to the counts of things that
> +	 * will be freed during the next inactivation run.
> +	 */
> +	if (VFS_I(ip)->i_nlink == 0)
> +		VFS_I(ip)->i_generation = prandom_u32();

open by handle interfaces should not be able to open inodes that
have a zero nlink, hence I'm not sure what changing the generation
number actually buys us here...

If we can open nlink = 0 files via handles, then I think we've got
a bug or two to fix....

> +	/*
> +	 * Detach dquots just in case someone tries a quotaoff while the inode
> +	 * is waiting on the inactive list.  We'll reattach them (if needed)
> +	 * when inactivating the inode.
> +	 */
> +	xfs_qm_dqdetach(ip);
> +}

I think the dquot handling needs better documentation as it impacts
on the life cycle and interactions of dquots...

> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 97f31308de03..b03b127e34cc 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2792,6 +2792,13 @@ xlog_recover_process_iunlinks(
>  		}
>  		xfs_buf_rele(agibp);
>  	}
> +
> +	/*
> +	 * Now that we've put all the iunlink inodes on the lru, let's make
> +	 * sure that we perform all the on-disk metadata updates to actually
> +	 * free those inodes.
> +	 */

What LRU are we putting these inodes on? They are evicted from cache
immediately. A comment simply to say:

	/*
	 * Flush the pending unlinked inodes to ensure they are
	 * fully completed on disk and can be reclaimed before we
	 * signal that recovery is complete.
	 */
> +	xfs_inodegc_force(mp);
>  }
>  
>  STATIC void

.....
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 1c97b155a8ee..cd015e3d72fc 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -640,6 +640,10 @@ xfs_check_summary_counts(
>   * so we need to unpin them, write them back and/or reclaim them before unmount
>   * can proceed.
>   *
> + * Start the process by pushing all inodes through the inactivation process
> + * so that all file updates to on-disk metadata can be flushed with the log.
> + * After the AIL push, all inodes should be ready for reclamation.
> + *
>   * An inode cluster that has been freed can have its buffer still pinned in
>   * memory because the transaction is still sitting in a iclog. The stale inodes
>   * on that buffer will be pinned to the buffer until the transaction hits the
> @@ -663,6 +667,7 @@ static void
>  xfs_unmount_flush_inodes(
>  	struct xfs_mount	*mp)
>  {
> +	xfs_inodegc_force(mp);
>  	xfs_log_force(mp, XFS_LOG_SYNC);
>  	xfs_extent_busy_wait_all(mp);
>  	flush_workqueue(xfs_discard_wq);
> @@ -670,6 +675,7 @@ xfs_unmount_flush_inodes(
>  	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
>  
>  	xfs_ail_push_all_sync(mp->m_ail);
> +	xfs_inodegc_stop(mp);

That looks wrong. Stopping the background inactivation should be
done before we flush the AIL because background inactivation dirties
inodes. So we should be stopping the inodegc the moment we've
finished flushing out all the pending inactivations...

Hmm. xfs_unmount_flush_inodes() doesn't ring a bell with me, and
it's not in the current tree. Did I miss this in an earlier patch in
this patchset, or something else?

>  	cancel_delayed_work_sync(&mp->m_reclaim_work);
>  	xfs_reclaim_inodes(mp);
>  	xfs_health_unmount(mp);
> @@ -1095,6 +1101,13 @@ xfs_unmountfs(
>  	uint64_t		resblks;
>  	int			error;
>  
> +	/*
> +	 * Perform all on-disk metadata updates required to inactivate inodes.
> +	 * Since this can involve finobt updates, do it now before we lose the
> +	 * per-AG space reservations.
> +	 */
> +	xfs_inodegc_force(mp);
> +

I can't tell why this is necessary given what
xfs_unmount_flush_inodes() does. Or, alternatively, why
xfs_unmount_flush_inodes() can do what it does without caring about
per-ag space reservations....

> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index ca1b57d291dc..0f9a1450fe0e 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -104,6 +104,12 @@ xfs_qm_scall_quotaoff(
>  	uint			inactivate_flags;
>  	struct xfs_qoff_logitem	*qoffstart = NULL;
>  
> +	/*
> +	 * Clean up the inactive list before we turn quota off, to reduce the
> +	 * amount of quotaoff work we have to do with the mutex held.
> +	 */
> +	xfs_inodegc_force(mp);
> +

Hmmm. Why not just stop background inactivation altogether while
quotaoff runs? i.e. just do normal, inline inactivation when
quotaoff is running, and then we can get rid of the whole "drop
dquot references" issue that background inactivation has...

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e774358383d6..8d0142487fc7 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -637,28 +637,34 @@ xfs_fs_destroy_inode(
>  	struct inode		*inode)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_mount	*mp = ip->i_mount;
> +	bool			need_inactive;
>  
>  	trace_xfs_destroy_inode(ip);
>  
>  	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> -	XFS_STATS_INC(ip->i_mount, vn_rele);
> -	XFS_STATS_INC(ip->i_mount, vn_remove);
> +	XFS_STATS_INC(mp, vn_rele);
> +	XFS_STATS_INC(mp, vn_remove);
>  
> -	xfs_inactive(ip);
> -
> -	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
> +	need_inactive = xfs_inode_needs_inactivation(ip);
> +	if (need_inactive) {
> +		trace_xfs_inode_set_need_inactive(ip);
> +		xfs_inode_inactivation_prep(ip);
> +	} else if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
>  		xfs_check_delalloc(ip, XFS_DATA_FORK);
>  		xfs_check_delalloc(ip, XFS_COW_FORK);
>  		ASSERT(0);
>  	}

Isn't this i_delayed_blks check still valid even for inodes that
need background inactivation? i.e. all dirty data has been flushed
at this point, and so i_delayed_blks should be zero for all
inodes regardless of whether they need inactivation or not....

> -
> -	XFS_STATS_INC(ip->i_mount, vn_reclaim);
> +	XFS_STATS_INC(mp, vn_reclaim);
> +	trace_xfs_inode_set_reclaimable(ip);
>  
>  	/*
>  	 * We should never get here with one of the reclaim flags already set.
>  	 */
>  	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
>  	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
> +	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_NEED_INACTIVE));
> +	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_INACTIVATING));

This should probably be opencoded instead of taking the flags
spinlock 4 times...
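
A userspace sketch of the difference, with the spinlock replaced by a
counter so the number of acquisitions is visible. The struct, the M_* bit
values and the helper names are all invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values; the real XFS inode flag bits differ. */
#define M_IRECLAIMABLE	(1u << 0)
#define M_IRECLAIM	(1u << 1)
#define M_NEED_INACTIVE	(1u << 2)
#define M_INACTIVATING	(1u << 3)
#define M_GC_FLAGS \
	(M_IRECLAIMABLE | M_IRECLAIM | M_NEED_INACTIVE | M_INACTIVATING)

struct inode_model {
	unsigned int	flags;
	unsigned int	lock_acquisitions;	/* simulated spin_lock() calls */
};

/* Models a locked flag test: one lock/unlock cycle per call. */
static bool
flags_test(
	struct inode_model	*ip,
	unsigned int		mask)
{
	bool			set;

	ip->lock_acquisitions++;		/* lock would be taken here */
	set = (ip->flags & mask) != 0;
	return set;				/* and dropped here */
}

/* Four separate assertions, four lock cycles. */
static bool
check_flags_one_by_one(
	struct inode_model	*ip)
{
	bool			bad = false;

	bad |= flags_test(ip, M_IRECLAIMABLE);
	bad |= flags_test(ip, M_IRECLAIM);
	bad |= flags_test(ip, M_NEED_INACTIVE);
	bad |= flags_test(ip, M_INACTIVATING);
	return bad;
}

/* One combined test against a mask, one lock cycle. */
static bool
check_flags_open_coded(
	struct inode_model	*ip)
{
	return flags_test(ip, M_GC_FLAGS);
}
```

Both helpers report the same answer; the open-coded one just gets there
with a single pass through the lock.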

>  
>  	/*
>  	 * We always use background reclaim here because even if the inode is
> @@ -667,7 +673,10 @@ xfs_fs_destroy_inode(
>  	 * reclaim path handles this more efficiently than we can here, so
>  	 * simply let background reclaim tear down all inodes.
>  	 */
> -	xfs_inode_set_reclaim_tag(ip);
> +	if (need_inactive)
> +		xfs_inode_set_inactive_tag(ip);
> +	else
> +		xfs_inode_set_reclaim_tag(ip);
>  }
>  
>  static void
> @@ -797,6 +806,13 @@ xfs_fs_statfs(
>  	xfs_extlen_t		lsize;
>  	int64_t			ffree;
>  
> +	/*
> +	 * Process all the queued file and speculative preallocation cleanup so
> +	 * that the counter values we report here do not incorporate any
> +	 * resources that were previously deleted.
> +	 */
> +	xfs_inodegc_force(mp);

Same comment as for xfs_fs_counts()....
> +
>  	statp->f_type = XFS_SUPER_MAGIC;
>  	statp->f_namelen = MAXNAMELEN - 1;
>  
> @@ -911,6 +927,18 @@ xfs_fs_unfreeze(
>  	return 0;
>  }
>  
> +/*
> + * Before we get to stage 1 of a freeze, force all the inactivation work so
> + * that there's less work to do if we crash during the freeze.
> + */
> +STATIC int
> +xfs_fs_freeze_super(
> +	struct super_block	*sb)
> +{
> +	xfs_inodegc_force(XFS_M(sb));
> +	return freeze_super(sb);
> +}

Yeah, definitely need a description of freeze interactions...

> @@ -1720,6 +1749,13 @@ xfs_remount_ro(
>  		return error;
>  	}
>  
> +	/*
> +	 * Perform all on-disk metadata updates required to inactivate inodes.
> +	 * Since this can involve finobt updates, do it now before we lose the
> +	 * per-AG space reservations.
> +	 */
> +	xfs_inodegc_force(mp);

Should we stop background inactivation, because we can't make
modifications anymore and hence background inactivation makes little
sense...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 03/11] xfs: don't reclaim dquots with incore reservations
  2021-03-23  0:01     ` Darrick J. Wong
@ 2021-03-23  1:48       ` Dave Chinner
  0 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2021-03-23  1:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 22, 2021 at 05:01:11PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 23, 2021 at 10:31:39AM +1100, Dave Chinner wrote:
> > On Wed, Mar 10, 2021 at 07:05:57PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > If a dquot has an incore reservation that exceeds the ondisk count, it
> > > by definition has active incore state and must not be reclaimed.  Up to
> > > this point every inode with an incore dquot reservation has always
> > > retained a reference to the dquot so it was never possible for
> > > xfs_qm_dquot_isolate to be called on a dquot with active state and zero
> > > refcount, but this will soon change.
> > > 
> > > Deferred inode inactivation is about to reorganize how inodes are
> > > inactivated by shunting all that work to a background workqueue.  In
> > > order to avoid deadlocks with the quotaoff inode scan and reduce overall
> > > memory requirements (since inodes can spend a lot of time waiting for
> > > inactivation), inactive inodes will drop their dquot references while
> > > they're waiting to be inactivated.
> > > 
> > > However, inactive inodes can have delalloc extents in the data fork or
> > > any extents in the CoW fork.  Either of these contribute to the dquot's
> > > incore reservation being larger than the resource count (i.e. they're
> > > the reason the dquot still has active incore state), so we cannot allow
> > > the dquot to be reclaimed.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > .....
> > >  static enum lru_status
> > >  xfs_qm_dquot_isolate(
> > >  	struct list_head	*item,
> > > @@ -427,10 +441,15 @@ xfs_qm_dquot_isolate(
> > >  		goto out_miss_busy;
> > >  
> > >  	/*
> > > -	 * This dquot has acquired a reference in the meantime remove it from
> > > -	 * the freelist and try again.
> > > +	 * Either this dquot has incore reservations or it has acquired a
> > > +	 * reference.  Remove it from the freelist and try again.
> > > +	 *
> > > +	 * Inodes tagged for inactivation drop their dquot references to avoid
> > > +	 * deadlocks with quotaoff.  If these inodes have delalloc reservations
> > > +	 * in the data fork or any extents in the CoW fork, these contribute
> > > +	 * to the dquot's incore block reservation exceeding the count.
> > >  	 */
> > > -	if (dqp->q_nrefs) {
> > > +	if (xfs_dquot_has_incore_resv(dqp) || dqp->q_nrefs) {
> > >  		xfs_dqunlock(dqp);
> > >  		XFS_STATS_INC(dqp->q_mount, xs_qm_dqwants);
> > >  
> > 
> > This means we can have dquots with no references that aren't on
> > the free list and aren't actually referenced by any inode, either.
> > 
> > So if we now shut down the filesystem, what frees these dquots?
> > Are we relying on xfs_qm_dqpurge_all() to find all these dquots
> > and xfs_qm_dqpurge() guaranteeing that they are always cleaned
> > and freed?
> 
> Yes.  Want me to add that to the comment?

Yes Please!

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-23  1:44   ` Dave Chinner
@ 2021-03-23  4:00     ` Darrick J. Wong
  2021-03-23  5:19       ` Dave Chinner
  2021-03-24 17:53       ` Christoph Hellwig
  0 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-23  4:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 23, 2021 at 12:44:17PM +1100, Dave Chinner wrote:
> On Wed, Mar 10, 2021 at 07:06:13PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> > defer the inactivation phase to a separate workqueue.  With this we
> > avoid blocking memory reclaim on filesystem metadata updates that are
> > necessary to free an in-core inode, such as post-eof block freeing, COW
> > staging extent freeing, and truncating and freeing unlinked inodes.  Now
> > that work is deferred to a workqueue where we can do the freeing in
> > batches.
> > 
> > We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
> > The first flag helps our worker find inodes needing inactivation, and
> > the second flag marks inodes that are in the process of being
> > inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> > inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
> > 
> > Unfortunately, deferring the inactivation has one huge downside --
> > eventual consistency.  Since all the freeing is deferred to a worker
> > thread, one can rm a file but the space doesn't come back immediately.
> > This can cause some odd side effects with quota accounting and statfs,
> > so we also force inactivation scans in order to maintain the existing
> > behaviors, at least outwardly.
> > 
> > For this patch we'll set the delay to zero to mimic the old timing as
> > much as possible; in the next patch we'll play with different delay
> > settings.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ....
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index a2a407039227..3a3baf56198b 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -19,6 +19,8 @@
> >  #include "xfs_log.h"
> >  #include "xfs_ag.h"
> >  #include "xfs_ag_resv.h"
> > +#include "xfs_inode.h"
> > +#include "xfs_icache.h"
> >  
> >  /*
> >   * growfs operations
> > @@ -290,6 +292,13 @@ xfs_fs_counts(
> >  	xfs_mount_t		*mp,
> >  	xfs_fsop_counts_t	*cnt)
> >  {
> > +	/*
> > +	 * Process all the queued file and speculative preallocation cleanup so
> > +	 * that the counter values we report here do not incorporate any
> > +	 * resources that were previously deleted.
> > +	 */
> > +	xfs_inodegc_force(mp);
> 
> xfs_fs_counts() is supposed to be a quick, non-blocking summary of
> the state - it can never supply userspace with accurate values
> because they are wrong even before the ioctl returns to userspace.
> Hence we do not attempt to make them correct, just use a fast, point
> in time sample of the current counter values.
> 
> So this seems like an unnecessarily heavyweight operation
> to add to this function....

I agree, xfs_inodegc_force is a heavyweight operation to add to statvfs
and (further down) the quota reporting ioctl.  I added these calls to
maintain the user-visible behavior that one can df a mount, rm -rf a
30T directory tree, df again, and observe a 30T difference in available
space between the two df calls.

There are a lot of fstests that require this kind of behavior to pass.
In my internal testing without this bit applied, I also got complaints
about breaking the user-visible behavior of XFS that people have gotten
used to.

Earlier revisions of this patchset tried to maintain counts of the
resources used by inodes queued for inactivation so that we could adjust
the values reported by statvfs and the quota reporting ioctl.  This
meant we didn't have to delay either call at all, but it turns out that
it's not feasible to maintain an accurate count of inactive resources.
Any resources that are shared at destroy_inode time cannot become part
of this liar counter, and consulting the refcountbt to decide which
extents should be added just makes unlinking even slower.  Worse yet,
unsharing of shared blocks attached to queued inactive inodes implies
either that we have to update the liar counter, or that we have to be ok
with the free block count fluctuating for a while after a deletion if
that deletion ends up freeing more space than the liar counter thinks
we can free by flushing inactivation.

Hmm, maybe this could maintain an approximate liar counter and only flush
inactivation when the liar counter would cause us to be off by more than
some configurable amount?  The fstests that care about free space
accounting are not going to be happy since they are measured with very
tight tolerances.
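
To make the threshold idea concrete, here is a rough userspace sketch.
It is only a model under stated assumptions: struct liar_counter,
liar_add, liar_reset, liar_should_flush and the tolerance field are all
hypothetical names that do not exist in XFS, and the block estimate is
whatever the caller thinks the queued inode will free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in XFS. */
struct liar_counter {
	uint64_t	pending_blocks;	/* estimated blocks awaiting inactivation */
	uint64_t	tolerance;	/* largest error statfs may report */
};

/* An inode was queued for inactivation; remember its estimated size. */
static void
liar_add(struct liar_counter *lc, uint64_t est_blocks)
{
	/* Refuse to wrap the estimate. */
	assert(lc->pending_blocks <= UINT64_MAX - est_blocks);
	lc->pending_blocks += est_blocks;
}

/* The inactivation worker finished a pass; nothing is pending any more. */
static void
liar_reset(struct liar_counter *lc)
{
	lc->pending_blocks = 0;
}

/* Should a statfs-like caller flush inactivation before sampling? */
static bool
liar_should_flush(const struct liar_counter *lc)
{
	return lc->pending_blocks > lc->tolerance;
}
```

With a tolerance of zero this degenerates into "always flush when
anything is queued", which is what the patch does today; a larger
tolerance trades accuracy for a cheaper statfs.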

> Also, I don't like the word "force" in functions like this: force it
> to do what, exactly? If you want a queue flush, then
> xfs_inodegc_flush() matches with how flush_workqueue() works...

Yes, I like that name better.  xfs_inodegc_flush it is.

> 
> >  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
> >  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
> >  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index e6a62f765422..1b7652af5ee5 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -195,6 +195,18 @@ xfs_perag_clear_reclaim_tag(
> >  	trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
> >  }
> >  
> > +static void
> > +__xfs_inode_set_reclaim_tag(
> > +	struct xfs_perag	*pag,
> > +	struct xfs_inode	*ip)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +
> > +	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
> > +			   XFS_ICI_RECLAIM_TAG);
> > +	xfs_perag_set_reclaim_tag(pag);
> > +	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> > +}
> >  
> >  /*
> >   * We set the inode flag atomically with the radix tree tag.
> > @@ -212,10 +224,7 @@ xfs_inode_set_reclaim_tag(
> >  	spin_lock(&pag->pag_ici_lock);
> >  	spin_lock(&ip->i_flags_lock);
> >  
> > -	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
> > -			   XFS_ICI_RECLAIM_TAG);
> > -	xfs_perag_set_reclaim_tag(pag);
> > -	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> > +	__xfs_inode_set_reclaim_tag(pag, ip);
> >  
> >  	spin_unlock(&ip->i_flags_lock);
> >  	spin_unlock(&pag->pag_ici_lock);
> 
> First thought: rename xfs_inode_set_reclaim_tag() to
> xfs_inode_set_reclaim_tag_locked(), leave the guts as
> xfs_inode_set_reclaim_tag().
> 
> > @@ -233,6 +242,94 @@ xfs_inode_clear_reclaim_tag(
> >  	xfs_perag_clear_reclaim_tag(pag);
> >  }
> >  
> > +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> > +static void
> > +xfs_inodegc_queue(
> > +	struct xfs_mount        *mp)
> > +{
> > +	rcu_read_lock();
> > +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG))
> > +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
> > +				2 * HZ);
> > +	rcu_read_unlock();
> > +}
> 
> Why half a second and not something referenced against the inode
> reclaim/sync period?

It's actually 2 seconds, and the next patch adds a knob to tweak the
default value.

The first version of this patchset from 2017 actually did just use
(6 * xfs_syncd_centisecs / 10) like reclaim does.  This turned out to be
pretty foolish because that meant that reclaim and inactivation would
start at the same time, and because inactivation is slow, most of the
inodes would miss the reclaim window and sit around pointlessly until
the next one.

The next iteration from mid 2019 changed this to (xfs_syncd_centisecs/5)
which fixed that, but large deltree storms could lead to so many inodes
being inactivated that we'd still miss the reclaim window sometimes.
Around this time I got my djwong-dev tree hooked up to the ktest robot
and it started complaining about performance regressions and noticeably
higher slab usage for xfs inodes and log items.

The next time I got back to this was shortly after Dave cleaned up the
reclaim behavior (2020) to be driven by the AIL, which mostly fixed the
performance complaints, except for the one about AIM7.  I was intrigued
enough by this to instrument the patchset and fstests and the fstests
cloud hosts <cough> to see if I could derive a reasonable default value.

I've observed through experimentation that 2 seconds seems like a good
default value -- it's long enough to enable a lot of batching of
inactive inodes, but short enough that the background thread can
throttle the foreground threads by competing for the log grant heads.
I also noticed that the amount of overhead introduced by background
inactivation (as measured by fstests run times and other <cough>
performance tests) ranged from minimal at 0 seconds to about 20% at
(6*xfs_syncd_centisecs/10).

Honestly, this could just be zero.  Assuming your distro has power
efficient workqueues enabled, the ~4-10ms delay introduced by that is
enough to realize some batching advantage with zero noticeable effect on
performance.

> > +/* Remember that an AG has one more inode to inactivate. */
> > +static void
> > +xfs_perag_set_inactive_tag(
> > +	struct xfs_perag	*pag)
> > +{
> > +	struct xfs_mount	*mp = pag->pag_mount;
> > +
> > +	lockdep_assert_held(&pag->pag_ici_lock);
> > +	if (pag->pag_ici_inactive++)
> > +		return;
> > +
> > +	/* propagate the inactive tag up into the perag radix tree */
> > +	spin_lock(&mp->m_perag_lock);
> > +	radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno,
> > +			   XFS_ICI_INACTIVE_TAG);
> > +	spin_unlock(&mp->m_perag_lock);
> > +
> > +	/* schedule periodic background inode inactivation */
> > +	xfs_inodegc_queue(mp);
> > +
> > +	trace_xfs_perag_set_inactive(mp, pag->pag_agno, -1, _RET_IP_);
> > +}
> > +
> > +/* Set this inode's inactive tag and set the per-AG tag. */
> > +void
> > +xfs_inode_set_inactive_tag(
> > +	struct xfs_inode	*ip)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_perag	*pag;
> > +
> > +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> > +	spin_lock(&pag->pag_ici_lock);
> > +	spin_lock(&ip->i_flags_lock);
> > +
> > +	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
> > +				   XFS_ICI_INACTIVE_TAG);
> > +	xfs_perag_set_inactive_tag(pag);
> > +	__xfs_iflags_set(ip, XFS_NEED_INACTIVE);
> > +
> > +	spin_unlock(&ip->i_flags_lock);
> > +	spin_unlock(&pag->pag_ici_lock);
> > +	xfs_perag_put(pag);
> > +}
> > +
> > +/* Remember that an AG has one less inode to inactivate. */
> > +static void
> > +xfs_perag_clear_inactive_tag(
> > +	struct xfs_perag	*pag)
> > +{
> > +	struct xfs_mount	*mp = pag->pag_mount;
> > +
> > +	lockdep_assert_held(&pag->pag_ici_lock);
> > +	if (--pag->pag_ici_inactive)
> > +		return;
> > +
> > +	/* clear the inactive tag from the perag radix tree */
> > +	spin_lock(&mp->m_perag_lock);
> > +	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno,
> > +			     XFS_ICI_INACTIVE_TAG);
> > +	spin_unlock(&mp->m_perag_lock);
> > +	trace_xfs_perag_clear_inactive(mp, pag->pag_agno, -1, _RET_IP_);
> > +}
> > +
> > +/* Clear this inode's inactive tag and try to clear the AG's. */
> > +STATIC void
> 
> static
> 
> > +xfs_inode_clear_inactive_tag(
> > +	struct xfs_perag	*pag,
> > +	xfs_ino_t		ino)
> > +{
> > +	radix_tree_tag_clear(&pag->pag_ici_root,
> > +			     XFS_INO_TO_AGINO(pag->pag_mount, ino),
> > +			     XFS_ICI_INACTIVE_TAG);
> > +	xfs_perag_clear_inactive_tag(pag);
> > +}
> 
> These are just straight copies of the reclaim tag code. Do you have
> a plan for factoring these into a single implementation to clean
> this up? Something like this:
> 
> static void
> xfs_inode_clear_tag(
> 	struct xfs_perag	*pag,
> 	xfs_ino_t		ino,
> 	int			tag)
> {
> 	struct xfs_mount	*mp = pag->pag_mount;
> 
> 	lockdep_assert_held(&pag->pag_ici_lock);
> 	radix_tree_tag_clear(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ino),
> 				tag);
> 	switch(tag) {
> 	case XFS_ICI_INACTIVE_TAG:
> 		if (--pag->pag_ici_inactive)
> 			return;
> 		break;
> 	case XFS_ICI_RECLAIM_TAG:
> 		if (--pag->pag_ici_reclaim)
> 			return;
> 		break;
> 	default:
> 		ASSERT(0);
> 		return;
> 	}
> 
> 	spin_lock(&mp->m_perag_lock);
> 	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno, tag);
> 	spin_unlock(&mp->m_perag_lock);
> }
> 
> As a followup patch? The set tag case looks similarly easy to make
> generic...

Yeah.  At this point I might as well just clean all of this up for the
next revision of this series, because as I said earlier I had thought
that you were still working on a second rework of reclaim.  Now that I
know you're not, I'll hack away at this twisty pile too.
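
For the record, the set and clear sides could share one shape.  The
following is a userspace sketch only, assuming a per-AG counter per tag
and one bit per AG in place of the perag radix tree tags; struct
perag_model, struct mount_model, model_set_tag and model_clear_tag are
made-up names, and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tag numbering for the model; not the kernel values. */
enum ici_tag { ICI_RECLAIM_TAG = 0, ICI_INACTIVE_TAG = 1, ICI_NR_TAGS };

/* Userspace stand-in for struct xfs_perag. */
struct perag_model {
	unsigned int	agno;
	unsigned int	tagged[ICI_NR_TAGS];	/* inodes carrying each tag */
};

/* Stand-in for the per-fs perag tree tags: one bit per AG, per tag. */
struct mount_model {
	unsigned int	ag_tags[ICI_NR_TAGS];
};

/*
 * Count one more tagged inode in this AG.  Returns true if the AG just
 * became tagged, which is where a caller would queue background work.
 */
static bool
model_set_tag(struct mount_model *mp, struct perag_model *pag,
	      enum ici_tag tag)
{
	assert(pag->agno < 32);
	if (pag->tagged[tag]++)
		return false;
	mp->ag_tags[tag] |= 1U << pag->agno;
	return true;
}

/*
 * Count one less tagged inode in this AG.  Returns true if that was the
 * last one and the per-fs tag was cleared.
 */
static bool
model_clear_tag(struct mount_model *mp, struct perag_model *pag,
		enum ici_tag tag)
{
	assert(pag->agno < 32);
	assert(pag->tagged[tag] > 0);
	if (--pag->tagged[tag])
		return false;
	mp->ag_tags[tag] &= ~(1U << pag->agno);
	return true;
}
```

Indexing the counter by tag is one way to avoid the switch statement;
whether that is nicer than separate pag_ici_reclaim and
pag_ici_inactive fields is a matter of taste.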

> > +
> >  static void
> >  xfs_inew_wait(
> >  	struct xfs_inode	*ip)
> > @@ -298,6 +395,13 @@ xfs_iget_check_free_state(
> >  	struct xfs_inode	*ip,
> >  	int			flags)
> >  {
> > +	/*
> > +	 * Unlinked inodes awaiting inactivation must not be reused until we
> > +	 * have a chance to clear the on-disk metadata.
> > +	 */
> > +	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
> > +		return -ENOENT;
> > +
> >  	if (flags & XFS_IGET_CREATE) {
> >  		/* should be a free inode */
> >  		if (VFS_I(ip)->i_mode != 0) {
> 
> How do we get here with an XFS_NEED_INACTIVE inode?
> xfs_iget_check_free_state() is only called from the cache miss path,

You added it to xfs_iget_cache_hit in 2018, commit afca6c5b2595f...

> but we should never get here with a cached inode that is awaiting
> inactivation...

...which means that any xfs_iget can get ahold of an inode that's
awaiting inactivation but hasn't yet started that process.  It's totally
valid to iget an inode that has NEED_INACTIVE set, since we use
inactivation for one final gc of post-eof and COW blocks on linked files.

> > @@ -323,6 +427,67 @@ xfs_iget_check_free_state(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * We've torn down the VFS part of this NEED_INACTIVE inode, so we need to get
> > + * it back into working state.
> > + */
> > +static int
> > +xfs_iget_inactive(
> > +	struct xfs_perag	*pag,
> > +	struct xfs_inode	*ip)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct inode		*inode = VFS_I(ip);
> > +	int			error;
> > +
> > +	error = xfs_reinit_inode(mp, inode);
> > +	if (error) {
> > +		bool wake;
> > +		/*
> > +		 * Re-initializing the inode failed, and we are in deep
> > +		 * trouble.  Try to re-add it to the inactive list.
> > +		 */
> > +		rcu_read_lock();
> > +		spin_lock(&ip->i_flags_lock);
> > +		wake = !!__xfs_iflags_test(ip, XFS_INEW);
> > +		ip->i_flags &= ~(XFS_INEW | XFS_INACTIVATING);
> > +		if (wake)
> > +			wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
> > +		ASSERT(ip->i_flags & XFS_NEED_INACTIVE);
> > +		trace_xfs_iget_inactive_fail(ip);
> > +		spin_unlock(&ip->i_flags_lock);
> > +		rcu_read_unlock();
> > +		return error;
> > +	}
> > +
> > +	spin_lock(&pag->pag_ici_lock);
> > +	spin_lock(&ip->i_flags_lock);
> > +
> > +	/*
> > +	 * Clear the per-lifetime state in the inode as we are now effectively
> > +	 * a new inode and need to return to the initial state before reuse
> > +	 * occurs.
> > +	 */
> > +	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
> > +	ip->i_flags |= XFS_INEW;
> > +	xfs_inode_clear_inactive_tag(pag, ip->i_ino);
> > +	inode->i_state = I_NEW;
> > +	ip->i_sick = 0;
> > +	ip->i_checked = 0;
> > +
> > +	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> > +	init_rwsem(&inode->i_rwsem);
> > +
> > +	spin_unlock(&ip->i_flags_lock);
> > +	spin_unlock(&pag->pag_ici_lock);
> > +
> > +	/*
> > +	 * Reattach dquots since we might have removed them when we put this
> > +	 * inode on the inactivation list.
> > +	 */
> > +	return xfs_qm_dqattach(ip);
> > +}
> 
> Ah, we don't actually perform any of the inactivation stuff here, so
> we could be returning an unlinked inode that hasn't had its data or
> attribute forks truncated away at this point. That seems... wrong.

If the inode is unlinked then the code you asked about earlier in
xfs_iget_check_free_state will prevent us from returning the inode.

If the inode is linked, then I don't see what's wrong with returning it
to userspace with speculative preallocations still attached.
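
Put as a decision table, the cache-hit cases look roughly like the
userspace sketch below.  This is a model, not the patch: the flag
values, struct inode_model, enum iget_outcome and model_iget_cache_hit
are invented for illustration, the order of the checks is an
assumption, and -EAGAIN for the INACTIVATING case is an assumed "retry
later" code (the commit message only says iget bails).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits for a userspace model; not the kernel values. */
#define M_NEED_INACTIVE	(1U << 0)
#define M_INACTIVATING	(1U << 1)

struct inode_model {
	unsigned int	nlink;
	unsigned int	flags;
};

enum iget_outcome {
	IGET_REUSE_AS_IS,	/* nothing pending, hand the inode back */
	IGET_RESURRECT,		/* clear NEED_INACTIVE and reinitialize */
};

/* Returns 0 and fills *out, or a negative errno. */
static int
model_iget_cache_hit(const struct inode_model *ip, enum iget_outcome *out)
{
	assert(out);

	/* Unlinked and queued: on-disk metadata is not yet cleaned up. */
	if (ip->nlink == 0 && (ip->flags & M_NEED_INACTIVE))
		return -ENOENT;

	/* The worker already owns it; -EAGAIN is an assumed retry code. */
	if (ip->flags & M_INACTIVATING)
		return -EAGAIN;

	*out = (ip->flags & M_NEED_INACTIVE) ? IGET_RESURRECT :
					       IGET_REUSE_AS_IS;
	return 0;
}
```

Only the linked, queued-but-not-started case resurrects the inode;
that is the case where leftover post-eof or COW blocks are harmless.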

> Also, this is largely a copy/paste of the XFS_IRECLAIMABLE reuse
> code path...

Yeah, I should try to merge them.

> .....
> 
> > @@ -713,6 +904,43 @@ xfs_icache_inode_is_allocated(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Grab the inode for inactivation exclusively.
> > + * Return true if we grabbed it.
> > + */
> > +static bool
> > +xfs_inactive_grab(
> > +	struct xfs_inode	*ip)
> > +{
> > +	ASSERT(rcu_read_lock_held());
> > +
> > +	/* quick check for stale RCU freed inode */
> > +	if (!ip->i_ino)
> > +		return false;
> > +
> > +	/*
> > +	 * The radix tree lock here protects a thread in xfs_iget from racing
> > +	 * with us starting reclaim on the inode.
> > +	 *
> > +	 * Due to RCU lookup, we may find inodes that have been freed and only
> > +	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
> > +	 * aren't candidates for reclaim at all, so we must check the
> > +	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> > +	 * Obviously if XFS_NEED_INACTIVE isn't set then we ignore this inode.
> > +	 */
> > +	spin_lock(&ip->i_flags_lock);
> > +	if (!(ip->i_flags & XFS_NEED_INACTIVE) ||
> > +	    (ip->i_flags & XFS_INACTIVATING)) {
> > +		/* not a inactivation candidate. */
> > +		spin_unlock(&ip->i_flags_lock);
> > +		return false;
> > +	}
> > +
> > +	ip->i_flags |= XFS_INACTIVATING;
> > +	spin_unlock(&ip->i_flags_lock);
> > +	return true;
> > +}
> > +
> >  /*
> >   * The inode lookup is done in batches to keep the amount of lock traffic and
> >   * radix tree lookups to a minimum. The batch size is a trade off between
> > @@ -736,6 +964,9 @@ xfs_inode_walk_ag_grab(
> >  
> >  	ASSERT(rcu_read_lock_held());
> >  
> > +	if (flags & XFS_INODE_WALK_INACTIVE)
> > +		return xfs_inactive_grab(ip);
> > +
> 
> Hmmm. This doesn't actually grab the inode. It's an unreferenced
> inode walk, in a function that assumes that the grab() call returns
> a referenced inode. Why isn't this using the inode reclaim walk
> which is intended to walk unreferenced inodes?

Because I thought that some day you might want to rebase the inode
reclaim cleanups from 2019, and I didn't want to slow either of us down
by forcing a gigantic rebase.  So I left the duplicative inode walk
functions.

FWIW these are currently separate functions with separate call sites in
xfs_inode_walk_ag since the "remove indirect calls from inode walk"
series made it more convenient to have a separate function for each tag.

As for the name ... reclaim also has a "grab" function even though it
walks unreferenced inodes.

> 
> >  	/* Check for stale RCU freed inode */
> >  	spin_lock(&ip->i_flags_lock);
> >  	if (!ip->i_ino)
> > @@ -743,7 +974,8 @@ xfs_inode_walk_ag_grab(
> >  
> >  	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> >  	if ((!newinos && __xfs_iflags_test(ip, XFS_INEW)) ||
> > -	    __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM))
> > +	    __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM |
> > +				  XFS_NEED_INACTIVE | XFS_INACTIVATING))
> 
> Comment needs updating. Also need a mask define here...

This function is now called xfs_blockgc_grab, and yes I did change it.

> 
> >  		goto out_unlock_noent;
> >  	spin_unlock(&ip->i_flags_lock);
> >  
> > @@ -848,7 +1080,8 @@ xfs_inode_walk_ag(
> >  			    xfs_iflags_test(batch[i], XFS_INEW))
> >  				xfs_inew_wait(batch[i]);
> >  			error = execute(batch[i], args);
> > -			xfs_irele(batch[i]);
> > +			if (!(iter_flags & XFS_INODE_WALK_INACTIVE))
> > +				xfs_irele(batch[i]);
> >  			if (error == -EAGAIN) {
> >  				skipped++;
> >  				continue;
> 
> Hmmmm.
> 
> > +
> > +/*
> > + * Deferred Inode Inactivation
> > + * ===========================
> > + *
> > + * Sometimes, inodes need to have work done on them once the last program has
> > + * closed the file.  Typically this means cleaning out any leftover post-eof or
> > + * CoW staging blocks for linked files.  For inodes that have been totally
> > + * unlinked, this means unmapping data/attr/cow blocks, removing the inode
> > + * from the unlinked buckets, and marking it free in the inobt and inode table.
> > + *
> > + * This process can generate many metadata updates, which shows up as close()
> > + * and unlink() calls that take a long time.  We defer all that work to a
> > + * per-AG workqueue which means that we can batch a lot of work and do it in
> > + * inode order for better performance.  Furthermore, we can control the
> > + * workqueue, which means that we can avoid doing inactivation work at a bad
> > + * time, such as when the fs is frozen.
> > + *
> > + * Deferred inactivation introduces new inode flag states (NEED_INACTIVE and
> > + * INACTIVATING) and adds a new INACTIVE radix tree tag for fast access.  We
> > + * maintain separate perag counters for both types, and move counts as inodes
> > + * wander the state machine, which now works as follows:
> > + *
> > + * If the inode needs inactivation, we:
> > + *   - Set the NEED_INACTIVE inode flag
> > + *   - Increment the per-AG inactive count
> > + *   - Set the INACTIVE tag in the per-AG inode tree
> > + *   - Set the INACTIVE tag in the per-fs AG tree
> > + *   - Schedule background inode inactivation
> > + *
> > + * If the inode does not need inactivation, we:
> > + *   - Set the RECLAIMABLE inode flag
> > + *   - Increment the per-AG reclaim count
> > + *   - Set the RECLAIM tag in the per-AG inode tree
> > + *   - Set the RECLAIM tag in the per-fs AG tree
> > + *   - Schedule background inode reclamation
> > + *
> > + * When it is time for background inode inactivation, we:
> > + *   - Set the INACTIVATING inode flag
> > + *   - Make all the on-disk updates
> > + *   - Clear both INACTIVATING and NEED_INACTIVE inode flags
> > + *   - Decrement the per-AG inactive count
> > + *   - Clear the INACTIVE tag in the per-AG inode tree
> > + *   - Clear the INACTIVE tag in the per-fs AG tree if that was the last one
> > + *   - Kick the inode into reclamation per the previous paragraph.
> 
> I suspect this needs to set the IRECLAIMABLE flag before it clears
> the INACTIVE flags so that inode_ag_walk() doesn't find it in a
> transient state. Hmmm - that may be why you factored the reclaim
> flag setting functions?

Yes and yes.
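
A tiny userspace sketch of the ordering being asked for, assuming an
observer that samples the flags without holding i_flags_lock.  The flag
values and the model_* names are hypothetical.  In the patch as posted
both updates happen under pag_ici_lock and i_flags_lock, so lock
holders never see the intermediate state either way.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for a userspace model; not the kernel values. */
#define M_NEED_INACTIVE	(1U << 0)
#define M_INACTIVATING	(1U << 1)
#define M_RECLAIMABLE	(1U << 2)
#define M_OWNED_MASK	(M_NEED_INACTIVE | M_INACTIVATING | M_RECLAIMABLE)

/* True if some background mechanism still claims the inode. */
static bool
model_is_owned(unsigned int flags)
{
	return (flags & M_OWNED_MASK) != 0;
}

/*
 * Hand an inode from inactivation to reclaim.  Returns true if the
 * inode was owned at every intermediate step.
 */
static bool
model_inactive_to_reclaim(unsigned int *flags)
{
	bool	always_owned = true;

	assert(*flags & M_INACTIVATING);

	/* Step 1: reclaim claims the inode first. */
	*flags |= M_RECLAIMABLE;
	always_owned &= model_is_owned(*flags);

	/* Step 2: only then does inactivation let go. */
	*flags &= ~(M_NEED_INACTIVE | M_INACTIVATING);
	always_owned &= model_is_owned(*flags);

	return always_owned;
}
```

Swapping the two steps would leave a window where the flags word is
zero, which is the transient state a lockless walker could trip over.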

> > + *
> > + * When it is time for background inode reclamation, we:
> > + *   - Set the IRECLAIM inode flag
> > + *   - Detach all the resources and remove the inode from the per-AG inode tree
> > + *   - Clear both IRECLAIM and RECLAIMABLE inode flags
> > + *   - Decrement the per-AG reclaim count
> > + *   - Clear the RECLAIM tag from the per-AG inode tree
> > + *   - Clear the RECLAIM tag from the per-fs AG tree if there are no more
> > + *     inodes waiting for reclamation or inactivation
> > + *
> > + * Note that xfs_inodegc_queue and xfs_inactive_grab are further up in
> > + * the source code so that we avoid static function declarations.
> > + */
> > +
> > +/* Inactivate this inode. */
> > +STATIC int
> 
> static
> 
> > +xfs_inactive_inode(
> > +	struct xfs_inode	*ip,
> > +	void			*args)
> > +{
> > +	struct xfs_eofblocks	*eofb = args;
> > +	struct xfs_perag	*pag;
> > +
> > +	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);
> 
> What condition is this trying to catch? It's something to do with
> freeze, but you haven't documented what happens to inodes with
> pending inactivation when a freeze is started....

Inactivation creates transactions, which means that we should never be
running this at FREEZE_FS time.  IOWs, it's a check that we can never
stall a kernel thread indefinitely because the fs is frozen.

We can continue to queue inodes for inactivation on a frozen filesystem,
and I was trying to avoid touching the umount lock in
xfs_perag_set_inactive_tag to find out if the fs is actually frozen and
therefore we shouldn't call xfs_inodegc_queue.

> > +
> > +	/*
> > +	 * Not a match for our passed in scan filter?  Put it back on the shelf
> > +	 * and move on.
> > +	 */
> > +	spin_lock(&ip->i_flags_lock);
> > +	if (!xfs_inode_matches_eofb(ip, eofb)) {
> > +		ip->i_flags &= ~XFS_INACTIVATING;
> > +		spin_unlock(&ip->i_flags_lock);
> > +		return 0;
> > +	}
> > +	spin_unlock(&ip->i_flags_lock);
> 
> IDGI. What do EOF blocks have to do with running inode inactivation
> on this inode?

This enables foreground threads that hit EDQUOT to look for inodes to
inactivate in order to free up quota'd resources.
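
Roughly, the filter works like the userspace sketch below.  It is a
simplified model: struct scan_filter, struct owner_model, the
SCAN_MATCH_* bits and model_inode_matches are hypothetical, and the
real xfs_inode_matches_eofb checks more than these two ids.  The one
property taken from the patch is that the background worker passes a
NULL filter, which therefore has to match every inode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical filter bits for a userspace model. */
#define SCAN_MATCH_UID	(1U << 0)
#define SCAN_MATCH_GID	(1U << 1)

struct scan_filter {
	unsigned int	flags;
	unsigned int	uid;
	unsigned int	gid;
};

struct owner_model {
	unsigned int	uid;
	unsigned int	gid;
};

/* Does this inode belong to the owner that just hit EDQUOT? */
static bool
model_inode_matches(const struct owner_model *ip,
		    const struct scan_filter *filter)
{
	assert(ip);

	/* Background worker: no filter, so inactivate everything. */
	if (filter == NULL)
		return true;

	if ((filter->flags & SCAN_MATCH_UID) && ip->uid != filter->uid)
		return false;
	if ((filter->flags & SCAN_MATCH_GID) && ip->gid != filter->gid)
		return false;
	return true;
}
```

A foreground thread that hits EDQUOT for uid 1000 would pass a filter
with SCAN_MATCH_UID set, so only that user's queued inodes are
inactivated and everything else goes back on the shelf.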

> > +
> > +	trace_xfs_inode_inactivating(ip);
> > +
> > +	xfs_inactive(ip);
> > +	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> > +
> > +	/*
> > +	 * Clear the inactive state flags and schedule a reclaim run once
> > +	 * we're done with the inactivations.  We must ensure that the inode
> > +	 * smoothly transitions from inactivating to reclaimable so that iget
> > +	 * cannot see either data structure midway through the transition.
> > +	 */
> > +	pag = xfs_perag_get(ip->i_mount,
> > +			XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino));
> > +	spin_lock(&pag->pag_ici_lock);
> > +	spin_lock(&ip->i_flags_lock);
> > +
> > +	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> > +	xfs_inode_clear_inactive_tag(pag, ip->i_ino);
> > +
> > +	__xfs_inode_set_reclaim_tag(pag, ip);
> > +
> > +	spin_unlock(&ip->i_flags_lock);
> > +	spin_unlock(&pag->pag_ici_lock);
> > +	xfs_perag_put(pag);
> > +
> > +	return 0;
> > +}
> 
> /me wonders if we really need a separate radix tree tag for
> inactivation.

No, we don't.  I only used a separate one to keep this separate from the
reclaim tag because you thought you might remove ICI_RECLAIM the last
time you and I talked about inactivation at the last LSFMM we both went
to.

> > +/*
> > + * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
> > + * corrupted, we still need to clear the INACTIVE iflag so that we can move
> > + * on to reclaiming the inode.
> > + */
> > +static int
> > +xfs_inodegc_free_space(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_eofblocks	*eofb)
> > +{
> > +	return xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
> > +			xfs_inactive_inode, eofb, XFS_ICI_INACTIVE_TAG);
> > +}
> 
> This could call the unreferenced reclaim AG walker now that all the reclaim
> throttling stuff has been removed from it...

Yep.  I could probably combine all three of the walkers into one
function since the series before this one shifts the usage model to the
same basic loop with switch() statements to figure out which functions
to call.

> > +/* Try to get inode inactivation moving. */
> > +void
> > +xfs_inodegc_worker(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_mount	*mp = container_of(to_delayed_work(work),
> > +					struct xfs_mount, m_inodegc_work);
> > +	int			error;
> > +
> > +	/*
> > +	 * We want to skip inode inactivation while the filesystem is frozen
> > +	 * because we don't want the inactivation thread to block while taking
> > +	 * sb_intwrite.  Therefore, we try to take sb_write for the duration
> > +	 * of the inactive scan -- a freeze attempt will block until we're
> > +	 * done here, and if the fs is past stage 1 freeze we'll bounce out
> > +	 * until things unfreeze.  If the fs goes down while frozen we'll
> > +	 * still have log recovery to clean up after us.
> > +	 */
> > +	if (!sb_start_write_trylock(mp->m_super))
> > +		return;
> > +
> > +	error = xfs_inodegc_free_space(mp, NULL);
> > +	if (error && error != -EAGAIN)
> > +		xfs_err(mp, "inode inactivation failed, error %d", error);
> > +
> > +	sb_end_write(mp->m_super);
> > +	xfs_inodegc_queue(mp);
> 
> Ok....
> 
> The way we've done this with other workqueue based background work
> is that the freeze flushes and stops the workqueue, then restarts it
> once the filesystem is thawed. This takes all the need for the
> background work to have to run the freeze gauntlet....
> 
> > +}
> > +
> > +/* Force all queued inode inactivation work to run immediately. */
> > +void
> > +xfs_inodegc_force(
> > +	struct xfs_mount	*mp)
> > +{
> > +	/*
> > +	 * In order to reset the delay timer to run immediately, we have to
> > +	 * cancel the work item and requeue it with a zero timer value.  We
> > +	 * don't care if the worker races with our requeue, because at worst
> > +	 * we iterate the radix tree and find no inodes to inactivate.
> > +	 */
> > +	if (!cancel_delayed_work(&mp->m_inodegc_work))
> > +		return;
> 
> We do? I thought we could mod the timer. Yeah:
> 
> 	mod_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> 
> will trigger the delayed work to run immediately...
> 
> > +
> > +	queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> > +	flush_delayed_work(&mp->m_inodegc_work);
> > +}
> 
> Yeah, that's a flush operation, not a force :)
> 
> > +/* Stop all queued inactivation work. */
> > +void
> > +xfs_inodegc_stop(
> > +	struct xfs_mount	*mp)
> > +{
> > +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> > +}
> 
> Should this flush first? i.e. it will cancel pending work, but if
> there is work running, it will wait for it to complete. Do we want
> the queued work run before stopping, or just kill it dead?

The only callers of this are unmount and freeze, so yes, I think it's fine
to let _sync flush the work before returning.

> 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 65897cb0cf2a..f20694f220c8 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1665,6 +1665,35 @@ xfs_inactive_ifree(
> >  	return 0;
> >  }
> >  
> > +/* Prepare inode for inactivation. */
> > +void
> > +xfs_inode_inactivation_prep(
> > +	struct xfs_inode	*ip)
> > +{
> > +	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> > +		return;
> > +
> > +	/*
> > +	 * If this inode is unlinked (and now unreferenced) we need to dispose
> > +	 * of it in the on disk metadata.
> > +	 *
> > +	 * Change the generation so that the inode can't be opened by handle
> > +	 * now that the last external references has dropped.  Bulkstat won't
> > +	 * return inodes with zero nlink so nobody will ever find this inode
> > +	 * again.  Then add this inode & blocks to the counts of things that
> > +	 * will be freed during the next inactivation run.
> > +	 */
> > +	if (VFS_I(ip)->i_nlink == 0)
> > +		VFS_I(ip)->i_generation = prandom_u32();
> 
> open by handle interfaces should not be able to open inodes that
> have a zero nlink, hence I'm not sure what changing the generation
> number actually buys us here...
> 
> If we can open nlink = 0 files via handles, then I think we've got
> a bug or two to fix....

I'm pretty sure this is made redundant by the NEED_INACTIVE check in
xfs_iget_check_free_state.

> > +	/*
> > +	 * Detach dquots just in case someone tries a quotaoff while the inode
> > +	 * is waiting on the inactive list.  We'll reattach them (if needed)
> > +	 * when inactivating the inode.
> > +	 */
> > +	xfs_qm_dqdetach(ip);
> > +}
> 
> I think the dquot handling needs better documentation as it impacts
> on the life cycle and interactions of dquots...

Ok.

> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 97f31308de03..b03b127e34cc 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2792,6 +2792,13 @@ xlog_recover_process_iunlinks(
> >  		}
> >  		xfs_buf_rele(agibp);
> >  	}
> > +
> > +	/*
> > +	 * Now that we've put all the iunlink inodes on the lru, let's make
> > +	 * sure that we perform all the on-disk metadata updates to actually
> > +	 * free those inodes.
> > +	 */
> 
> What LRU are we putting these inodes on? They are evicted from cache
> immediately. A comment simply to say:
> 
> 	/*
> 	 * Flush the pending unlinked inodes to ensure they are
> 	 * fully completed on disk and can be reclaimed before we
> 	 * signal that recovery is complete.
> 	 */

Ok, will fix.

> > +	xfs_inodegc_force(mp);
> >  }
> >  
> >  STATIC void
> 
> .....
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index 1c97b155a8ee..cd015e3d72fc 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -640,6 +640,10 @@ xfs_check_summary_counts(
> >   * so we need to unpin them, write them back and/or reclaim them before unmount
> >   * can proceed.
> >   *
> > + * Start the process by pushing all inodes through the inactivation process
> > + * so that all file updates to on-disk metadata can be flushed with the log.
> > + * After the AIL push, all inodes should be ready for reclamation.
> > + *
> >   * An inode cluster that has been freed can have its buffer still pinned in
> >   * memory because the transaction is still sitting in a iclog. The stale inodes
> >   * on that buffer will be pinned to the buffer until the transaction hits the
> > @@ -663,6 +667,7 @@ static void
> >  xfs_unmount_flush_inodes(
> >  	struct xfs_mount	*mp)
> >  {
> > +	xfs_inodegc_force(mp);
> >  	xfs_log_force(mp, XFS_LOG_SYNC);
> >  	xfs_extent_busy_wait_all(mp);
> >  	flush_workqueue(xfs_discard_wq);
> > @@ -670,6 +675,7 @@ xfs_unmount_flush_inodes(
> >  	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
> >  
> >  	xfs_ail_push_all_sync(mp->m_ail);
> > +	xfs_inodegc_stop(mp);
> 
> That looks wrong. Stopping the background inactivation should be
> done before we flush the AIL because background inactivation dirties
> inodes. So we should be stopping the inodegc the moment we've
> finished flushing out all the pending inactivations...

There shouldn't be any inactivation work queued at this point, so this
is merely a safeguard to kill the work just in case I screwed up
somewhere else. :)  It can probably go.

> Hmm. xfs_unmount_flush_inodes() doesn't ring a bell with me, and
> it's not in the current tree. Did I miss this in an earlier patch in
> this patchset, or something else?

It was added as a bugfix to 5.12-rc3 to fix a bug where we could dirty a
quota inode during mount, decide to abort the mount, and then stall
because nobody would actually force the log to flush the quota inode
changes to disk.

> >  	cancel_delayed_work_sync(&mp->m_reclaim_work);
> >  	xfs_reclaim_inodes(mp);
> >  	xfs_health_unmount(mp);
> > @@ -1095,6 +1101,13 @@ xfs_unmountfs(
> >  	uint64_t		resblks;
> >  	int			error;
> >  
> > +	/*
> > +	 * Perform all on-disk metadata updates required to inactivate inodes.
> > +	 * Since this can involve finobt updates, do it now before we lose the
> > +	 * per-AG space reservations.
> > +	 */
> > +	xfs_inodegc_force(mp);
> > +
> 
> I can't tell why this is necessary given what
> xfs_unmount_flush_inodes() does. Or, alternatively, why
> xfs_unmount_flush_inodes() can do what it does without caring about
> per-ag space reservations....
> 
> > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > index ca1b57d291dc..0f9a1450fe0e 100644
> > --- a/fs/xfs/xfs_qm_syscalls.c
> > +++ b/fs/xfs/xfs_qm_syscalls.c
> > @@ -104,6 +104,12 @@ xfs_qm_scall_quotaoff(
> >  	uint			inactivate_flags;
> >  	struct xfs_qoff_logitem	*qoffstart = NULL;
> >  
> > +	/*
> > +	 * Clean up the inactive list before we turn quota off, to reduce the
> > +	 * amount of quotaoff work we have to do with the mutex held.
> > +	 */
> > +	xfs_inodegc_force(mp);
> > +
> 
> Hmmm. why not just stop background inactivation altogether while
> quotaoff runs? i.e. just do normal, inline inactivation when
> quotaoff is running, and then we can get rid of the whole "drop
> dquot references" issue that background inactivation has...

I suppose that would have an advantage that quotaoff could switch to
foreground inactivation, flush the pending inactivation work to release
the dquot references, and then dqflush_all to dump the dquots
altogether.

How do we add the ability to switch behaviors, though?  The usual percpu
rwsem that protects a flag?

> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index e774358383d6..8d0142487fc7 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -637,28 +637,34 @@ xfs_fs_destroy_inode(
> >  	struct inode		*inode)
> >  {
> >  	struct xfs_inode	*ip = XFS_I(inode);
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	bool			need_inactive;
> >  
> >  	trace_xfs_destroy_inode(ip);
> >  
> >  	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> > -	XFS_STATS_INC(ip->i_mount, vn_rele);
> > -	XFS_STATS_INC(ip->i_mount, vn_remove);
> > +	XFS_STATS_INC(mp, vn_rele);
> > +	XFS_STATS_INC(mp, vn_remove);
> >  
> > -	xfs_inactive(ip);
> > -
> > -	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
> > +	need_inactive = xfs_inode_needs_inactivation(ip);
> > +	if (need_inactive) {
> > +		trace_xfs_inode_set_need_inactive(ip);
> > +		xfs_inode_inactivation_prep(ip);
> > +	} else if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
> >  		xfs_check_delalloc(ip, XFS_DATA_FORK);
> >  		xfs_check_delalloc(ip, XFS_COW_FORK);
> >  		ASSERT(0);
> >  	}
> 
> Isn't this i_delayed_blks check still valid even for inodes that
> need background inactivation? i.e. all dirty data has been flushed
> at this point, and so i_delayed_blks should be zero for all
> inodes regardless of whether they need inactivation or not....

Hmm, I think that is true.

> 
> > -
> > -	XFS_STATS_INC(ip->i_mount, vn_reclaim);
> > +	XFS_STATS_INC(mp, vn_reclaim);
> > +	trace_xfs_inode_set_reclaimable(ip);
> >  
> >  	/*
> >  	 * We should never get here with one of the reclaim flags already set.
> >  	 */
> >  	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
> >  	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
> > +	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_NEED_INACTIVE));
> > +	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_INACTIVATING));
> 
> This should probably be opencoded instead of taking the flags
> spinlock 4 times...

Urk, yes.
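
For illustration, a minimal userspace sketch of the open-coded check: OR the
four flags into one mask and test it once, so the flags lock would only need
to be taken a single time.  The DEMO_* names and bit values are hypothetical,
not the real XFS definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only; not the real XFS values. */
#define DEMO_IRECLAIMABLE	(1u << 0)
#define DEMO_IRECLAIM		(1u << 1)
#define DEMO_NEED_INACTIVE	(1u << 2)
#define DEMO_INACTIVATING	(1u << 3)

#define DEMO_TEARDOWN_FLAGS \
	(DEMO_IRECLAIMABLE | DEMO_IRECLAIM | \
	 DEMO_NEED_INACTIVE | DEMO_INACTIVATING)

/*
 * Return true if none of the teardown flags is set.  In the kernel the
 * caller would hold the flags lock once around this single test instead
 * of locking and unlocking once per flag.
 */
bool demo_no_teardown_flags(uint32_t flags)
{
	return (flags & DEMO_TEARDOWN_FLAGS) == 0;
}
```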

> >  
> >  	/*
> >  	 * We always use background reclaim here because even if the inode is
> > @@ -667,7 +673,10 @@ xfs_fs_destroy_inode(
> >  	 * reclaim path handles this more efficiently than we can here, so
> >  	 * simply let background reclaim tear down all inodes.
> >  	 */
> > -	xfs_inode_set_reclaim_tag(ip);
> > +	if (need_inactive)
> > +		xfs_inode_set_inactive_tag(ip);
> > +	else
> > +		xfs_inode_set_reclaim_tag(ip);
> >  }
> >  
> >  static void
> > @@ -797,6 +806,13 @@ xfs_fs_statfs(
> >  	xfs_extlen_t		lsize;
> >  	int64_t			ffree;
> >  
> > +	/*
> > +	 * Process all the queued file and speculative preallocation cleanup so
> > +	 * that the counter values we report here do not incorporate any
> > +	 * resources that were previously deleted.
> > +	 */
> > +	xfs_inodegc_force(mp);
> 
> Same comment as for xfs_fs_counts()....
> > +
> >  	statp->f_type = XFS_SUPER_MAGIC;
> >  	statp->f_namelen = MAXNAMELEN - 1;
> >  
> > @@ -911,6 +927,18 @@ xfs_fs_unfreeze(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Before we get to stage 1 of a freeze, force all the inactivation work so
> > + * that there's less work to do if we crash during the freeze.
> > + */
> > +STATIC int
> > +xfs_fs_freeze_super(
> > +	struct super_block	*sb)
> > +{
> > +	xfs_inodegc_force(XFS_M(sb));
> > +	return freeze_super(sb);
> > +}
> 
> Yeah, definitely need a description of freeze interactions...

Flush all the pending work before we let the VFS start the freezing
process, and then we don't run inactivation after that.

> > @@ -1720,6 +1749,13 @@ xfs_remount_ro(
> >  		return error;
> >  	}
> >  
> > +	/*
> > +	 * Perform all on-disk metadata updates required to inactivate inodes.
> > +	 * Since this can involve finobt updates, do it now before we lose the
> > +	 * per-AG space reservations.
> > +	 */
> > +	xfs_inodegc_force(mp);
> 
> Should we stop background inactivation, because we can't make
> modifications anymore and hence background inactivation makes little
> sense...

We don't actually stop background gc transactions or other internal
updates on readonly filesystems -- the ro part means only that we don't
let /userspace/ change anything directly.  If you open a file readonly,
unlink it, freeze the fs, and close the file, we'll still free it.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-23  4:00     ` Darrick J. Wong
@ 2021-03-23  5:19       ` Dave Chinner
  2021-03-24  2:04         ` Darrick J. Wong
  2021-03-24 17:53       ` Christoph Hellwig
  1 sibling, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2021-03-23  5:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 22, 2021 at 09:00:37PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 23, 2021 at 12:44:17PM +1100, Dave Chinner wrote:
> > On Wed, Mar 10, 2021 at 07:06:13PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> > > defer the inactivation phase to a separate workqueue.  With this we
> > > avoid blocking memory reclaim on filesystem metadata updates that are
> > > necessary to free an in-core inode, such as post-eof block freeing, COW
> > > staging extent freeing, and truncating and freeing unlinked inodes.  Now
> > > that work is deferred to a workqueue where we can do the freeing in
> > > batches.
> > > 
> > > > We introduce two new inode flags -- NEED_INACTIVE and INACTIVATING.
> > > > The first flag helps our worker find inodes needing inactivation, and
> > > > the second flag marks inodes that are in the process of being
> > > > inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> > > > inode by clearing NEED_INACTIVE (or bailing if INACTIVATING is set).
> > > 
> > > Unfortunately, deferring the inactivation has one huge downside --
> > > eventual consistency.  Since all the freeing is deferred to a worker
> > > thread, one can rm a file but the space doesn't come back immediately.
> > > This can cause some odd side effects with quota accounting and statfs,
> > > so we also force inactivation scans in order to maintain the existing
> > > behaviors, at least outwardly.
> > > 
> > > For this patch we'll set the delay to zero to mimic the old timing as
> > > much as possible; in the next patch we'll play with different delay
> > > settings.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ....
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index a2a407039227..3a3baf56198b 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -19,6 +19,8 @@
> > >  #include "xfs_log.h"
> > >  #include "xfs_ag.h"
> > >  #include "xfs_ag_resv.h"
> > > +#include "xfs_inode.h"
> > > +#include "xfs_icache.h"
> > >  
> > >  /*
> > >   * growfs operations
> > > @@ -290,6 +292,13 @@ xfs_fs_counts(
> > >  	xfs_mount_t		*mp,
> > >  	xfs_fsop_counts_t	*cnt)
> > >  {
> > > +	/*
> > > +	 * Process all the queued file and speculative preallocation cleanup so
> > > +	 * that the counter values we report here do not incorporate any
> > > +	 * resources that were previously deleted.
> > > +	 */
> > > +	xfs_inodegc_force(mp);
> > 
> > xfs_fs_counts() is supposed to be a quick, non-blocking summary of
> > the state - it can never supply userspace with accurate values
> > because they are wrong even before the ioctl returns to userspace.
> > Hence we do not attempt to make them correct, just use a fast, point
> > in time sample of the current counter values.
> > 
> > So this seems like an unnecessarily heavyweight operation
> > to add to this function....
> 
> I agree, xfs_inodegc_force is a heavyweight operation to add to statvfs
> and (further down) the quota reporting ioctl.  I added these calls to
> maintain the user-visible behavior that one can df a mount, rm -rf a
> 30T directory tree, df again, and observe a 30T difference in available
> space between the two df calls.
>
> There are a lot of fstests that require this kind of behavior to pass.
> In my internal testing without this bit applied, I also got complaints
> about breaking the user-behavior of XFS that people have gotten used to.

Yeah, that's messy, but I see a potential problem here with space
monitoring apps that poll the filesystem frequently to check space
usage. That's going to override whatever your background "do work"
setting is going to be...

> Earlier revisions of this patchset tried to maintain counts of the
> resources used by the inactivated inode so that we could adjust the
> values reported by statvfs and the quota reporting ioctl.  This meant we
> didn't have to delay either call at all, but it turns out that it's
> not feasible to maintain an accurate count of inactive resources because
> any resources that are shared at destroy_inode time cannot become part
> of this liar counter and consulting the refcountbt to decide which
> extents should be added just makes unlinking even slower.  Worse yet,
> unsharing of shared blocks attached to queued inactive inodes implies
> either that we have to update the liar counter or that we have to be ok
> with the free block count fluctuating for a while after a deletion if
> that deletion ends up freeing more space than the liar counter thinks
> we can free by flushing inactivation.

So the main problem is block accounting. Non-reflink stuff is easy
(the equivalent of delalloc accounting) but reflink is hard.

> Hmm, maybe this could maintain an approximate liar counter and only flush
> inactivation when the liar counter would cause us to be off by more than
> some configurable amount?  The fstests that care about free space
> accounting are not going to be happy since they are measured with very
> tight tolerances.

I'd prefer something that doesn't require a magic heuristic. I don't
have any better ideas right now, so let's just go with what you have
and see what falls out...

> > > @@ -233,6 +242,94 @@ xfs_inode_clear_reclaim_tag(
> > >  	xfs_perag_clear_reclaim_tag(pag);
> > >  }
> > >  
> > > +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> > > +static void
> > > +xfs_inodegc_queue(
> > > +	struct xfs_mount        *mp)
> > > +{
> > > +	rcu_read_lock();
> > > +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG))
> > > +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
> > > +				2 * HZ);
> > > +	rcu_read_unlock();
> > > +}
> > 
> > Why half a second and not something referenced against the inode
> > reclaim/sync period?
> 
> It's actually 2 seconds, and the next patch adds a knob to tweak the
> default value.

Ugh, 2 * HZ != 2Hz. Stupid bad generic timer code, always trips me
over.
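
To spell out the arithmetic: the delay passed to queue_delayed_work() is
counted in jiffies, and HZ is the number of jiffies per second, so 2 * HZ is
two seconds and HZ / 2 would be half a second.  A small userspace sketch,
where DEMO_HZ and the helper names are made up for the example:

```c
/*
 * Hypothetical tick rate for illustration.  The kernel's HZ is a build-time
 * configuration value, so the number of ticks in a second varies by build.
 */
#define DEMO_HZ	250UL

/* Convert whole seconds to ticks: one second is DEMO_HZ ticks. */
unsigned long demo_secs_to_ticks(unsigned long secs)
{
	return secs * DEMO_HZ;
}

/* Convert ticks back to milliseconds. */
unsigned long demo_ticks_to_msecs(unsigned long ticks)
{
	return ticks * 1000UL / DEMO_HZ;
}
```

With these helpers, demo_ticks_to_msecs(2 * DEMO_HZ) is 2000 and
demo_ticks_to_msecs(DEMO_HZ / 2) is 500, whatever value DEMO_HZ is given.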

> The first version of this patchset from 2017 actually did just use
> (6 * xfs_syncd_centisecs / 10) like reclaim does.  This turned out to be
> pretty foolish because that meant that reclaim and inactivation would
> start at the same time, and because inactivation is slow, most of them
> would miss the reclaim window and sit around pointlessly until the
> next one.
> 
> The next iteration from mid 2019 changed this to (xfs_syncd_centisecs/5)
> which fixed that, but large deltree storms could lead to so many inodes
> being inactivated that we'd still miss the reclaim window sometimes.
> Around this time I got my djwong-dev tree hooked up to the ktest robot
> and it started complaining about performance regressions and noticeably
> higher slab usage for xfs inodes and log items.

Right, I was thinking more along the lines of "run inactivation
twice for every background inode reclaim pass". It's clear that what
you were struggling with was that the interaction between the two
running at similar periods is not good, and hence no matter what the
background reclaim period is, we should process inactivated inodes
at least a couple of times per reclaim period...

> The next time I got back to this was shortly after Dave cleaned up the
> reclaim behavior (2020) to be driven by the AIL, which mostly fixed the
> performance complaints, except for the one about AIM7.  I was intrigued
> enough by this to instrument the patchset and fstests and the fstests
> cloud hosts <cough> to see if I could derive a reasonable default value.
> 
> I've observed through experimentation that 2 seconds seems like a good
> default value -- it's long enough to enable a lot of batching of
> inactive inodes, but short enough that the background thread can
> throttle the foreground threads by competing for the log grant heads.

Right, it ends up about 2x per reclaim period by default. :)

> I also noticed that the amount of overhead introduced by background
> inactivation (as measured by fstests run times and other <cough>
> performance tests) ranged from minimal at 0 seconds to about 20% at
> (6*xfs_syncd_centisecs/10).

Which is about a 20s period. Yeah, that's way too long...

> Honestly, this could just be zero.  Assuming your distro has power
> efficient workqueues enabled, the ~4-10ms delay introduced by that is
> enough to realize some batching advantage with zero noticeable effect on
> performance.

Yeah, the main benefit is moving it into the background so that the
syscall completion isn't running the entire inode inactivation pass.
That moves almost 50% of the unlink processing off to another thread
which is what we want for rm -rf workloads. Keeping the batch size
small is probably the best place to start with this - just enough
inodes to keep a CPU busy for a scheduler tick?


> > >  static void
> > >  xfs_inew_wait(
> > >  	struct xfs_inode	*ip)
> > > @@ -298,6 +395,13 @@ xfs_iget_check_free_state(
> > >  	struct xfs_inode	*ip,
> > >  	int			flags)
> > >  {
> > > +	/*
> > > +	 * Unlinked inodes awaiting inactivation must not be reused until we
> > > +	 * have a chance to clear the on-disk metadata.
> > > +	 */
> > > +	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
> > > +		return -ENOENT;
> > > +
> > >  	if (flags & XFS_IGET_CREATE) {
> > >  		/* should be a free inode */
> > >  		if (VFS_I(ip)->i_mode != 0) {
> > 
> > How do we get here with an XFS_NEED_INACTIVE inode?
> > xfs_iget_check_free_state() is only called from the cache miss path,
> 
> You added it to xfs_iget_cache_hit in 2018, commit afca6c5b2595f...

Oh, cscope fail:

  File             Function                  Line
0 xfs/xfs_icache.c xfs_iget_check_free_state 297 xfs_iget_check_free_state(
1 xfs/xfs_icache.c __releases                378 error = xfs_iget_check_free_state(ip, flags);
2 xfs/xfs_icache.c xfs_iget_cache_miss       530 error = xfs_iget_check_free_state(ip, flags);

"__releases" is a sparse annotation, so it didn't trigger that this
was actually in xfs_iget_cache_hit()...

Never mind...

> > > @@ -713,6 +904,43 @@ xfs_icache_inode_is_allocated(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Grab the inode for inactivation exclusively.
> > > + * Return true if we grabbed it.
> > > + */
> > > +static bool
> > > +xfs_inactive_grab(
> > > +	struct xfs_inode	*ip)
> > > +{
> > > +	ASSERT(rcu_read_lock_held());
> > > +
> > > +	/* quick check for stale RCU freed inode */
> > > +	if (!ip->i_ino)
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * The radix tree lock here protects a thread in xfs_iget from racing
> > > +	 * with us starting reclaim on the inode.
> > > +	 *
> > > +	 * Due to RCU lookup, we may find inodes that have been freed and only
> > > +	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
> > > +	 * aren't candidates for reclaim at all, so we must check the
> > > +	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> > > +	 * Obviously if XFS_NEED_INACTIVE isn't set then we ignore this inode.
> > > +	 */
> > > +	spin_lock(&ip->i_flags_lock);
> > > +	if (!(ip->i_flags & XFS_NEED_INACTIVE) ||
> > > +	    (ip->i_flags & XFS_INACTIVATING)) {
> > > +		/* not a inactivation candidate. */
> > > +		spin_unlock(&ip->i_flags_lock);
> > > +		return false;
> > > +	}
> > > +
> > > +	ip->i_flags |= XFS_INACTIVATING;
> > > +	spin_unlock(&ip->i_flags_lock);
> > > +	return true;
> > > +}
> > > +
> > >  /*
> > >   * The inode lookup is done in batches to keep the amount of lock traffic and
> > >   * radix tree lookups to a minimum. The batch size is a trade off between
> > > @@ -736,6 +964,9 @@ xfs_inode_walk_ag_grab(
> > >  
> > >  	ASSERT(rcu_read_lock_held());
> > >  
> > > +	if (flags & XFS_INODE_WALK_INACTIVE)
> > > +		return xfs_inactive_grab(ip);
> > > +
> > 
> > Hmmm. This doesn't actually grab the inode. It's an unreferenced
> > inode walk, in a function that assumes that the grab() call returns
> > a referenced inode. Why isn't this using the inode reclaim walk
> > which is intended to walk unreferenced inodes?
> 
> Because I thought that some day you might want to rebase the inode
> reclaim cleanups from 2019 and didn't want to slow either of us down by
> forcing a gigantic rebase.  So I left the duplicative inode walk
> functions.
> 
> FWIW these are currently separate functions with separate call sites in
> xfs_inode_walk_ag since the "remove indirect calls from inode walk"
> series made it more convenient to have a separate function for each tag.
> 
> As for the name ... reclaim also has a "grab" function even though it
> walks unreferenced inodes.

Sure, but the reclaim code was always a special "unreferenced"
lookup that just used the same code structure. It never mixed
"igrab()" with unreferenced inode pinning...

> > > +xfs_inactive_inode(
> > > +	struct xfs_inode	*ip,
> > > +	void			*args)
> > > +{
> > > +	struct xfs_eofblocks	*eofb = args;
> > > +	struct xfs_perag	*pag;
> > > +
> > > +	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);
> > 
> > What condition is this trying to catch? It's something to do with
> > freeze, but you haven't documented what happens to inodes with
> > pending inactivation when a freeze is started....
> 
> Inactivation creates transactions, which means that we should never be
> running this at FREEZE_FS time.  IOWs, it's a check that we can never
> stall a kernel thread indefinitely because the fs is frozen.

What's the problem with doing that to a dedicated worker thread?  We
currently stall inactivation on a frozen filesystem if a transaction
is required.

> We can continue to queue inodes for inactivation on a frozen filesystem,
> and I was trying to avoid touching the umount lock in
> xfs_perag_set_inactive_tag to find out if the fs is actually frozen and
> therefore we shouldn't call xfs_inodegc_queue.

I think stopping background inactivation for frozen filesystems makes
more sense than this...

> > > +
> > > +	/*
> > > +	 * Not a match for our passed in scan filter?  Put it back on the shelf
> > > +	 * and move on.
> > > +	 */
> > > +	spin_lock(&ip->i_flags_lock);
> > > +	if (!xfs_inode_matches_eofb(ip, eofb)) {
> > > +		ip->i_flags &= ~XFS_INACTIVATING;
> > > +		spin_unlock(&ip->i_flags_lock);
> > > +		return 0;
> > > +	}
> > > +	spin_unlock(&ip->i_flags_lock);
> > 
> > IDGI. What do EOF blocks have to do with running inode inactivation
> > on this inode?
> 
> This enables foreground threads that hit EDQUOT to look for inodes to
> inactivate in order to free up quota'd resources.

Not very obvious - better comment, please?

> > I can't tell why this is necessary given what
> > xfs_unmount_flush_inodes() does. Or, alternatively, why
> > xfs_unmount_flush_inodes() can do what it does without caring about
> > per-ag space reservations....
> > 
> > > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > > index ca1b57d291dc..0f9a1450fe0e 100644
> > > --- a/fs/xfs/xfs_qm_syscalls.c
> > > +++ b/fs/xfs/xfs_qm_syscalls.c
> > > @@ -104,6 +104,12 @@ xfs_qm_scall_quotaoff(
> > >  	uint			inactivate_flags;
> > >  	struct xfs_qoff_logitem	*qoffstart = NULL;
> > >  
> > > +	/*
> > > +	 * Clean up the inactive list before we turn quota off, to reduce the
> > > +	 * amount of quotaoff work we have to do with the mutex held.
> > > +	 */
> > > +	xfs_inodegc_force(mp);
> > > +
> > 
> > Hmmm. why not just stop background inactivation altogether while
> > quotaoff runs? i.e. just do normal, inline inactivation when
> > quotaoff is running, and then we can get rid of the whole "drop
> > dquot references" issue that background inactivation has...
> 
> I suppose that would have an advantage that quotaoff could switch to
> foreground inactivation, flush the pending inactivation work to release
> the dquot references, and then dqflush_all to dump the dquots
> altogether.
> 
> How do we add the ability to switch behaviors, though?  The usual percpu
> rwsem that protects a flag?

That's overkill.  Global synchronisation doesn't need complex
structures, just a low cost reader path.

All we need is an atomic bit that we can test via test_bit().
test_bit() is not a locked operation, but it is atomic. Hence most
of the time it is a shared cacheline and hence has near zero cost to
check as it can be shared across all CPUs.

Set the flag to turn off background inactivation, then all future
inactivations will be foreground. Then flush and stop the inodegc
work queue.  When we finish processing the last inactivated inode,
the background work stops (i.e. it is not requeued).  No more
pending background work.

Clear the flag to turn background inactivation back on. The first
inode queued will restart that background work...
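
A rough userspace model of that scheme, purely as a sketch: the struct, the
function names and the counters are all hypothetical, a C11 atomic_bool read
with a relaxed load stands in for the kernel's test_bit(), and the worker
flush is modelled by draining a counter.  The plain int counters make this a
single-threaded model only.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mount state; not the real xfs_mount. */
struct demo_mount {
	atomic_bool	bg_off;		/* set: inactivate in the caller's context */
	int		pending;	/* inodes queued for the background worker */
	int		inactivated;	/* inodes whose inactivation has finished */
};

void demo_mount_init(struct demo_mount *mp)
{
	atomic_init(&mp->bg_off, false);
	mp->pending = 0;
	mp->inactivated = 0;
}

/* Returns true if the inode was queued, false if it was handled inline. */
bool demo_destroy_inode(struct demo_mount *mp)
{
	/* Cheap read-mostly check, standing in for test_bit(). */
	if (atomic_load_explicit(&mp->bg_off, memory_order_relaxed)) {
		mp->inactivated++;	/* foreground inactivation */
		return false;
	}
	mp->pending++;			/* background inactivation */
	return true;
}

/* Set the flag first, then drain whatever the worker still has queued. */
void demo_inodegc_stop(struct demo_mount *mp)
{
	atomic_store(&mp->bg_off, true);
	mp->inactivated += mp->pending;
	mp->pending = 0;
}

/* Clear the flag; the next destroyed inode is queued again. */
void demo_inodegc_start(struct demo_mount *mp)
{
	atomic_store(&mp->bg_off, false);
}
```

The ordering in demo_inodegc_stop() is the point: the flag is set before the
drain, so nothing new lands on the queue while it is being emptied.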

> > > @@ -1720,6 +1749,13 @@ xfs_remount_ro(
> > >  		return error;
> > >  	}
> > >  
> > > +	/*
> > > +	 * Perform all on-disk metadata updates required to inactivate inodes.
> > > +	 * Since this can involve finobt updates, do it now before we lose the
> > > +	 * per-AG space reservations.
> > > +	 */
> > > +	xfs_inodegc_force(mp);
> > 
> > Should we stop background inactivation, because we can't make
> > modifications anymore and hence background inactivation makes little
> > sense...
> 
> We don't actually stop background gc transactions or other internal
> updates on readonly filesystems

Yes we do - that's what xfs_blockgc_stop() higher up in this
function does. xfs_log_clean() further down in the function also
stops the background log work (that covers the log when idle)
because xfs_remount_ro() leaves the log clean.

These all get restarted in xfs_remount_rw()....

> -- the ro part means only that we don't
> let /userspace/ change anything directly.  If you open a file readonly,
> unlink it, freeze the fs, and close the file, we'll still free it.

How do you unlink the file on a RO mount?

And if it's a rw mount that is frozen, it will block on the first
transaction in the inactivation process from close(), and block
there until the filesystem is unfrozen.

It's pretty clear to me that we want frozen filesystems to
turn off background inactivation so that we can block things like
this in the syscall context and not have to deal with the complexity
of freeze or read-only mounts in the background inactivation code at
all..

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 10/11] xfs: parallelize inode inactivation
  2021-03-11  3:06 ` [PATCH 10/11] xfs: parallelize inode inactivation Darrick J. Wong
  2021-03-15 18:55   ` Christoph Hellwig
@ 2021-03-23 22:21   ` Dave Chinner
  2021-03-24  3:52     ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2021-03-23 22:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:06:36PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Split the inode inactivation work into per-AG work items so that we can
> take advantage of parallelization.

How does this scale out when we have thousands of AGs?

I'm guessing that the gc_workqueue has the default "unbound"
parallelism that means it will run up to 4 kworkers per CPU at a
time? Which means we could have hundreds of ags trying to hammer on
inactivations at the same time? And so bash hard on the log and
completely starve the syscall front end of log space?

It seems to me that this needs to bound the amount of concurrent
work to quite low numbers - even though it is per-ag, we do not want
this to swamp the system in kworkers blocked on log reservations
when such concurrency is not necessary.
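
One way to picture such a bound, as a userspace sketch only: a counter gate
that admits at most max_active per-AG work items at a time and makes the rest
wait their turn.  The demo_* names are hypothetical; in the kernel the
analogous knob would presumably be the max_active argument to
alloc_workqueue().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate limiting how many per-AG work items run at once. */
struct demo_gc_gate {
	atomic_int	active;		/* work items currently running */
	int		max_active;	/* cap on concurrent work items */
};

void demo_gate_init(struct demo_gc_gate *g, int max_active)
{
	assert(max_active > 0);
	atomic_init(&g->active, 0);
	g->max_active = max_active;
}

/* Try to claim a slot; false means this AG should stay queued for later. */
bool demo_gate_try_enter(struct demo_gc_gate *g)
{
	int cur = atomic_load(&g->active);

	while (cur < g->max_active) {
		/* On failure the CAS reloads cur, so the bound is rechecked. */
		if (atomic_compare_exchange_weak(&g->active, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Release the slot once the work item has finished. */
void demo_gate_exit(struct demo_gc_gate *g)
{
	atomic_fetch_sub(&g->active, 1);
}
```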

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/11] xfs: create a polled function to force inode inactivation
  2021-03-11  3:06 ` [PATCH 11/11] xfs: create a polled function to force " Darrick J. Wong
@ 2021-03-23 22:31   ` Dave Chinner
  2021-03-24  3:34     ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2021-03-23 22:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Mar 10, 2021 at 07:06:41PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create a polled version of xfs_inactive_force so that we can force
> inactivation while holding a lock (usually the umount lock) without
> tripping over the softlockup timer.  This is for callers that hold vfs
> locks while calling inactivation, which is currently unmount, iunlink
> processing during mount, and rw->ro remount.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |   38 +++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_icache.h |    1 +
>  fs/xfs/xfs_mount.c  |    2 +-
>  fs/xfs/xfs_mount.h  |    5 +++++
>  fs/xfs/xfs_super.c  |    3 ++-
>  5 files changed, 46 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index d5f580b92e48..9db2beb4e732 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -25,6 +25,7 @@
>  #include "xfs_ialloc.h"
>  
>  #include <linux/iversion.h>
> +#include <linux/nmi.h>

This stuff goes in fs/xfs/xfs_linux.h, not here.

>  
>  /*
>   * Allocate and initialise an xfs_inode.
> @@ -2067,8 +2068,12 @@ xfs_inodegc_free_space(
>  	struct xfs_mount	*mp,
>  	struct xfs_eofblocks	*eofb)
>  {
> -	return xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
> +	int			error;
> +
> +	error = xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
>  			xfs_inactive_inode, eofb, XFS_ICI_INACTIVE_TAG);
> +	wake_up(&mp->m_inactive_wait);
> +	return error;
>  }
>  
>  /* Try to get inode inactivation moving. */
> @@ -2138,6 +2143,37 @@ xfs_inodegc_force(
>  	flush_workqueue(mp->m_gc_workqueue);
>  }
>  
> +/*
> + * Force all inode inactivation work to run immediately, and poll until the
> + * work is complete.  Callers should only use this function if they must
> + * inactivate inodes while holding VFS locks, and must be prepared to prevent
> + * or to wait for inodes that are queued for inactivation while this runs.
> + */
> +void
> +xfs_inodegc_force_poll(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_perag	*pag;
> +	xfs_agnumber_t		agno;
> +	bool			queued = false;
> +
> +	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
> +		queued |= xfs_inodegc_force_pag(pag);
> +	if (!queued)
> +		return;
> +
> +	/*
> +	 * Touch the softlockup watchdog every 1/10th of a second while there
> +	 * are still inactivation-tagged inodes in the filesystem.
> +	 */
> +	while (!wait_event_timeout(mp->m_inactive_wait,
> +				   !radix_tree_tagged(&mp->m_perag_tree,
> +						      XFS_ICI_INACTIVE_TAG),
> +				   HZ / 10)) {
> +		touch_softlockup_watchdog();
> +	}
> +}

This looks like a deadlock waiting to be tripped over. As long as
there is something still able to queue inodes for inactivation,
that radix tree tag check will always trigger and put us back to
sleep.
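
To illustrate the concern with a userspace sketch: a wait condition of
"nothing is tagged" can be held off indefinitely by new arrivals, whereas a
condition snapshotted when the flush starts cannot.  Everything here is
hypothetical (names, counters, and the assumption that work completes in
queue order), and the snapshot variant is only a contrast for the example,
not something proposed in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical running totals; assumes work completes in queue order. */
struct demo_gc_counters {
	unsigned long	queued;		/* inodes ever queued for inactivation */
	unsigned long	completed;	/* inodes ever inactivated */
};

void demo_queue(struct demo_gc_counters *c, unsigned long n)
{
	c->queued += n;
}

void demo_complete(struct demo_gc_counters *c, unsigned long n)
{
	assert(c->completed + n <= c->queued);
	c->completed += n;
}

/* "No tagged inodes": stays false for as long as new inodes keep arriving. */
bool demo_idle(const struct demo_gc_counters *c)
{
	return c->completed == c->queued;
}

/* Snapshot how much work existed when the flush began. */
unsigned long demo_flush_begin(const struct demo_gc_counters *c)
{
	return c->queued;
}

/* True once everything queued before the snapshot has been processed. */
bool demo_flush_done(const struct demo_gc_counters *c, unsigned long target)
{
	return c->completed >= target;
}
```

If three inodes are queued, the flush begins, and two more arrive while the
first three are processed, demo_idle() is still false but demo_flush_done()
is true for the snapshot taken at the start.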

Also, in terms of workqueues, this is a "sync flush" because we
are waiting for it. e.g. the difference between cancel_work() and
cancel_work_sync() is that the latter waits for all the work in
progress to complete before returning and the former doesn't wait...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-23  5:19       ` Dave Chinner
@ 2021-03-24  2:04         ` Darrick J. Wong
  2021-03-24  4:57           ` Dave Chinner
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-24  2:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 23, 2021 at 04:19:07PM +1100, Dave Chinner wrote:
> On Mon, Mar 22, 2021 at 09:00:37PM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 23, 2021 at 12:44:17PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 10, 2021 at 07:06:13PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
> > > > defer the inactivation phase to a separate workqueue.  With this we
> > > > avoid blocking memory reclaim on filesystem metadata updates that are
> > > > necessary to free an in-core inode, such as post-eof block freeing, COW
> > > > staging extent freeing, and truncating and freeing unlinked inodes.  Now
> > > > that work is deferred to a workqueue where we can do the freeing in
> > > > batches.
> > > > 
> > > > > We introduce two new inode flags -- NEED_INACTIVE and INACTIVATING.
> > > > > The first flag helps our worker find inodes needing inactivation, and
> > > > > the second flag marks inodes that are in the process of being
> > > > > inactivated.  A concurrent xfs_iget on the inode can still resurrect the
> > > > > inode by clearing NEED_INACTIVE (or bailing if INACTIVATING is set).
> > > > 
> > > > Unfortunately, deferring the inactivation has one huge downside --
> > > > eventual consistency.  Since all the freeing is deferred to a worker
> > > > thread, one can rm a file but the space doesn't come back immediately.
> > > > This can cause some odd side effects with quota accounting and statfs,
> > > > so we also force inactivation scans in order to maintain the existing
> > > > behaviors, at least outwardly.
> > > > 
> > > > For this patch we'll set the delay to zero to mimic the old timing as
> > > > much as possible; in the next patch we'll play with different delay
> > > > settings.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ....
> > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > index a2a407039227..3a3baf56198b 100644
> > > > --- a/fs/xfs/xfs_fsops.c
> > > > +++ b/fs/xfs/xfs_fsops.c
> > > > @@ -19,6 +19,8 @@
> > > >  #include "xfs_log.h"
> > > >  #include "xfs_ag.h"
> > > >  #include "xfs_ag_resv.h"
> > > > +#include "xfs_inode.h"
> > > > +#include "xfs_icache.h"
> > > >  
> > > >  /*
> > > >   * growfs operations
> > > > @@ -290,6 +292,13 @@ xfs_fs_counts(
> > > >  	xfs_mount_t		*mp,
> > > >  	xfs_fsop_counts_t	*cnt)
> > > >  {
> > > > +	/*
> > > > +	 * Process all the queued file and speculative preallocation cleanup so
> > > > +	 * that the counter values we report here do not incorporate any
> > > > +	 * resources that were previously deleted.
> > > > +	 */
> > > > +	xfs_inodegc_force(mp);
> > > 
> > > xfs_fs_counts() is supposed to be a quick, non-blocking summary of
> > > the state - it can never supply userspace with accurate values
> > > because they are wrong even before the ioctl returns to userspace.
> > > Hence we do not attempt to make them correct, just use a fast, point
> > > in time sample of the current counter values.
> > > 
> > > So this seems like an unnecessarily heavyweight operation
> > > to add to this function....
> > 
> > I agree, xfs_inodegc_force is a heavyweight operation to add to statvfs
> > and (further down) the quota reporting ioctl.  I added these calls to
> > maintain the user-visible behavior that one can df a mount, rm -rf a
> > 30T directory tree, df again, and observe a 30T difference in available
> > space between the two df calls.
> >
> > There are a lot of fstests that require this kind of behavior to pass.
> > In my internal testing without this bit applied, I also got complaints
> > about breaking the user-behavior of XFS that people have gotten used to.
> 
> Yeah, that's messy, but I see a potential problem here with space
> monitoring apps that poll the filesystem frequently to check space
> usage. That's going to override whatever your background "do work"
> setting is going to be...
> 
> > Earlier revisions of this patchset tried to maintain counts of the
> > resources used by the inactivated inode so that we could adjust the
> > values reported by statvfs and the quota reporting ioctl.  This meant we
> > didn't have to delay either call at all, but it turns out that it's
> > not feasible to maintain an accurate count of inactive resources because
> > any resources that are shared at destroy_inode time cannot become part
> > of this liar counter and consulting the refcountbt to decide which
> > extents should be added just makes unlinking even slower.  Worse yet,
> > unsharing of shared blocks attached to queued inactive inodes implies
> > either that we have to update the liar counter or that we have to be ok
> > with the free block count fluctuating for a while after a deletion if
> > that deletion ends up freeing more space than the liar counter thinks
> > we can free by flushing inactivation.
> 
> So the main problem is block accounting. Non-reflink stuff is easy
> (the equivalent of delalloc accounting) but reflink is hard.
> 
> > Hmm, maybe this could maintain an approximate liar counter and only flush
> > inactivation when the liar counter would cause us to be off by more than
> > some configurable amount?  The fstests that care about free space
> > accounting are not going to be happy since they are measured with very
> > tight tolerances.
> 
> I'd prefer something that doesn't require a magic heuristic. I don't
> have any better ideas right now, so let's just go with what you have
> and see what falls out...

Ok.  I'll leave a comment to this effect.

> > > > @@ -233,6 +242,94 @@ xfs_inode_clear_reclaim_tag(
> > > >  	xfs_perag_clear_reclaim_tag(pag);
> > > >  }
> > > >  
> > > > +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> > > > +static void
> > > > +xfs_inodegc_queue(
> > > > +	struct xfs_mount        *mp)
> > > > +{
> > > > +	rcu_read_lock();
> > > > +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG))
> > > > +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
> > > > +				2 * HZ);
> > > > +	rcu_read_unlock();
> > > > +}
> > > 
> > > Why half a second and not something referenced against the inode
> > > reclaim/sync period?
> > 
> > It's actually 2 seconds, and the next patch adds a knob to tweak the
> > default value.
> 
> Ugh, 2 * HZ != 2Hz. Stupid bad generic timer code, always trips me
> over.
> 
> > The first version of this patchset from 2017 actually did just use
> > (6 * xfs_syncd_centisecs / 10) like reclaim does.  This turned out to be
> > pretty foolish because that meant that reclaim and inactivation would
> > start at the same time, and because inactivation is slow, most of them
> > would miss the reclaim window and sit around pointlessly until the
> > next one.
> > 
> > The next iteration from mid 2019 changed this to (xfs_syncd_centisecs/5)
> > which fixed that, but large deltree storms could lead to so many inodes
> > being inactivated that we'd still miss the reclaim window sometimes.
> > Around this time I got my djwong-dev tree hooked up to the ktest robot
> > and it started complaining about performance regressions and noticeably
> > higher slab usage for xfs inodes and log items.
> 
> Right, I was thinking more along the lines of "run inactivation
> twice for every background inode reclaim pass". It's clear that what
> you were struggling with was that the interaction between the two
> running at similar periods is not good, and hence no matter what the
> background reclaim period is, we should process inactivated inodes
> at least a couple of times per reclaim period...
> 
> > The next time I got back to this was shortly after Dave cleaned up the
> > reclaim behavior (2020) to be driven by the AIL, which mostly fixed the
> > performance complaints, except for the one about AIM7.  I was intrigued
> > enough by this to instrument the patchset and fstests and the fstests
> > cloud hosts <cough> to see if I could derive a reasonable default value.
> > 
> > I've observed through experimentation that 2 seconds seems like a good
> > default value -- it's long enough to enable a lot of batching of
> > inactive inodes, but short enough that the background thread can
> > throttle the foreground threads by competing for the log grant heads.
> 
> Right, it ends up about 2x per reclaim period by default. :)
> 
> > I also noticed that the amount of overhead introduced by background
> > inactivation (as measured by fstests run times and other <cough>
> > performance tests) ranged from minimal at 0 seconds to about 20% at
> > (6*xfs_syncd_centisecs/10).
> 
> Which is about 20s period. yeah, that's way too long...
> 
> > Honestly, this could just be zero.  Assuming your distro has power
> > efficient workqueues enabled, the ~4-10ms delay introduced by that is
> > enough to realize some batching advantage with zero noticeable effect on
> > performance.
> 
> Yeah, the main benefit is moving it into the background so that the
> syscall completion isn't running the entire inode inactivation pass.
> That moves almost 50% of the unlink processing off to another thread
> which is what we want for rm -rf workloads. Keeping the batch size
> small is probably the best place to start with this - just enough
> inodes to keep a CPU busy for a scheduler tick?

Yeah, I'll set it to a tick ... in the next patch, when we actually set
a real delay.
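
For anyone else tripped up by the units: the delay passed to
queue_delayed_work() is in jiffies, so 2 * HZ is two seconds no matter
what HZ the kernel was built with, and a single tick is 1000 / HZ
milliseconds.  A minimal userspace sketch of that arithmetic, where
ticks_to_msecs() is a made-up helper for illustration (not the kernel's
jiffies_to_msecs()):

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a delay in timer ticks (jiffies) into
 * milliseconds for a kernel built with the given HZ.
 */
unsigned long
ticks_to_msecs(
	unsigned long	ticks,
	unsigned long	hz)
{
	assert(hz > 0);
	return ticks * 1000 / hz;
}
```

With HZ=250 a one-tick delay comes out at 4ms, and with HZ=100 at 10ms.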

> 
> > > >  static void
> > > >  xfs_inew_wait(
> > > >  	struct xfs_inode	*ip)
> > > > @@ -298,6 +395,13 @@ xfs_iget_check_free_state(
> > > >  	struct xfs_inode	*ip,
> > > >  	int			flags)
> > > >  {
> > > > +	/*
> > > > +	 * Unlinked inodes awaiting inactivation must not be reused until we
> > > > +	 * have a chance to clear the on-disk metadata.
> > > > +	 */
> > > > +	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
> > > > +		return -ENOENT;
> > > > +
> > > >  	if (flags & XFS_IGET_CREATE) {
> > > >  		/* should be a free inode */
> > > >  		if (VFS_I(ip)->i_mode != 0) {
> > > 
> > > How do we get here with an XFS_NEED_INACTIVE inode?
> > > xfs_iget_check_free_state() is only called from the cache miss path,
> > 
> > You added it to xfs_iget_cache_hit in 2018, commit afca6c5b2595f...
> 
> Oh, cscope fail:
> 
>   File             Function                  Line
> 0 xfs/xfs_icache.c xfs_iget_check_free_state 297 xfs_iget_check_free_state(
> 1 xfs/xfs_icache.c __releases                378 error = xfs_iget_check_free_state(ip, flags);
> 2 xfs/xfs_icache.c xfs_iget_cache_miss       530 error = xfs_iget_check_free_state(ip, flags);
> 
> "__releases" is a sparse annotation, so it didn't trigger that this
> was actually in xfs_iget_cache_hit()...
> 
> Never mind...
> 
> > > > @@ -713,6 +904,43 @@ xfs_icache_inode_is_allocated(
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +/*
> > > > + * Grab the inode for inactivation exclusively.
> > > > + * Return true if we grabbed it.
> > > > + */
> > > > +static bool
> > > > +xfs_inactive_grab(
> > > > +	struct xfs_inode	*ip)
> > > > +{
> > > > +	ASSERT(rcu_read_lock_held());
> > > > +
> > > > +	/* quick check for stale RCU freed inode */
> > > > +	if (!ip->i_ino)
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * The radix tree lock here protects a thread in xfs_iget from racing
> > > > +	 * with us starting reclaim on the inode.
> > > > +	 *
> > > > +	 * Due to RCU lookup, we may find inodes that have been freed and only
> > > > +	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
> > > > +	 * aren't candidates for reclaim at all, so we must check the
> > > > +	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> > > > +	 * Obviously if XFS_NEED_INACTIVE isn't set then we ignore this inode.
> > > > +	 */
> > > > +	spin_lock(&ip->i_flags_lock);
> > > > +	if (!(ip->i_flags & XFS_NEED_INACTIVE) ||
> > > > +	    (ip->i_flags & XFS_INACTIVATING)) {
> > > > +		/* not a inactivation candidate. */
> > > > +		spin_unlock(&ip->i_flags_lock);
> > > > +		return false;
> > > > +	}
> > > > +
> > > > +	ip->i_flags |= XFS_INACTIVATING;
> > > > +	spin_unlock(&ip->i_flags_lock);
> > > > +	return true;
> > > > +}
> > > > +
> > > >  /*
> > > >   * The inode lookup is done in batches to keep the amount of lock traffic and
> > > >   * radix tree lookups to a minimum. The batch size is a trade off between
> > > > @@ -736,6 +964,9 @@ xfs_inode_walk_ag_grab(
> > > >  
> > > >  	ASSERT(rcu_read_lock_held());
> > > >  
> > > > +	if (flags & XFS_INODE_WALK_INACTIVE)
> > > > +		return xfs_inactive_grab(ip);
> > > > +
> > > 
> > > Hmmm. This doesn't actually grab the inode. It's an unreferenced
> > > inode walk, in a function that assumes that the grab() call returns
> > > a referenced inode. Why isn't this using the inode reclaim walk
> > > which is intended to walk unreferenced inodes?
> > 
> > Because I thought that some day you might want to rebase the inode
> > reclaim cleanups from 2019 and didn't want to slow either of us down by
> > forcing a gigantic rebase.  So I left the duplicative inode walk
> > functions.
> > 
> > FWIW these are currently separate functions with separate call sites in
> > xfs_inode_walk_ag since the "remove indirect calls from inode walk"
> > series made it more convenient to have a separate function for each tag.
> > 
> > As for the name ... reclaim also has a "grab" function even though it
> > walks unreferenced inodes.
> 
> Sure, but the reclaim code was always a special "unreferenced"
> lookup that just used the same code structure. It never mixed
> "igrab()" with unreferenced inode pinning...

Hmm well so long as I'm adding another patch to consolidate the reclaim
loop with xfs_inode_walk, maybe I'll just rename it to
"selected_for_walk()" so then the code will read:

	if (done || !selected_for_walk(tag, ip))
		batch[i] = NULL;

> > > > +xfs_inactive_inode(
> > > > +	struct xfs_inode	*ip,
> > > > +	void			*args)
> > > > +{
> > > > +	struct xfs_eofblocks	*eofb = args;
> > > > +	struct xfs_perag	*pag;
> > > > +
> > > > +	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);
> > > 
> > > What condition is this trying to catch? It's something to do with
> > > freeze, but you haven't documented what happens to inodes with
> > > pending inactivation when a freeze is started....
> > 
> > Inactivation creates transactions, which means that we should never be
> > running this at FREEZE_FS time.  IOWs, it's a check that we can never
> > stall a kernel thread indefinitely because the fs is frozen.
> 
> What's the problem with doing that to a dedicated worker thread?  We
> currently stall inactivation on a frozen filesystem if a transaction
> is required

It seems unnecessary to wedge a worker thread like that when I could
just cancel the work and reschedule it after the freeze...

> > We can continue to queue inodes for inactivation on a frozen filesystem,
> > and I was trying to avoid touching the umount lock in
> > xfs_perag_set_inactive_tag to find out if the fs is actually frozen and
> > therefore we shouldn't call xfs_inodegc_queue.
> 
> I think stopping background inactivation for frozen filesystems makes
> more sense than this...

...oh hey, you seem to have reached the same conclusion. :)

> > > > +
> > > > +	/*
> > > > +	 * Not a match for our passed in scan filter?  Put it back on the shelf
> > > > +	 * and move on.
> > > > +	 */
> > > > +	spin_lock(&ip->i_flags_lock);
> > > > +	if (!xfs_inode_matches_eofb(ip, eofb)) {
> > > > +		ip->i_flags &= ~XFS_INACTIVATING;
> > > > +		spin_unlock(&ip->i_flags_lock);
> > > > +		return 0;
> > > > +	}
> > > > +	spin_unlock(&ip->i_flags_lock);
> > > 
> > > IDGI. What do EOF blocks have to do with running inode inactivation
> > > on this inode?
> > 
> > This enables foreground threads that hit EDQUOT to look for inodes to
> > inactivate in order to free up quota'd resources.
> 
> Not very obvious - better comment, please?

	/*
	 * Foreground threads that have hit ENOSPC or EDQUOT are allowed
	 * to pass in an eofb structure to look for inodes to inactivate
	 * immediately to free some resources.  If this inode isn't a
	 * match, put it back on the shelf and move on.
	 */

Better?

> > > I can't tell why this is necessary given what
> > > xfs_unmount_flush_inodes() does. Or, alternatively, why
> > > xfs_unmount_flush_inodes() can do what it does without caring about
> > > per-ag space reservations....
> > > 
> > > > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > > > index ca1b57d291dc..0f9a1450fe0e 100644
> > > > --- a/fs/xfs/xfs_qm_syscalls.c
> > > > +++ b/fs/xfs/xfs_qm_syscalls.c
> > > > @@ -104,6 +104,12 @@ xfs_qm_scall_quotaoff(
> > > >  	uint			inactivate_flags;
> > > >  	struct xfs_qoff_logitem	*qoffstart = NULL;
> > > >  
> > > > +	/*
> > > > +	 * Clean up the inactive list before we turn quota off, to reduce the
> > > > +	 * amount of quotaoff work we have to do with the mutex held.
> > > > +	 */
> > > > +	xfs_inodegc_force(mp);
> > > > +
> > > 
> > > Hmmm. why not just stop background inactivation altogether while
> > > quotaoff runs? i.e. just do normal, inline inactivation when
> > > quotaoff is running, and then we can get rid of the whole "drop
> > > dquot references" issue that background inactivation has...
> > 
> > I suppose that would have an advantage that quotaoff could switch to
> > foreground inactivation, flush the pending inactivation work to release
> > the dquot references, and then dqflush_all to dump the dquots
> > altogether.
> > 
> > How do we add the ability to switch behaviors, though?  The usual percpu
> > rwsem that protects a flag?
> 
> That's overkill.  Global synchronisation doesn't need complex
> structures, just a low cost reader path.
> 
> All we need is an atomic bit that we can test via test_bit().
> test_bit() is not a locked operation, but it is atomic. Hence most
> of the time it is a shared cacheline and hence has near zero cost to
> check as it can be shared across all CPUs.
> 
> Set the flag to turn off background inactivation, then all future
> inactivations will be foreground. Then flush and stop the inodegc
> work queue.  When we finish processing the last inactivated inode,
> the background work stops (i.e. it is not requeued).  No more
> pending background work.
> 
> Clear the flag to turn background inactivation back on. The first
> inode queued will restart that background work...
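
To make sure I follow, here is a userspace sketch of that flag using C11
atomics in place of test_bit()/set_bit()/clear_bit().  All of the names
(inodegc_state, INODEGC_BG_DISABLED, inodegc_inactivate, and so on) are
invented for illustration, and the two counters merely stand in for the
real inactivation work:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define INODEGC_BG_DISABLED	(1UL << 0)	/* hypothetical opflag bit */

struct inodegc_state {
	atomic_ulong	opflags;	/* read far more often than written */
	unsigned long	nr_background;	/* inodes handed to the worker */
	unsigned long	nr_foreground;	/* inodes inactivated inline */
};

/* Lockless read-side check, standing in for test_bit(). */
bool
inodegc_background_enabled(
	struct inodegc_state	*gc)
{
	return !(atomic_load_explicit(&gc->opflags, memory_order_relaxed) &
		 INODEGC_BG_DISABLED);
}

/* Turn background inactivation off; the caller then flushes the queue. */
void
inodegc_stop(
	struct inodegc_state	*gc)
{
	atomic_fetch_or(&gc->opflags, INODEGC_BG_DISABLED);
}

/* Turn it back on; the next inode queued goes to the background again. */
void
inodegc_start(
	struct inodegc_state	*gc)
{
	atomic_fetch_and(&gc->opflags, ~INODEGC_BG_DISABLED);
}

/* Pick foreground or background inactivation for one inode. */
void
inodegc_inactivate(
	struct inodegc_state	*gc)
{
	if (inodegc_background_enabled(gc))
		gc->nr_background++;
	else
		gc->nr_foreground++;
}
```

The common path is a single relaxed load and never writes to opflags;
only inodegc_stop() and inodegc_start() modify it.
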
> 
> > > > @@ -1720,6 +1749,13 @@ xfs_remount_ro(
> > > >  		return error;
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * Perform all on-disk metadata updates required to inactivate inodes.
> > > > +	 * Since this can involve finobt updates, do it now before we lose the
> > > > +	 * per-AG space reservations.
> > > > +	 */
> > > > +	xfs_inodegc_force(mp);
> > > 
> > > Should we stop background inactivation, because we can't make
> > > modifications anymore and hence background inactivation makes little
> > > sense...
> > 
> > We don't actually stop background gc transactions or other internal
> > updates on readonly filesystems
> 
> Yes we do - that's what xfs_blockgc_stop() higher up in this
> function does. xfs_log_clean() further down in the function also
> stops the background log work (that covers the log when idle)
> because xfs_remount_ro() leaves the log clean.
> 
> These all get restarted in xfs_remount_rw()....
> 
> > -- the ro part means only that we don't
> > let /userspace/ change anything directly.  If you open a file readonly,
> > unlink it, freeze the fs, and close the file, we'll still free it.
> 
> How do you unlink the file on a RO mount?

I got confused here.  If you open a file readonly on a rw mount, unlink
it, remount the fs readonly, and close the file, we'll still free it.

> And if it's a rw mount that is frozen, it will block on the first
> transaction in the inactivation process from close(), and block
> there until the filesystem is unfrozen.
> 
> It's pretty clear to me that we want frozen filesystems to
> turn off background inactivation so that we can block things like
> this in the syscall context and not have to deal with the complexity
> of freeze or read-only mounts in the background inactivation code at
> all..

Ok, will do.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 11/11] xfs: create a polled function to force inode inactivation
  2021-03-23 22:31   ` Dave Chinner
@ 2021-03-24  3:34     ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-24  3:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 24, 2021 at 09:31:58AM +1100, Dave Chinner wrote:
> On Wed, Mar 10, 2021 at 07:06:41PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Create a polled version of xfs_inactive_force so that we can force
> > inactivation while holding a lock (usually the umount lock) without
> > tripping over the softlockup timer.  This is for callers that hold vfs
> > locks while calling inactivation, which is currently unmount, iunlink
> > processing during mount, and rw->ro remount.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |   38 +++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_icache.h |    1 +
> >  fs/xfs/xfs_mount.c  |    2 +-
> >  fs/xfs/xfs_mount.h  |    5 +++++
> >  fs/xfs/xfs_super.c  |    3 ++-
> >  5 files changed, 46 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index d5f580b92e48..9db2beb4e732 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -25,6 +25,7 @@
> >  #include "xfs_ialloc.h"
> >  
> >  #include <linux/iversion.h>
> > +#include <linux/nmi.h>
> 
> This stuff goes in fs/xfs/xfs_linux.h, not here.
> 
> >  
> >  /*
> >   * Allocate and initialise an xfs_inode.
> > @@ -2067,8 +2068,12 @@ xfs_inodegc_free_space(
> >  	struct xfs_mount	*mp,
> >  	struct xfs_eofblocks	*eofb)
> >  {
> > -	return xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
> > +	int			error;
> > +
> > +	error = xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE,
> >  			xfs_inactive_inode, eofb, XFS_ICI_INACTIVE_TAG);
> > +	wake_up(&mp->m_inactive_wait);
> > +	return error;
> >  }
> >  
> >  /* Try to get inode inactivation moving. */
> > @@ -2138,6 +2143,37 @@ xfs_inodegc_force(
> >  	flush_workqueue(mp->m_gc_workqueue);
> >  }
> >  
> > +/*
> > + * Force all inode inactivation work to run immediately, and poll until the
> > + * work is complete.  Callers should only use this function if they must
> > + * inactivate inodes while holding VFS locks, and must be prepared to prevent
> > + * or to wait for inodes that are queued for inactivation while this runs.
> > + */
> > +void
> > +xfs_inodegc_force_poll(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_perag	*pag;
> > +	xfs_agnumber_t		agno;
> > +	bool			queued = false;
> > +
> > +	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
> > +		queued |= xfs_inodegc_force_pag(pag);
> > +	if (!queued)
> > +		return;
> > +
> > +	/*
> > +	 * Touch the softlockup watchdog every 1/10th of a second while there
> > +	 * are still inactivation-tagged inodes in the filesystem.
> > +	 */
> > +	while (!wait_event_timeout(mp->m_inactive_wait,
> > +				   !radix_tree_tagged(&mp->m_perag_tree,
> > +						      XFS_ICI_INACTIVE_TAG),
> > +				   HZ / 10)) {
> > +		touch_softlockup_watchdog();
> > +	}
> > +}
> 
> This looks like a deadlock waiting to be tripped over. As long as
> there is something still able to queue inodes for inactivation,
> that radix tree tag check will always trigger and put us back to
> sleep.

Yes, I know this is a total livelock vector.  This ugly function exists
to avoid stall warnings when the VFS has called us with s_umount held
and there are a lot of inodes to process.

As the function comment points out, callers must prevent anyone else
from inactivating inodes or be prepared to deal with the consequences,
which the current callers are prepared to do.

I can't think of a better way to handle this, since we need to bail out
of the workqueue flush periodically to make the softlockup thing happy.
Alternately we could just let the stall warnings happen and deal with
the people who file a bug for every stack trace they see the kernel emit.

> Also, in terms of workqueues, this is a "sync flush" because we
> are waiting for it. e.g. the difference between cancel_work() and
> cancel_work_sync() is that the latter waits for all the work in
> progress to complete before returning and the former doesn't wait...

Yeah, I'll change all the xfs_inodegc_force_* -> xfs_inodegc_flush_*.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 10/11] xfs: parallelize inode inactivation
  2021-03-23 22:21   ` Dave Chinner
@ 2021-03-24  3:52     ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-24  3:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 24, 2021 at 09:21:52AM +1100, Dave Chinner wrote:
> On Wed, Mar 10, 2021 at 07:06:36PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Split the inode inactivation work into per-AG work items so that we can
> > take advantage of parallelization.
> 
> How does this scale out when we have thousands of AGs?

Welllll... :)

> I'm guessing that the gc_workqueue has the default "unbound"
> parallelism that means it will run up to 4 kworkers per CPU at a
> time? Which means we could have hundreds of ags trying to hammer on
> inactivations at the same time? And so bash hard on the log and
> completely starve the syscall front end of log space?

Yep.  This is a blunt instrument to throttle the frontend when the
backend has too much queued.

> It seems to me that this needs to bound the amount of concurrent
> work to quite low numbers - even though it is per-ag, we do not want
> this to swamp the system in kworkers blocked on log reservations
> when such concurrency is not necessary.

Two months ago, I /did/ propose limiting the parallelism of those
unbound workqueues to an estimate of what the data device could
handle[1], and you said on IRC[2]:

[1] https://lore.kernel.org/linux-xfs/161040739544.1582286.11068012972712089066.stgit@magnolia/T/#ma0cd1bf1447ccfb66d615cab624c8df67d17f9b0

[2] (14:01:26) dchinner: "Assume parallelism is equal to number of disks"?

(14:02:22) dchinner: For spinning disks we want more parallelism than
that to hide seek latency - we want multiple IOs per disk so that the
elevator can re-order them and minimise seek distances across a set of
IOs

(14:02:37) dchinner: that can't be done if we are only issuing a single
IO per disk at a time

(14:03:30) djwong: 2 per spinning disk?

(14:03:32) dchinner: The more IO you can throw at spinning disks, the
lower the average seek penalty for any given IO....

(14:04:01) dchinner: And then there is hardware raid with caches and NVRAM....

(14:04:25) dchinner: This is why I find this sort of knob "misguided"

(14:05:01) dchinner: the "best value" is going to change according to
workload, storage stack config and hardware

(14:05:48) dchinner: Even for SSDs, a thread per CPU is not enough
parallelism if we are doing blocking IO in each thread

(14:07:07) dchinner: The device concurrency is actually the CTQ depth of
the underlying hardware, because that's how many IOs we can keep in
flight at once...

(14:08:06) dchinner: so, yeah, I'm not a fan of having knobs to "tune"
concurrency

(14:09:55) dchinner: As long as we have "enough" for decent performance
on a majority of setups, even if it is "too much" for some cases, that
is better than trying to find some magic optimal number for everyone....

(14:10:16) djwong: so do we simply let the workqueues spawn however many
threads and keep the bottleneck at the storage?

(14:10:39) djwong: (or log grant)

(14:11:08) dchinner: That's the idea - the concurrency backs up at the
serialisation point in the stack

(14:11:23) djwong: for blockgc and inactivation i don't think that's a
huge deal since we're probably going to run out of AGs or log space
anyway

(14:11:25) dchinner: that's actually valuable information if you are
doing performance evaluation

(14:11:51) dchinner: we know immediately where the concurrency
bottleneck is....

(14:13:15) dchinner: backing up in xfs-conv indicates that we're either
running out of log space, the IO completion is contending on inode locks
with concurrent IO submission, etc

(14:14:13) dchinner: and if it's the xfs-buf kworkers that are going
crazy, we know it's metadata IO rather than user data IO that is having
problems....

(14:15:27) dchinner: seeing multiple active xfs-cil worker threads
indicates pipelined concurrent pushes being active, implying either the
CIL is filling faster than it can be pushed or there are lots of
fsync()s being issued

(14:16:57) dchinner: so, yeah, actually being able to see excessive
concurrency at the kworker level just from the process listing tells us
a lot from an analysis POV....

---

Now we have unrestricted unbound workqueues, and I'm definitely
getting to collect data on contention bottlenecks -- when there are a
lot of small files, AFAICT we mostly end up contending on the grant
heads, and when we have heavily fragmented images to kill off then it
tends to shift to the AG buffer locks.

So how do we estimate a reasonable upper bound on the number of workers?
Given that most of the gc workers will probably be contending on
AG[FI] buffer locks I guess we could say min(agcount, nrcpus)?
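
As a strawman, something like the following, where
inodegc_max_workers() is a made-up name and clamping the result to at
least one worker is my own assumption:

```c
#include <assert.h>

/*
 * Hypothetical helper: cap the number of concurrent inodegc workers at
 * min(agcount, nr_cpus), but never report fewer than one worker.
 */
unsigned int
inodegc_max_workers(
	unsigned int	agcount,
	unsigned int	nr_cpus)
{
	unsigned int	max = agcount < nr_cpus ? agcount : nr_cpus;

	return max ? max : 1;
}
```

So a 4-AG filesystem on a 16-CPU box would get 4 workers, and a
1000-AG filesystem on an 8-CPU box would get 8.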

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-24  2:04         ` Darrick J. Wong
@ 2021-03-24  4:57           ` Dave Chinner
  2021-03-25  4:20             ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2021-03-24  4:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Mar 23, 2021 at 07:04:07PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 23, 2021 at 04:19:07PM +1100, Dave Chinner wrote:
> > On Mon, Mar 22, 2021 at 09:00:37PM -0700, Darrick J. Wong wrote:
> > > On Tue, Mar 23, 2021 at 12:44:17PM +1100, Dave Chinner wrote:
> > > > On Wed, Mar 10, 2021 at 07:06:13PM -0800, Darrick J. Wong wrote:
> > > > > +	/*
> > > > > +	 * Not a match for our passed in scan filter?  Put it back on the shelf
> > > > > +	 * and move on.
> > > > > +	 */
> > > > > +	spin_lock(&ip->i_flags_lock);
> > > > > +	if (!xfs_inode_matches_eofb(ip, eofb)) {
> > > > > +		ip->i_flags &= ~XFS_INACTIVATING;
> > > > > +		spin_unlock(&ip->i_flags_lock);
> > > > > +		return 0;
> > > > > +	}
> > > > > +	spin_unlock(&ip->i_flags_lock);
> > > > 
> > > > IDGI. What do EOF blocks have to do with running inode inactivation
> > > > on this inode?
> > > 
> > > This enables foreground threads that hit EDQUOT to look for inodes to
> > > inactivate in order to free up quota'd resources.
> > 
> > Not very obvious - better comment, please?
> 
> 	/*
> 	 * Foreground threads that have hit ENOSPC or EDQUOT are allowed
> 	 * to pass in an eofb structure to look for inodes to inactivate
> 	 * immediately to free some resources.  If this inode isn't a
> 	 * match, put it back on the shelf and move on.
> 	 */
> 
> Better?

Yes.

> > > > > +	/*
> > > > > +	 * Perform all on-disk metadata updates required to inactivate inodes.
> > > > > +	 * Since this can involve finobt updates, do it now before we lose the
> > > > > +	 * per-AG space reservations.
> > > > > +	 */
> > > > > +	xfs_inodegc_force(mp);
> > > > 
> > > > Should we stop background inactivation, because we can't make
> > > > modifications anymore and hence background inactivation makes little
> > > > sense...
> > > 
> > > We don't actually stop background gc transactions or other internal
> > > updates on readonly filesystems
> > 
> > Yes we do - that's what xfs_blockgc_stop() higher up in this
> > function does. xfs_log_clean() further down in the function also
> > stops the background log work (that covers the log when idle)
> > because xfs_remount_ro() leaves the log clean.
> > 
> > These all get restarted in xfs_remount_rw()....
> > 
> > > -- the ro part means only that we don't
> > > let /userspace/ change anything directly.  If you open a file readonly,
> > > unlink it, freeze the fs, and close the file, we'll still free it.
> > 
> > How do you unlink the file on a RO mount?
> 
> I got confused here.  If you open a file readonly on a rw mount, unlink
> it, remount the fs readonly, and close the file, we'll still free it.

Not even that way. :)

You can't remount-ro while there are open-but-unlinked files. See
sb->s_remove_count. It's incremented when drop_nlink() drops the link
count to zero in the unlink() syscall, then decremented when
__destroy_inode() is called during inode eviction when the final
reference goes away. Hence while we have open but unlinked inodes in
active use, that superblock counter is non-zero.

In sb_prepare_remount_readonly() we have:

	if (atomic_long_read(&sb->s_remove_count))
		return -EBUSY;

So a remount-ro will fail with -EBUSY while there are open but
unlinked files.
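
The accounting can be modelled in a few lines of userspace C.  This is
a toy: the toy_* names are made up, the real counter is an
atomic_long_t in struct super_block, and the real checks live in the
VFS:

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-ins for struct super_block and struct inode. */
struct toy_sb {
	long		s_remove_count;	/* unlinked inodes still in memory */
	int		readonly;
};

struct toy_inode {
	struct toy_sb	*i_sb;
	unsigned int	i_nlink;
};

/* unlink(): dropping the last link bumps the superblock counter. */
void
toy_drop_nlink(
	struct toy_inode	*inode)
{
	assert(inode->i_nlink > 0);
	if (--inode->i_nlink == 0)
		inode->i_sb->s_remove_count++;
}

/* Final eviction: an inode that was unlinked gives its count back. */
void
toy_destroy_inode(
	struct toy_inode	*inode)
{
	if (inode->i_nlink == 0)
		inode->i_sb->s_remove_count--;
}

/* Mirrors the -EBUSY check in sb_prepare_remount_readonly(). */
int
toy_remount_ro(
	struct toy_sb		*sb)
{
	if (sb->s_remove_count)
		return -EBUSY;
	sb->readonly = 1;
	return 0;
}
```

Unlink an open file and toy_remount_ro() returns -EBUSY until
toy_destroy_inode() has run for it.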

Except, of course, if you are doing an emergency remount-ro from
sysrq, in which case these open-but-unlinked checks are not done,
but when we are forcing the fs to be read-only this way, it's not
being done for correctness (i.e. the system is about to be shot down)
so we don't really care...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-23  4:00     ` Darrick J. Wong
  2021-03-23  5:19       ` Dave Chinner
@ 2021-03-24 17:53       ` Christoph Hellwig
  2021-03-25  4:26         ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-24 17:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Mon, Mar 22, 2021 at 09:00:37PM -0700, Darrick J. Wong wrote:
> Hmm, maybe this could maintain an approximate liar counter and only flush
> inactivation when the liar counter would cause us to be off by more than
> some configurable amount?  The fstests that care about free space
> accounting are not going to be happy since they are measured with very
> tight tolerances.

Yes, I think some kind of fuzzy logic would be better than the
heavyweight flush on supposedly lightweight operations.
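
One way to picture the tolerance idea is the user-space sketch below.
It is not a proposal for the actual XFS code: the `model_fs` struct,
its fields and the three helpers are all hypothetical, and the
"flush" simply folds the pending amount into the reported value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of free-space reporting with deferred inactivation. */
struct model_fs {
	uint64_t reported_free;	/* free blocks already accounted for */
	uint64_t pending_free;	/* blocks still held by inodes awaiting inactivation */
	uint64_t tolerance;	/* largest error a report is allowed to carry */
};

/* Only force the expensive flush when the report would be too far off. */
static bool model_needs_flush(const struct model_fs *fs)
{
	return fs->pending_free > fs->tolerance;
}

/* Stand-in for flushing inactivation: pending blocks become free blocks. */
static void model_flush(struct model_fs *fs)
{
	fs->reported_free += fs->pending_free;
	fs->pending_free = 0;
}

/* Stand-in for a statfs-style query that tolerates a bounded error. */
static uint64_t model_statfs_free(struct model_fs *fs)
{
	if (model_needs_flush(fs))
		model_flush(fs);
	return fs->reported_free;
}
```

With a tolerance of 10 blocks, a query with 5 blocks pending returns
the stale value without flushing, while a query with 11 blocks pending
flushes first.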

> > static void
> > xfs_inode_clear_tag(
> > 	struct xfs_perag	*pag,
> > 	xfs_ino_t		ino,
> > 	int			tag)
> > {
> > 	struct xfs_mount	*mp = pag->pag_mount;
> > 
> > 	lockdep_assert_held(&pag->pag_ici_lock);
> > 	radix_tree_tag_clear(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ino),
> > 				tag);
> > 	switch(tag) {
> > 	case XFS_ICI_INACTIVE_TAG:
> > 		if (--pag->pag_ici_inactive)
> > 			return;
> > 		break;
> > 	case XFS_ICI_RECLAIM_TAG:
> > 		if (--pag->pag_ici_reclaim)
> > 			return;
> > 		break;
> > 	default:
> > 		ASSERT(0);
> > 		return;
> > 	}
> > 
> > 	spin_lock(&mp->m_perag_lock);
> > 	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno, tag);
> > 	spin_unlock(&mp->m_perag_lock);
> > }
> > 
> > As a followup patch? The set tag case looks similarly easy to make
> > generic...
> 
> Yeah.  At this point I might as well just clean all of this up for the
> next revision of this series, because as I said earlier I had thought
> that you were still working on a second rework of reclaim.  Now that I
> know you're not, I'll hack away at this twisty pile too.

If the separate tags aren't going to disappear entirely: it would be nice
to move the counters (or any other duplicated variable) into an array
indexed by the tag, which would clean up the above and similar code even more.
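
As a rough user-space sketch of what that could look like (the
`model_*` names, the tag numbering and the struct layout are
hypothetical, not the real xfs_perag definition):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tag numbering; the real XFS_ICI_*_TAG values may differ. */
enum model_ici_tag {
	MODEL_ICI_RECLAIM_TAG,
	MODEL_ICI_INACTIVE_TAG,
	MODEL_ICI_NR_TAGS,
};

/* Hypothetical stand-in for struct xfs_perag: one counter per tag. */
struct model_perag {
	unsigned int ici_count[MODEL_ICI_NR_TAGS];
};

/*
 * Account for one more tagged inode.  Returns true on the 0 -> 1
 * transition, i.e. when the caller would also tag the perag tree.
 */
static bool model_perag_set_tag(struct model_perag *pag, enum model_ici_tag tag)
{
	assert((unsigned int)tag < MODEL_ICI_NR_TAGS);
	return pag->ici_count[tag]++ == 0;
}

/*
 * Account for one fewer tagged inode.  Returns true on the 1 -> 0
 * transition, i.e. when the caller would also clear the perag tree tag.
 */
static bool model_perag_clear_tag(struct model_perag *pag, enum model_ici_tag tag)
{
	assert((unsigned int)tag < MODEL_ICI_NR_TAGS);
	assert(pag->ici_count[tag] > 0);
	return --pag->ici_count[tag] == 0;
}
```

The switch statement on the tag disappears because the tag itself
selects the counter.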

> We don't actually stop background gc transactions or other internal
> updates on readonly filesystems -- the ro part means only that we don't
> let /userspace/ change anything directly.  If you open a file readonly,
> unlink it, freeze the fs, and close the file, we'll still free it.

Note that there are two different read-only concepts in Linux:

 1) the read-only mount, as reflected in the vfsmount.  For this your
    description above is spot-on
 2) the read-only superblock, as indicated by the sb flag.  This is
    usually due to a read-only block device, and we must not write
    anything to the device, as that typically will lead to an I/O error.
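
The distinction can be pictured with the user-space sketch below. The
structs, field names and helpers are hypothetical and only model the
two rules in the list above, not the actual VFS data structures.

```c
#include <stdbool.h>

/* Hypothetical model of a superblock: read-only here means "never write". */
struct model_sb {
	bool sb_rdonly;
};

/* Hypothetical model of one mount of that superblock. */
struct model_mount {
	struct model_sb *sb;
	bool mnt_readonly;	/* read-only as seen through this mount only */
};

/* Case 1: userspace may modify files only if neither level is read-only. */
static bool model_user_may_write(const struct model_mount *mnt)
{
	return !mnt->mnt_readonly && !mnt->sb->sb_rdonly;
}

/* Case 2: internal updates (background gc, log) depend on the sb alone. */
static bool model_fs_may_write(const struct model_sb *sb)
{
	return !sb->sb_rdonly;
}
```

A read-only mount on a writable superblock blocks userspace but leaves
model_fs_may_write() true; a read-only superblock blocks both.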


* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-24  4:57           ` Dave Chinner
@ 2021-03-25  4:20             ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-25  4:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 24, 2021 at 03:57:06PM +1100, Dave Chinner wrote:
> On Tue, Mar 23, 2021 at 07:04:07PM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 23, 2021 at 04:19:07PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 22, 2021 at 09:00:37PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Mar 23, 2021 at 12:44:17PM +1100, Dave Chinner wrote:
> > > > > On Wed, Mar 10, 2021 at 07:06:13PM -0800, Darrick J. Wong wrote:
> > > > > > +	/*
> > > > > > +	 * Not a match for our passed in scan filter?  Put it back on the shelf
> > > > > > +	 * and move on.
> > > > > > +	 */
> > > > > > +	spin_lock(&ip->i_flags_lock);
> > > > > > +	if (!xfs_inode_matches_eofb(ip, eofb)) {
> > > > > > +		ip->i_flags &= ~XFS_INACTIVATING;
> > > > > > +		spin_unlock(&ip->i_flags_lock);
> > > > > > +		return 0;
> > > > > > +	}
> > > > > > +	spin_unlock(&ip->i_flags_lock);
> > > > > 
> > > > > IDGI. What do EOF blocks have to do with running inode inactivation
> > > > > on this inode?
> > > > 
> > > > This enables foreground threads that hit EDQUOT to look for inodes to
> > > > inactivate in order to free up quota'd resources.
> > > 
> > > Not very obvious - better comment, please?
> > 
> > 	/*
> > 	 * Foreground threads that have hit ENOSPC or EDQUOT are allowed
> > 	 * to pass in an eofb structure to look for inodes to inactivate
> > 	 * immediately to free some resources.  If this inode isn't a
> > 	 * match, put it back on the shelf and move on.
> > 	 */
> > 
> > Better?
> 
> Yes.
> 
> > > > > > +	/*
> > > > > > +	 * Perform all on-disk metadata updates required to inactivate inodes.
> > > > > > +	 * Since this can involve finobt updates, do it now before we lose the
> > > > > > +	 * per-AG space reservations.
> > > > > > +	 */
> > > > > > +	xfs_inodegc_force(mp);
> > > > > 
> > > > > Should we stop background inactivation, because we can't make
> > > > > modifications anymore and hence background inactivation makes little
> > > > > sense...

Ahhh, now I remember why the blockgc and inodegc workers call
sb_start_write before running any transactions.  We don't want the
threads to stall on transaction allocation when the fs is at FREEZE_FS,
which means that we have to cancel the work before we get there.  That
means it's too late to cancel the work items in xfs_fs_freeze.

We can't cancel the work items from a ->freeze_super handler before
calling freeze_super(), because we haven't taken any locks yet, and
we're still unfrozen.

For blockgc I solved this problem by making the worker get FREEZE_WRITE
protection so that we can't freeze the fs until the work is done.  Then
we don't have to care that much about ensuring that the worker threads
cannot run while the fs is frozen.  But that's a bit sloppy, since
they're still consuming CPU time.
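
A toy single-threaded model of that arrangement is below. It is only
a sketch: the `model_*` names are hypothetical, and where the real
sb_start_write() and freeze would block and wait for each other, the
model simply reports failure so that it can run without threads.

```c
#include <stdbool.h>

/* Hypothetical model of the superblock's write-level freeze protection. */
struct model_sb {
	unsigned int writers;	/* workers currently holding write protection */
	bool frozen;
};

/* Models taking write protection; refused once the fs is frozen. */
static bool model_start_write(struct model_sb *sb)
{
	if (sb->frozen)
		return false;
	sb->writers++;
	return true;
}

/* Models dropping write protection when the worker is done. */
static void model_end_write(struct model_sb *sb)
{
	sb->writers--;
}

/* Models freezing: cannot complete while any worker holds protection. */
static bool model_try_freeze(struct model_sb *sb)
{
	if (sb->writers)
		return false;
	sb->frozen = true;
	return true;
}
```

The freeze cannot complete until the worker calls model_end_write(),
and no new work can start afterwards, which is the property the
paragraph above relies on.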

I could solve this problem by observing that freeze_super calls
sync_filesystem when the fs is in FREEZE_PAGEFAULTS and is about to move
to FREEZE_FS, but that seems ugly and hacky.

> > > > 
> > > > We don't actually stop background gc transactions or other internal
> > > > updates on readonly filesystems
> > > 
> > > Yes we do - that's what xfs_blockgc_stop() higher up in this
> > > function does. xfs_log_clean() further down in the function also
> > > stops the background log work (that covers the log when idle)
> > > because xfs_remount_ro() leaves the log clean.
> > > 
> > > These all get restarted in xfs_remount_rw()....
> > > 
> > > > -- the ro part means only that we don't
> > > > let /userspace/ change anything directly.  If you open a file readonly,
> > > > unlink it, freeze the fs, and close the file, we'll still free it.
> > > 
> > > How do you unlink the file on a RO mount?
> > 
> > I got confused here.  If you open a file readonly on a rw mount, unlink
> > it, remount the fs readonly, and close the file, we'll still free it.
> 
> Not even that way. :)
> 
> You can't remount-ro while there are open-but-unlinked files. See
> sb->s_remove_count. It's incremented when drop_nlink() drops the link
> count to zero in the unlink() syscall, then decremented when
> __destroy_inode() is called during inode eviction when the final
> reference goes away. Hence while we have open but unlinked inodes in
> active use, that superblock counter is non-zero.
> 
> In sb_prepare_remount_readonly() we have:
> 
> 	if (atomic_long_read(&sb->s_remove_count))
> 		return -EBUSY;
> 
> So a remount-ro will fail with -EBUSY while there are open but
> unlinked files.

Ah, ok.

> Except, of course, if you are doing an emergency remount-ro from
> sysrq, in which case these open-but-unlinked checks are not done,
> but when we are forcing the fs to be read-only this way, it's not
> being done for correctness (i.e. the system is about to be shot down)
> so we don't really care...

Well yes, most bets are off during emergency ro-remounts. :)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 06/11] xfs: deferred inode inactivation
  2021-03-24 17:53       ` Christoph Hellwig
@ 2021-03-25  4:26         ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-03-25  4:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Wed, Mar 24, 2021 at 05:53:11PM +0000, Christoph Hellwig wrote:
> On Mon, Mar 22, 2021 at 09:00:37PM -0700, Darrick J. Wong wrote:
> > Hmm, maybe this could maintain an approximate liar counter and only flush
> > inactivation when the liar counter would cause us to be off by more than
> > some configurable amount?  The fstests that care about free space
> > accounting are not going to be happy since they are measured with very
> > tight tolerances.
> 
> Yes, I think some kind of fuzzy logic would be better than the
> heavyweight flush on supposedly lightweight operations.

Any suggestions?  I'll try adding a ratelimit to see if that shuts up
fstests while preventing userspace from pounding too hard on
inactivation.

> > > static void
> > > xfs_inode_clear_tag(
> > > 	struct xfs_perag	*pag,
> > > 	xfs_ino_t		ino,
> > > 	int			tag)
> > > {
> > > 	struct xfs_mount	*mp = pag->pag_mount;
> > > 
> > > 	lockdep_assert_held(&pag->pag_ici_lock);
> > > 	radix_tree_tag_clear(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ino),
> > > 				tag);
> > > 	switch(tag) {
> > > 	case XFS_ICI_INACTIVE_TAG:
> > > 		if (--pag->pag_ici_inactive)
> > > 			return;
> > > 		break;
> > > 	case XFS_ICI_RECLAIM_TAG:
> > > 		if (--pag->pag_ici_reclaim)
> > > 			return;
> > > 		break;
> > > 	default:
> > > 		ASSERT(0);
> > > 		return;
> > > 	}
> > > 
> > > 	spin_lock(&mp->m_perag_lock);
> > > 	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno, tag);
> > > 	spin_unlock(&mp->m_perag_lock);
> > > }
> > > 
> > > As a followup patch? The set tag case looks similarly easy to make
> > > generic...
> > 
> > Yeah.  At this point I might as well just clean all of this up for the
> > next revision of this series, because as I said earlier I had thought
> > that you were still working on a second rework of reclaim.  Now that I
> > know you're not, I'll hack away at this twisty pile too.
> 
> If the separate tags aren't going to disappear entirely: it would be nice
> to move the counters (or any other duplicated variable) into an array
> indexed by the tag, which would clean up the above and similar code even more.

Ok done.

I refactored xfs_perag_{clear,set}_reclaim_tag into a generic helper
that sets an ICI tag on the inode radix tree and the perag radix tree.
This cleaned up a bunch of redundant code, and enabled me to trim down
the inactivation patch quite a bit.  Now each function that wants to set
inode flags does so directly (after taking the locks) and calls the ICI
helper to deal with the radix trees.

Also, refactoring reclaim to use xfs_inode_walk was pretty simple, and I
even integrated (rather heavily modified) code from the "void *args" ->
"eofb" and the "get rid of iter_flags" patches you posted.

> > We don't actually stop background gc transactions or other internal
> > updates on readonly filesystems -- the ro part means only that we don't
> > let /userspace/ change anything directly.  If you open a file readonly,
> > unlink it, freeze the fs, and close the file, we'll still free it.
> 
> Note that there are two different read-only concepts in Linux:
> 
>  1) the read-only mount, as reflected in the vfsmount.  For this your
>     description above is spot-on
>  2) the read-only superblock, as indicated by the sb flag.  This is
>     usually due to a read-only block device, and we must not write
>     anything to the device, as that typically will lead to an I/O error.

<nod> I /think/ we handle this properly, but it's late...

--D


end of thread, other threads:[~2021-03-25  4:27 UTC | newest]

Thread overview: 48+ messages
2021-03-11  3:05 [PATCHSET v3 00/11] xfs: deferred inode inactivation Darrick J. Wong
2021-03-11  3:05 ` [PATCH 01/11] xfs: prevent metadata files from being inactivated Darrick J. Wong
2021-03-11 13:05   ` Christoph Hellwig
2021-03-22 23:13   ` Dave Chinner
2021-03-11  3:05 ` [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks Darrick J. Wong
2021-03-11 13:09   ` Christoph Hellwig
2021-03-15 18:46   ` Christoph Hellwig
2021-03-18  4:33     ` Darrick J. Wong
2021-03-19  1:48       ` Darrick J. Wong
2021-03-11  3:05 ` [PATCH 03/11] xfs: don't reclaim dquots with incore reservations Darrick J. Wong
2021-03-15 18:29   ` Christoph Hellwig
2021-03-22 23:31   ` Dave Chinner
2021-03-23  0:01     ` Darrick J. Wong
2021-03-23  1:48       ` Dave Chinner
2021-03-11  3:06 ` [PATCH 04/11] xfs: decide if inode needs inactivation Darrick J. Wong
2021-03-15 18:47   ` Christoph Hellwig
2021-03-15 19:06     ` Darrick J. Wong
2021-03-11  3:06 ` [PATCH 05/11] xfs: rename the blockgc workqueue Darrick J. Wong
2021-03-15 18:49   ` Christoph Hellwig
2021-03-11  3:06 ` [PATCH 06/11] xfs: deferred inode inactivation Darrick J. Wong
2021-03-16  7:27   ` Christoph Hellwig
2021-03-16 15:47     ` Darrick J. Wong
2021-03-17 15:21       ` Christoph Hellwig
2021-03-17 15:49         ` Darrick J. Wong
2021-03-22 23:46           ` Dave Chinner
2021-03-22 23:37       ` Dave Chinner
2021-03-23  0:24         ` Darrick J. Wong
2021-03-23  1:44   ` Dave Chinner
2021-03-23  4:00     ` Darrick J. Wong
2021-03-23  5:19       ` Dave Chinner
2021-03-24  2:04         ` Darrick J. Wong
2021-03-24  4:57           ` Dave Chinner
2021-03-25  4:20             ` Darrick J. Wong
2021-03-24 17:53       ` Christoph Hellwig
2021-03-25  4:26         ` Darrick J. Wong
2021-03-11  3:06 ` [PATCH 07/11] xfs: expose sysfs knob to control inode inactivation delay Darrick J. Wong
2021-03-11  3:06 ` [PATCH 08/11] xfs: force inode inactivation and retry fs writes when there isn't space Darrick J. Wong
2021-03-15 18:54   ` Christoph Hellwig
2021-03-15 19:06     ` Darrick J. Wong
2021-03-11  3:06 ` [PATCH 09/11] xfs: force inode garbage collection before fallocate when space is low Darrick J. Wong
2021-03-11  3:06 ` [PATCH 10/11] xfs: parallelize inode inactivation Darrick J. Wong
2021-03-15 18:55   ` Christoph Hellwig
2021-03-15 19:03     ` Darrick J. Wong
2021-03-23 22:21   ` Dave Chinner
2021-03-24  3:52     ` Darrick J. Wong
2021-03-11  3:06 ` [PATCH 11/11] xfs: create a polled function to force " Darrick J. Wong
2021-03-23 22:31   ` Dave Chinner
2021-03-24  3:34     ` Darrick J. Wong
