* [PATCH 00/12] xfs: deferred inode inactivation
From: Darrick J. Wong @ 2019-01-01  2:16 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This is a new patch series implementing deferred inode inactivation.
Inactivation is the process of updating all on-disk metadata when a file
is deleted -- freeing the data/attr/COW fork extent allocations,
removing the inode from the unlinked hash, marking the inode record
itself free, and updating the inode btrees so that they show the inode
as not being in use.

Currently, all this inactivation is performed during in-core inode
reclaim, which creates two big headaches: first, this makes direct
memory reclamation /really/ slow, and second, it prohibits us from
partially freezing the filesystem for online fsck activity because scrub
can hit direct memory reclaim.  It's ok for scrub to fail with ENOMEM,
but it's not ok for scrub to deadlock memory reclaim. :)

The implementation will be familiar to those who have studied how XFS
scans for reclaimable in-core inodes -- we create a couple more inode
state flags to mark an inode as needing inactivation and being in the
middle of inactivation.  When inodes need inactivation, we set iflags,
set the RECLAIM radix tree tag, update a count of how many resources
will be freed by the pending inactivations, and schedule a deferred work
item.  The deferred work item scans the inode radix tree for inodes to
inactivate, and does all the on-disk metadata updates.  Once the inode
has been inactivated, it is left in the reclaim state and the background
reclaim worker (or direct reclaim) will get to it eventually.
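
Schematically, the lifecycle looks something like this (a simplified
sketch of the flag transitions just described, not actual patch code;
patch 8 contains the real state machine):

/*
 *   iput() drops the last reference
 *        |
 *        v
 *   XFS_IRECLAIMABLE | XFS_NEED_INACTIVE    RECLAIM tag set, pending
 *        |                                  resource counts updated,
 *        v                                  inactivation work scheduled
 *   XFS_INACTIVATING                        worker makes the on-disk
 *        |                                  updates: post-eof blocks,
 *        v                                  COW extents, unlinked list
 *   XFS_IRECLAIMABLE (alone)                background or direct reclaim
 *        |                                  frees the in-core inode as
 *        v                                  before
 *   (inode freed)
 */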

Patch 1 fixes fs freeze to clean out COW extents when possible.

Patches 2-3 refactor some of the inactivation predicates.

Patches 4-5 implement the count of blocks/quota that can be freed by
running inactivation; this is necessary to preserve the behavior where
you rm a file and the fs counters update immediately.

Patches 6-7 refactor more inode reclaim code so that we can reuse some
of it for inactivation.

Patch 8 delivers the core of the inactivation changes by altering the
inode lifetime state machine to include the new inode flags and
background workers.

Patches 9-10 make it so that if an allocation attempt hits ENOSPC, it
will force inactivation to free resources and try again.

Patch 11 converts the per-fs inactivation scanner to per-AG tracking so
that we can target our inactivation scans more precisely.

Patch 12 makes a process that is deleting a directory tree or a very
fragmented file wait for inactivation to finish, so that a deltree
cannot flood the system with inactive inodes.

If you're going to start using this mess, you probably ought to just
pull from my git trees.  The kernel patches[1] should apply against
4.20.  xfsprogs[2] and xfstests[3] can be found in their usual
places.  The git trees contain all four series' worth of changes.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel
[3] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel


* [PATCH 01/12] xfs: free COW staging extents when freezing filesystem
From: Darrick J. Wong @ 2019-01-01  2:16 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When we're freezing the filesystem, free all the COW staging extents
before we shut the log down so that we can minimize the amount of
recovery work that will be necessary during the next mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c |   17 ++++++++++++-----
 fs/xfs/xfs_super.c  |   18 ++++++++++++++++++
 2 files changed, 30 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 245483cc282b..36d986087abb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1741,6 +1741,7 @@ xfs_inode_free_cowblocks(
 	void			*args)
 {
 	struct xfs_eofblocks	*eofb = args;
+	uint			lock_mode = 0;
 	int			match;
 	int			ret = 0;
 
@@ -1761,9 +1762,15 @@ xfs_inode_free_cowblocks(
 			return 0;
 	}
 
-	/* Free the CoW blocks */
-	xfs_ilock(ip, XFS_IOLOCK_EXCL);
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	/*
+	 * Free the CoW blocks.  We don't need to lock the inode if we're in
+	 * the process of freezing the filesystem because we've already locked
+	 * out writes and page faults.
+	 */
+	if (ip->i_mount->m_super->s_writers.frozen == SB_UNFROZEN)
+		lock_mode = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	if (lock_mode)
+		xfs_ilock(ip, lock_mode);
 
 	/*
 	 * Check again, nobody else should be able to dirty blocks or change
@@ -1772,8 +1779,8 @@ xfs_inode_free_cowblocks(
 	if (xfs_prep_free_cowblocks(ip))
 		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
 
-	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
-	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	if (lock_mode)
+		xfs_iunlock(ip, lock_mode);
 
 	return ret;
 }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 6903b36eac5d..bb4953cfd683 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1102,6 +1102,24 @@ xfs_fs_sync_fs(
 	if (!wait)
 		return 0;
 
+	/*
+	 * If we're in the middle of freezing the filesystem, this is our last
+	 * chance to run regular transactions.  Clear all the COW staging
+	 * extents so that we can freeze the filesystem with as little recovery
+	 * work to do at the next mount as possible.  It's safe to do this
+	 * without locking inodes because the freezer code flushed all the
+	 * dirty data from the page cache and locked out writes and page faults.
+	 */
+	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
+		int		error;
+
+		error = xfs_icache_free_cowblocks(mp, NULL);
+		if (error) {
+			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+			return error;
+		}
+	}
+
 	xfs_log_force(mp, XFS_LOG_SYNC);
 	if (laptop_mode) {
 		/*

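For context, the SB_FREEZE_PAGEFAULT test above works because ->sync_fs
runs between the page-fault and transaction freeze levels.  Roughly (a
simplified sketch of freeze_super(); exact details vary by kernel
version):

/*
 * freeze_super()
 *   sb_wait_write(sb, SB_FREEZE_WRITE)      lock out new writers
 *   sb_wait_write(sb, SB_FREEZE_PAGEFAULT)  lock out new page faults
 *   sync_filesystem(sb)
 *     ->sync_fs(sb, wait=1)                 <-- xfs_fs_sync_fs: last
 *                                               chance to run regular
 *                                               transactions
 *   sb_wait_write(sb, SB_FREEZE_FS)         lock out new transactions
 *   ->freeze_fs(sb)
 */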

* [PATCH 02/12] xfs: refactor the predicate part of xfs_free_eofblocks
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor the part of _free_eofblocks that decides if it's really going
to truncate post-EOF blocks into a separate helper function.  The
upcoming deferred inode inactivation patch requires us to be able to
decide this prior to actual inactivation.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |  101 ++++++++++++++++++++----------------------------
 fs/xfs/xfs_inode.c     |   32 +++++++++++++++
 fs/xfs/xfs_inode.h     |    1 
 3 files changed, 75 insertions(+), 59 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1ee8c5539fa4..aa7f7fe241ec 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -781,78 +781,61 @@ xfs_free_eofblocks(
 	struct xfs_inode	*ip)
 {
 	struct xfs_trans	*tp;
-	int			error;
-	xfs_fileoff_t		end_fsb;
-	xfs_fileoff_t		last_fsb;
-	xfs_filblks_t		map_len;
-	int			nimaps;
-	struct xfs_bmbt_irec	imap;
 	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
 	/*
-	 * Figure out if there are any blocks beyond the end
-	 * of the file.  If not, then there is nothing to do.
+	 * If there are blocks after the end of file, truncate the file to its
+	 * current size to free them up.
 	 */
-	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
-	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
-	if (last_fsb <= end_fsb)
+	if (!xfs_inode_has_posteof_blocks(ip))
 		return 0;
-	map_len = last_fsb - end_fsb;
-
-	nimaps = 1;
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	/*
-	 * If there are blocks after the end of file, truncate the file to its
-	 * current size to free them up.
+	 * Attach the dquots to the inode up front.
 	 */
-	if (!error && (nimaps != 0) &&
-	    (imap.br_startblock != HOLESTARTBLOCK ||
-	     ip->i_delayed_blks)) {
-		/*
-		 * Attach the dquots to the inode up front.
-		 */
-		error = xfs_qm_dqattach(ip);
-		if (error)
-			return error;
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return error;
 
-		/* wait on dio to ensure i_size has settled */
-		inode_dio_wait(VFS_I(ip));
+	/* wait on dio to ensure i_size has settled */
+	inode_dio_wait(VFS_I(ip));
 
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0,
-				&tp);
-		if (error) {
-			ASSERT(XFS_FORCED_SHUTDOWN(mp));
-			return error;
-		}
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error) {
+		ASSERT(XFS_FORCED_SHUTDOWN(mp));
+		return error;
+	}
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		xfs_trans_ijoin(tp, ip, 0);
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
 
-		/*
-		 * Do not update the on-disk file size.  If we update the
-		 * on-disk file size and then the system crashes before the
-		 * contents of the file are flushed to disk then the files
-		 * may be full of holes (ie NULL files bug).
-		 */
-		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
-					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
-		if (error) {
-			/*
-			 * If we get an error at this point we simply don't
-			 * bother truncating the file.
-			 */
-			xfs_trans_cancel(tp);
-		} else {
-			error = xfs_trans_commit(tp);
-			if (!error)
-				xfs_inode_clear_eofblocks_tag(ip);
-		}
+	/*
+	 * Do not update the on-disk file size.  If we update the
+	 * on-disk file size and then the system crashes before the
+	 * contents of the file are flushed to disk then the files
+	 * may be full of holes (ie NULL files bug).
+	 */
+	error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
+				XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
+	if (error)
+		goto err_cancel;
 
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	}
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_unlock;
+
+	xfs_inode_clear_eofblocks_tag(ip);
+	goto out_unlock;
+
+err_cancel:
+	/*
+	 * If we get an error at this point we simply don't
+	 * bother truncating the file.
+	 */
+	xfs_trans_cancel(tp);
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c8bf02be0003..662ee537ffb5 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3568,3 +3568,35 @@ xfs_irele(
 	trace_xfs_irele(ip, _RET_IP_);
 	iput(VFS_I(ip));
 }
+
+/*
+ * Decide if this inode has post-EOF blocks.  The caller is responsible
+ * for knowing / caring about the PREALLOC/APPEND flags.
+ */
+bool
+xfs_inode_has_posteof_blocks(
+	struct xfs_inode	*ip)
+{
+	struct xfs_bmbt_irec	imap;
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		end_fsb;
+	xfs_fileoff_t		last_fsb;
+	xfs_filblks_t		map_len;
+	int			nimaps;
+	int			error;
+
+	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
+	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
+	if (last_fsb <= end_fsb)
+		return false;
+	map_len = last_fsb - end_fsb;
+
+	nimaps = 1;
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+	return !error && (nimaps != 0) &&
+	       (imap.br_startblock != HOLESTARTBLOCK ||
+	        ip->i_delayed_blks);
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index be2014520155..02a938661ba8 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -499,5 +499,6 @@ extern struct kmem_zone	*xfs_inode_zone;
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
 bool xfs_inode_verify_forks(struct xfs_inode *ip);
+bool xfs_inode_has_posteof_blocks(struct xfs_inode *ip);
 
 #endif	/* __XFS_INODE_H__ */


* [PATCH 03/12] xfs: decide if inode needs inactivation
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Add a predicate function to decide if an inode needs (deferred)
inactivation.  Any file that has been unlinked or has speculative
preallocations either for post-EOF writes or for CoW qualifies.
This function will also be used by the upcoming deferred inactivation
patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h |    1 +
 2 files changed, 53 insertions(+)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 662ee537ffb5..4121523659e3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1784,6 +1784,58 @@ xfs_inactive_ifree(
 	return 0;
 }
 
+/*
+ * Returns true if we need to update the on-disk metadata before we can free
+ * the memory used by this inode.  Updates include freeing post-eof
+ * preallocations; freeing COW staging extents; and marking the inode free in
+ * the inobt if it is on the unlinked list.
+ */
+bool
+xfs_inode_needs_inactivation(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*cow_ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+
+	/*
+	 * If the inode is already free, then there can be nothing
+	 * to clean up here.
+	 */
+	if (VFS_I(ip)->i_mode == 0)
+		return false;
+
+	/* If this is a read-only mount, don't do this (would generate I/O) */
+	if (mp->m_flags & XFS_MOUNT_RDONLY)
+		return false;
+
+	/* Try to clean out the cow blocks if there are any. */
+	if (cow_ifp && cow_ifp->if_bytes > 0)
+		return true;
+
+	if (VFS_I(ip)->i_nlink != 0) {
+		/*
+		 * force is true because we are evicting an inode from the
+		 * cache. Post-eof blocks must be freed, lest we end up with
+		 * broken free space accounting.
+		 *
+		 * Note: don't bother with iolock here since lockdep complains
+		 * about acquiring it in reclaim context. We have the only
+		 * reference to the inode at this point anyways.
+		 */
+		if (xfs_can_free_eofblocks(ip, true) &&
+		    xfs_inode_has_posteof_blocks(ip))
+			return true;
+
+		return false;
+	}
+
+	/*
+	 * Link count dropped to zero, which means we have to mark the inode
+	 * free on disk and remove it from the AGI unlinked list.
+	 */
+	return true;
+}
+
 /*
  * xfs_inactive
  *
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 02a938661ba8..c7d638aaec1b 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -500,5 +500,6 @@ extern struct kmem_zone	*xfs_inode_zone;
 
 bool xfs_inode_verify_forks(struct xfs_inode *ip);
 bool xfs_inode_has_posteof_blocks(struct xfs_inode *ip);
+bool xfs_inode_needs_inactivation(struct xfs_inode *ip);
 
 #endif	/* __XFS_INODE_H__ */

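As a preview of how this predicate gets wired up (a rough sketch only;
the real conversion of xfs_fs_destroy_inode, and the two-argument form
of xfs_inode_set_reclaim_tag, land in patch 8):

STATIC void
xfs_fs_destroy_inode(
	struct inode		*inode)
{
	struct xfs_inode	*ip = XFS_I(inode);
	bool			need_inactive;

	/* Decide up front whether on-disk updates will be required. */
	need_inactive = xfs_inode_needs_inactivation(ip);

	/* Tag the inode so the background workers will find it. */
	xfs_inode_set_reclaim_tag(ip, need_inactive);
}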

* [PATCH 04/12] xfs: track unlinked inactive inode fs summary counters
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Set up counters to track the number of inodes and blocks that will be
freed from inactivating unlinked inodes.  We'll use this in the deferred
inactivation patch to hide the effects of deferred processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h |    3 +++
 fs/xfs/xfs_super.c |   31 +++++++++++++++++++++++++++++-
 3 files changed, 87 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4121523659e3..9689479ae67d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1784,6 +1784,60 @@ xfs_inactive_ifree(
 	return 0;
 }
 
+/*
+ * Play some accounting tricks with deferred inactivation of unlinked inodes so
+ * that it looks like the inode got freed immediately.  The superblock
+ * maintains counts of the number of inodes, data blocks, and rt blocks that
+ * would be freed if we were to force inode inactivation.  These counts are
+ * added to the statfs free counters outside of the regular fdblocks/ifree
+ * counters.  If userspace actually demands those "free" resources we'll force
+ * an inactivation scan to free things for real.
+ *
+ * Note that we can safely skip the block accounting trickery for complicated
+ * situations (inode with blocks on both devices, inode block counts that seem
+ * wrong) since the worst that happens is that statfs resource usage decreases
+ * more slowly.
+ *
+ * Positive @direction means we're setting up the accounting trick and
+ * negative undoes it.
+ */
+static inline void
+xfs_inode_iadjust(
+	struct xfs_inode	*ip,
+	int			direction)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_filblks_t		iblocks;
+	int64_t			inodes = 0;
+	int64_t			dblocks = 0;
+	int64_t			rblocks = 0;
+
+	ASSERT(direction != 0);
+
+	if (VFS_I(ip)->i_nlink == 0) {
+		inodes = 1;
+
+		iblocks = max_t(int64_t, 0, ip->i_d.di_nblocks +
+					    ip->i_delayed_blks);
+		if (!XFS_IS_REALTIME_INODE(ip))
+			dblocks = iblocks;
+		else if (!XFS_IFORK_Q(ip) ||
+			 XFS_IFORK_FORMAT(ip, XFS_ATTR_FORK) ==
+					XFS_DINODE_FMT_LOCAL)
+			rblocks = iblocks;
+	}
+
+	if (direction < 0) {
+		inodes = -inodes;
+		dblocks = -dblocks;
+		rblocks = -rblocks;
+	}
+
+	percpu_counter_add(&mp->m_iinactive, inodes);
+	percpu_counter_add(&mp->m_dinactive, dblocks);
+	percpu_counter_add(&mp->m_rinactive, rblocks);
+}
+
 /*
  * Returns true if we need to update the on-disk metadata before we can free
  * the memory used by this inode.  Updates include freeing post-eof
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e344b1dfde63..ff1a94b633cf 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -67,6 +67,9 @@ typedef struct xfs_mount {
 	struct percpu_counter	m_icount;	/* allocated inodes counter */
 	struct percpu_counter	m_ifree;	/* free inodes counter */
 	struct percpu_counter	m_fdblocks;	/* free block counter */
+	struct percpu_counter	m_iinactive;	/* inodes waiting for inactivation */
+	struct percpu_counter	m_dinactive;	/* data blocks waiting for inode inactivation */
+	struct percpu_counter	m_rinactive;	/* rt blocks waiting for inode inactivation */
 
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_fsname;	/* filesystem name */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index bb4953cfd683..24eb35b8fe5a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1145,6 +1145,8 @@ xfs_fs_statfs(
 	uint64_t		icount;
 	uint64_t		ifree;
 	uint64_t		fdblocks;
+	uint64_t		iinactive;
+	uint64_t		binactive;
 	xfs_extlen_t		lsize;
 	int64_t			ffree;
 
@@ -1158,6 +1160,7 @@ xfs_fs_statfs(
 	icount = percpu_counter_sum(&mp->m_icount);
 	ifree = percpu_counter_sum(&mp->m_ifree);
 	fdblocks = percpu_counter_sum(&mp->m_fdblocks);
+	iinactive = percpu_counter_sum(&mp->m_iinactive);
 
 	spin_lock(&mp->m_sb_lock);
 	statp->f_bsize = sbp->sb_blocksize;
@@ -1181,7 +1184,7 @@ xfs_fs_statfs(
 					sbp->sb_icount);
 
 	/* make sure statp->f_ffree does not underflow */
-	ffree = statp->f_files - (icount - ifree);
+	ffree = statp->f_files - (icount - ifree) + iinactive;
 	statp->f_ffree = max_t(int64_t, ffree, 0);
 
 
@@ -1195,7 +1198,12 @@ xfs_fs_statfs(
 		statp->f_blocks = sbp->sb_rblocks;
 		statp->f_bavail = statp->f_bfree =
 			sbp->sb_frextents * sbp->sb_rextsize;
+		binactive = percpu_counter_sum(&mp->m_rinactive);
+	} else {
+		binactive = percpu_counter_sum(&mp->m_dinactive);
 	}
+	statp->f_bavail += binactive;
+	statp->f_bfree += binactive;
 
 	return 0;
 }
@@ -1564,8 +1572,26 @@ xfs_init_percpu_counters(
 	if (error)
 		goto free_ifree;
 
+	error = percpu_counter_init(&mp->m_iinactive, 0, GFP_KERNEL);
+	if (error)
+		goto free_fdblocks;
+
+	error = percpu_counter_init(&mp->m_dinactive, 0, GFP_KERNEL);
+	if (error)
+		goto free_iinactive;
+
+	error = percpu_counter_init(&mp->m_rinactive, 0, GFP_KERNEL);
+	if (error)
+		goto free_dinactive;
+
 	return 0;
 
+free_dinactive:
+	percpu_counter_destroy(&mp->m_dinactive);
+free_iinactive:
+	percpu_counter_destroy(&mp->m_iinactive);
+free_fdblocks:
+	percpu_counter_destroy(&mp->m_fdblocks);
 free_ifree:
 	percpu_counter_destroy(&mp->m_ifree);
 free_icount:
@@ -1589,6 +1615,9 @@ xfs_destroy_percpu_counters(
 	percpu_counter_destroy(&mp->m_icount);
 	percpu_counter_destroy(&mp->m_ifree);
 	percpu_counter_destroy(&mp->m_fdblocks);
+	percpu_counter_destroy(&mp->m_iinactive);
+	percpu_counter_destroy(&mp->m_dinactive);
+	percpu_counter_destroy(&mp->m_rinactive);
 }
 
 static struct xfs_mount *

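To illustrate the accounting trick with invented numbers: a user
removes a 100-block data-device file and the link count drops to zero.
Roughly:

/*
 * at destroy time:   m_iinactive += 1, m_dinactive += 100
 *                    (the real icount/ifree/fdblocks are untouched)
 * statfs reports:    f_ffree += 1; f_bfree and f_bavail += 100,
 *                    so userspace sees the space back immediately
 * after the worker:  inactivation updates ifree/fdblocks for real,
 *                    then m_iinactive -= 1, m_dinactive -= 100
 */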

* [PATCH 05/12] xfs: track unlinked inactive inode quota counters
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Set up quota counters to track the number of inodes and blocks that will
be freed from inactivating unlinked inodes.  We'll use this in the
deferred inactivation patch to hide the effects of deferred processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_dquot.c       |   45 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_dquot.h       |    4 ++++
 fs/xfs/xfs_inode.c       |    4 ++++
 fs/xfs/xfs_qm.c          |   13 +++++++++++++
 fs/xfs/xfs_qm_syscalls.c |    7 ++++---
 fs/xfs/xfs_quota.h       |    4 +++-
 fs/xfs/xfs_trace.h       |    1 +
 7 files changed, 74 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 87e6dd5326d5..14e2b732213e 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1270,3 +1270,48 @@ xfs_qm_dqiterate(
 
 	return error;
 }
+
+/* Update dquot pending-inactivation counters. */
+STATIC void
+xfs_dquot_adjust(
+	struct xfs_dquot	*dqp,
+	int			direction,
+	int64_t			inodes,
+	int64_t			dblocks,
+	int64_t			rblocks)
+{
+	xfs_dqlock(dqp);
+	dqp->q_ina_total += direction;
+	dqp->q_ina_icount += inodes;
+	dqp->q_ina_bcount += dblocks;
+	dqp->q_ina_rtbcount += rblocks;
+	xfs_dqunlock(dqp);
+}
+
+/* Update pending-inactivation counters for all dquots attached to the inode. */
+void
+xfs_qm_iadjust(
+	struct xfs_inode	*ip,
+	int			direction,
+	int64_t			inodes,
+	int64_t			dblocks,
+	int64_t			rblocks)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp) ||
+	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
+		return;
+
+	if (XFS_IS_UQUOTA_ON(mp) && ip->i_udquot)
+		xfs_dquot_adjust(ip->i_udquot, direction, inodes, dblocks,
+				rblocks);
+
+	if (XFS_IS_GQUOTA_ON(mp) && ip->i_gdquot)
+		xfs_dquot_adjust(ip->i_gdquot, direction, inodes, dblocks,
+				rblocks);
+
+	if (XFS_IS_PQUOTA_ON(mp) && ip->i_pdquot)
+		xfs_dquot_adjust(ip->i_pdquot, direction, inodes, dblocks,
+				rblocks);
+}
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index 64bd8640f6e8..4be05d44fa4a 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -45,6 +45,10 @@ typedef struct xfs_dquot {
 	xfs_qcnt_t	 q_res_bcount;	/* total regular nblks used+reserved */
 	xfs_qcnt_t	 q_res_icount;	/* total inos allocd+reserved */
 	xfs_qcnt_t	 q_res_rtbcount;/* total realtime blks used+reserved */
+	uint64_t	 q_ina_total;	/* inactive inodes attached here */
+	xfs_qcnt_t	 q_ina_bcount;	/* inactive regular nblks used+reserved */
+	xfs_qcnt_t	 q_ina_icount;	/* inactive inos allocd+reserved */
+	xfs_qcnt_t	 q_ina_rtbcount;/* inactive realtime blks used+reserved */
 	xfs_qcnt_t	 q_prealloc_lo_wmark;/* prealloc throttle wmark */
 	xfs_qcnt_t	 q_prealloc_hi_wmark;/* prealloc disabled wmark */
 	int64_t		 q_low_space[XFS_QLOWSP_MAX];
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9689479ae67d..78c88a8c947a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -41,6 +41,8 @@
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_dir2_priv.h"
+#include "xfs_dquot_item.h"
+#include "xfs_dquot.h"
 
 kmem_zone_t *xfs_inode_zone;
 
@@ -1836,6 +1838,8 @@ xfs_inode_iadjust(
 	percpu_counter_add(&mp->m_iinactive, inodes);
 	percpu_counter_add(&mp->m_dinactive, dblocks);
 	percpu_counter_add(&mp->m_rinactive, rblocks);
+
+	xfs_qm_iadjust(ip, direction, inodes, dblocks, rblocks);
 }
 
 /*
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 52ed7904df10..116b24c980fd 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -426,6 +426,19 @@ xfs_qm_dquot_isolate(
 	if (!xfs_dqlock_nowait(dqp))
 		goto out_miss_busy;
 
+	/*
+	 * An inode is on the inactive list waiting to release its resources,
+	 * so remove this dquot from the freelist and try again.  We detached
+	 * the dquot from the NEEDS_INACTIVE inode so that quotaoff won't
+	 * deadlock on inactive inodes holding dquots.
+	 */
+	if (dqp->q_ina_total > 0) {
+		xfs_dqunlock(dqp);
+		trace_xfs_dqreclaim_inactive(dqp);
+		list_lru_isolate(lru, &dqp->q_lru);
+		return LRU_REMOVED;
+	}
+
 	/*
 	 * This dquot has acquired a reference in the meantime remove it from
 	 * the freelist and try again.
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index b3190890f096..ba058b8da8a8 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -626,8 +626,8 @@ xfs_qm_scall_getquota_fill_qc(
 		XFS_FSB_TO_B(mp, be64_to_cpu(dqp->q_core.d_blk_softlimit));
 	dst->d_ino_hardlimit = be64_to_cpu(dqp->q_core.d_ino_hardlimit);
 	dst->d_ino_softlimit = be64_to_cpu(dqp->q_core.d_ino_softlimit);
-	dst->d_space = XFS_FSB_TO_B(mp, dqp->q_res_bcount);
-	dst->d_ino_count = dqp->q_res_icount;
+	dst->d_space = XFS_FSB_TO_B(mp, dqp->q_res_bcount - dqp->q_ina_bcount);
+	dst->d_ino_count = dqp->q_res_icount - dqp->q_ina_icount;
 	dst->d_spc_timer = be32_to_cpu(dqp->q_core.d_btimer);
 	dst->d_ino_timer = be32_to_cpu(dqp->q_core.d_itimer);
 	dst->d_ino_warns = be16_to_cpu(dqp->q_core.d_iwarns);
@@ -636,7 +636,8 @@ xfs_qm_scall_getquota_fill_qc(
 		XFS_FSB_TO_B(mp, be64_to_cpu(dqp->q_core.d_rtb_hardlimit));
 	dst->d_rt_spc_softlimit =
 		XFS_FSB_TO_B(mp, be64_to_cpu(dqp->q_core.d_rtb_softlimit));
-	dst->d_rt_space = XFS_FSB_TO_B(mp, dqp->q_res_rtbcount);
+	dst->d_rt_space =
+		XFS_FSB_TO_B(mp, dqp->q_res_rtbcount - dqp->q_ina_rtbcount);
 	dst->d_rt_spc_timer = be32_to_cpu(dqp->q_core.d_rtbtimer);
 	dst->d_rt_spc_warns = be16_to_cpu(dqp->q_core.d_rtbwarns);
 
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 55b798265ef7..bd95a0738086 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -103,7 +103,8 @@ extern int xfs_qm_newmount(struct xfs_mount *, uint *, uint *);
 extern void xfs_qm_mount_quotas(struct xfs_mount *);
 extern void xfs_qm_unmount(struct xfs_mount *);
 extern void xfs_qm_unmount_quotas(struct xfs_mount *);
-
+extern void xfs_qm_iadjust(struct xfs_inode *ip, int direction, int64_t inodes,
+			   int64_t dblocks, int64_t rblocks);
 #else
 static inline int
 xfs_qm_vop_dqalloc(struct xfs_inode *ip, xfs_dqid_t uid, xfs_dqid_t gid,
@@ -145,6 +146,7 @@ static inline int xfs_trans_reserve_quota_bydquots(struct xfs_trans *tp,
 #define xfs_qm_mount_quotas(mp)
 #define xfs_qm_unmount(mp)
 #define xfs_qm_unmount_quotas(mp)
+#define xfs_qm_iadjust(ip, dir, inodes, dblocks, rblocks)
 #endif /* CONFIG_XFS_QUOTA */
 
 #define xfs_trans_unreserve_quota_nblks(tp, ip, nblks, ninos, flags) \
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6fcc893dfc91..5127f91ce3fd 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -901,6 +901,7 @@ DEFINE_EVENT(xfs_dquot_class, name, \
 	TP_PROTO(struct xfs_dquot *dqp), \
 	TP_ARGS(dqp))
 DEFINE_DQUOT_EVENT(xfs_dqadjust);
+DEFINE_DQUOT_EVENT(xfs_dqreclaim_inactive);
 DEFINE_DQUOT_EVENT(xfs_dqreclaim_want);
 DEFINE_DQUOT_EVENT(xfs_dqreclaim_dirty);
 DEFINE_DQUOT_EVENT(xfs_dqreclaim_busy);

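The same idea with invented numbers on the quota side: a user with
q_res_icount = 50 and q_res_bcount = 1000 removes a 100-block file.
While the inode waits for inactivation, q_ina_icount = 1 and
q_ina_bcount = 100, so the getquota path above reports:

	dst->d_ino_count = 50 - 1;			/* 49 inodes */
	dst->d_space = XFS_FSB_TO_B(mp, 1000 - 100);	/* 900 blocks */

which is exactly what the user would have seen had inactivation run
synchronously.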

* [PATCH 06/12] xfs: refactor walking of per-AG RECLAIM inodes
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor the code that walks reclaim-tagged inodes so that we can reuse
the same loop in a subsequent patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c |  170 +++++++++++++++++++++++++++++----------------------
 1 file changed, 97 insertions(+), 73 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 36d986087abb..7e031eb6f048 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1004,9 +1004,9 @@ xfs_inode_ag_iterator_tag(
 
 /*
  * Grab the inode for reclaim exclusively.
- * Return 0 if we grabbed it, non-zero otherwise.
+ * Return true if we grabbed it, false otherwise.
  */
-STATIC int
+STATIC bool
 xfs_reclaim_inode_grab(
 	struct xfs_inode	*ip,
 	int			flags)
@@ -1015,7 +1015,7 @@ xfs_reclaim_inode_grab(
 
 	/* quick check for stale RCU freed inode */
 	if (!ip->i_ino)
-		return 1;
+		return false;
 
 	/*
 	 * If we are asked for non-blocking operation, do unlocked checks to
@@ -1024,7 +1024,7 @@ xfs_reclaim_inode_grab(
 	 */
 	if ((flags & SYNC_TRYLOCK) &&
 	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
-		return 1;
+		return false;
 
 	/*
 	 * The radix tree lock here protects a thread in xfs_iget from racing
@@ -1041,11 +1041,11 @@ xfs_reclaim_inode_grab(
 	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
 		/* not a reclaim candidate. */
 		spin_unlock(&ip->i_flags_lock);
-		return 1;
+		return false;
 	}
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
 	spin_unlock(&ip->i_flags_lock);
-	return 0;
+	return true;
 }
 
 /*
@@ -1223,6 +1223,90 @@ xfs_reclaim_inode(
 	return 0;
 }
 
+/*
+ * Walk the RECLAIM tagged inodes in this AG looking for inodes to inactivate.
+ */
+STATIC int
+xfs_walk_ag_reclaim_inos(
+	struct xfs_perag	*pag,
+	int			sync_flags,
+	bool			(*grab_fn)(struct xfs_inode *ip,
+					   int sync_flags),
+	int			(*execute_fn)(struct xfs_inode *ip,
+					      struct xfs_perag *pag,
+					      int sync_flags),
+	int			*nr_to_scan,
+	bool			*done)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+	unsigned long		first_index = 0;
+	int			nr_found = 0;
+	int			last_error = 0;
+	int			error;
+
+	do {
+		struct xfs_inode *batch[XFS_LOOKUP_BATCH];
+		int	i;
+
+		rcu_read_lock();
+		nr_found = radix_tree_gang_lookup_tag(&pag->pag_ici_root,
+				(void **)batch, first_index, XFS_LOOKUP_BATCH,
+				XFS_ICI_RECLAIM_TAG);
+		if (!nr_found) {
+			*done = true;
+			rcu_read_unlock();
+			break;
+		}
+
+		/*
+		 * Grab the inodes before we drop the lock. If we found
+		 * nothing, nr == 0 and the loop will be skipped.
+		 */
+		for (i = 0; i < nr_found; i++) {
+			struct xfs_inode *ip = batch[i];
+
+			if (*done || !grab_fn(ip, sync_flags))
+				batch[i] = NULL;
+
+			/*
+			 * Update the index for the next lookup. Catch
+			 * overflows into the next AG range which can occur if
+			 * we have inodes in the last block of the AG and we
+			 * are currently pointing to the last inode.
+			 *
+			 * Because we may see inodes that are from the wrong AG
+			 * due to RCU freeing and reallocation, only update the
+			 * index if it lies in this AG. It was a race that led
+			 * us to see this inode, so another lookup from the
+			 * same index will not find it again.
+			 */
+			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
+				continue;
+			first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
+			if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
+				*done = true;
+		}
+
+		/* unlock now we've grabbed the inodes. */
+		rcu_read_unlock();
+
+		for (i = 0; i < nr_found; i++) {
+			if (!batch[i])
+				continue;
+			error = execute_fn(batch[i], pag, sync_flags);
+			if (error && last_error != -EFSCORRUPTED)
+				last_error = error;
+		}
+
+		*nr_to_scan -= XFS_LOOKUP_BATCH;
+
+		cond_resched();
+
+	} while (nr_found && !*done && *nr_to_scan > 0);
+
+	return last_error;
+}
+
 /*
  * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
  * corrupted, we still want to try to reclaim all the inodes. If we don't,
@@ -1236,8 +1320,8 @@ xfs_reclaim_inodes_ag(
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	int			error = 0;
 	int			last_error = 0;
+	int			error;
 	xfs_agnumber_t		ag;
 	int			trylock = flags & SYNC_TRYLOCK;
 	int			skipped;
@@ -1247,8 +1331,7 @@ xfs_reclaim_inodes_ag(
 	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
-		int		done = 0;
-		int		nr_found = 0;
+		bool		done = false;
 
 		ag = pag->pag_agno + 1;
 
@@ -1262,70 +1345,11 @@ xfs_reclaim_inodes_ag(
 		} else
 			mutex_lock(&pag->pag_ici_reclaim_lock);
 
-		do {
-			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
-			int	i;
-
-			rcu_read_lock();
-			nr_found = radix_tree_gang_lookup_tag(
-					&pag->pag_ici_root,
-					(void **)batch, first_index,
-					XFS_LOOKUP_BATCH,
-					XFS_ICI_RECLAIM_TAG);
-			if (!nr_found) {
-				done = 1;
-				rcu_read_unlock();
-				break;
-			}
-
-			/*
-			 * Grab the inodes before we drop the lock. if we found
-			 * nothing, nr == 0 and the loop will be skipped.
-			 */
-			for (i = 0; i < nr_found; i++) {
-				struct xfs_inode *ip = batch[i];
-
-				if (done || xfs_reclaim_inode_grab(ip, flags))
-					batch[i] = NULL;
-
-				/*
-				 * Update the index for the next lookup. Catch
-				 * overflows into the next AG range which can
-				 * occur if we have inodes in the last block of
-				 * the AG and we are currently pointing to the
-				 * last inode.
-				 *
-				 * Because we may see inodes that are from the
-				 * wrong AG due to RCU freeing and
-				 * reallocation, only update the index if it
-				 * lies in this AG. It was a race that lead us
-				 * to see this inode, so another lookup from
-				 * the same index will not find it again.
-				 */
-				if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
-								pag->pag_agno)
-					continue;
-				first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
-				if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
-					done = 1;
-			}
-
-			/* unlock now we've grabbed the inodes. */
-			rcu_read_unlock();
-
-			for (i = 0; i < nr_found; i++) {
-				if (!batch[i])
-					continue;
-				error = xfs_reclaim_inode(batch[i], pag, flags);
-				if (error && last_error != -EFSCORRUPTED)
-					last_error = error;
-			}
-
-			*nr_to_scan -= XFS_LOOKUP_BATCH;
-
-			cond_resched();
-
-		} while (nr_found && !done && *nr_to_scan > 0);
+		error = xfs_walk_ag_reclaim_inos(pag, flags,
+				xfs_reclaim_inode_grab, xfs_reclaim_inode,
+				nr_to_scan, &done);
+		if (error && last_error != -EFSCORRUPTED)
+			last_error = error;
 
 		if (trylock && !done)
 			pag->pag_ici_reclaim_cursor = first_index;

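The point of the grab_fn/execute_fn pointers is that a second scan type
can reuse the same batched RCU walk.  Patch 8, which also threads an
xfs_eofblocks filter through the walker, will plug in like so:

	/* reclaim pass (this patch) */
	error = xfs_walk_ag_reclaim_inos(pag, flags,
			xfs_reclaim_inode_grab, xfs_reclaim_inode,
			nr_to_scan, &done);

	/* inactivation pass (patch 8) */
	error = xfs_walk_ag_reclaim_inos(pag, eofb, 0,
			xfs_inactive_inode_grab, xfs_inactive_inode,
			&nr_to_scan, &done);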

* [PATCH 07/12] xfs: refactor eofblocks inode match code
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor the code that determines if an inode matches an eofblocks
structure into a helper, since we already use it twice and we're about
to use it a third time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c |  146 ++++++++++++++++++++++++++-------------------------
 1 file changed, 74 insertions(+), 72 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7e031eb6f048..c1b457ba1b7b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -27,6 +27,76 @@
 #include <linux/freezer.h>
 #include <linux/iversion.h>
 
+STATIC int
+xfs_inode_match_id(
+	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb)
+{
+	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
+	    !uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
+		return 0;
+
+	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
+	    !gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
+		return 0;
+
+	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
+	    xfs_get_projid(ip) != eofb->eof_prid)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * A union-based inode filtering algorithm. Process the inode if any of the
+ * criteria match. This is for global/internal scans only.
+ */
+STATIC int
+xfs_inode_match_id_union(
+	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb)
+{
+	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
+	    uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
+		return 1;
+
+	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
+	    gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
+		return 1;
+
+	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
+	    xfs_get_projid(ip) == eofb->eof_prid)
+		return 1;
+
+	return 0;
+}
+
+/* Does this inode match the given parameters? */
+STATIC bool
+xfs_inode_matches_eofb(
+	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb)
+{
+	int			match;
+
+	if (!eofb)
+		return true;
+
+	if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
+		match = xfs_inode_match_id_union(ip, eofb);
+	else
+		match = xfs_inode_match_id(ip, eofb);
+	if (!match)
+		return false;
+
+	/* skip the inode if the file size is too small */
+	if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
+	    XFS_ISIZE(ip) < eofb->eof_min_file_size)
+		return false;
+
+	return true;
+}
+
 /*
  * Allocate and initialise an xfs_inode.
  */
@@ -1424,50 +1494,6 @@ xfs_reclaim_inodes_count(
 	return reclaimable;
 }
 
-STATIC int
-xfs_inode_match_id(
-	struct xfs_inode	*ip,
-	struct xfs_eofblocks	*eofb)
-{
-	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
-	    !uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
-		return 0;
-
-	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
-	    !gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
-		return 0;
-
-	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
-	    xfs_get_projid(ip) != eofb->eof_prid)
-		return 0;
-
-	return 1;
-}
-
-/*
- * A union-based inode filtering algorithm. Process the inode if any of the
- * criteria match. This is for global/internal scans only.
- */
-STATIC int
-xfs_inode_match_id_union(
-	struct xfs_inode	*ip,
-	struct xfs_eofblocks	*eofb)
-{
-	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
-	    uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
-		return 1;
-
-	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
-	    gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
-		return 1;
-
-	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
-	    xfs_get_projid(ip) == eofb->eof_prid)
-		return 1;
-
-	return 0;
-}
-
 STATIC int
 xfs_inode_free_eofblocks(
 	struct xfs_inode	*ip,
@@ -1476,7 +1502,6 @@ xfs_inode_free_eofblocks(
 {
 	int ret = 0;
 	struct xfs_eofblocks *eofb = args;
-	int match;
 
 	if (!xfs_can_free_eofblocks(ip, false)) {
 		/* inode could be preallocated or append-only */
@@ -1493,19 +1518,8 @@ xfs_inode_free_eofblocks(
 	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY))
 		return 0;
 
-	if (eofb) {
-		if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
-			match = xfs_inode_match_id_union(ip, eofb);
-		else
-			match = xfs_inode_match_id(ip, eofb);
-		if (!match)
-			return 0;
-
-		/* skip the inode if the file size is too small */
-		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
-		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
-			return 0;
-	}
+	if (!xfs_inode_matches_eofb(ip, eofb))
+		return 0;
 
 	/*
 	 * If the caller is waiting, return -EAGAIN to keep the background
@@ -1766,25 +1780,13 @@ xfs_inode_free_cowblocks(
 {
 	struct xfs_eofblocks	*eofb = args;
 	uint			lock_mode = 0;
-	int			match;
 	int			ret = 0;
 
 	if (!xfs_prep_free_cowblocks(ip))
 		return 0;
 
-	if (eofb) {
-		if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
-			match = xfs_inode_match_id_union(ip, eofb);
-		else
-			match = xfs_inode_match_id(ip, eofb);
-		if (!match)
-			return 0;
-
-		/* skip the inode if the file size is too small */
-		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
-		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
-			return 0;
-	}
+	if (!xfs_inode_matches_eofb(ip, eofb))
+		return 0;
 
 	/*
 	 * Free the CoW blocks.  We don't need to lock the inode if we're in

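For reference, a caller-side sketch of the consolidated filter; the
initializer values here are made up for illustration:

	struct xfs_eofblocks	eofb = {
		.eof_flags		= XFS_EOF_FLAGS_UID |
					  XFS_EOF_FLAGS_MINFILESIZE,
		.eof_uid		= make_kuid(current_user_ns(), 1000),
		.eof_min_file_size	= 65536,
	};

	if (!xfs_inode_matches_eofb(ip, &eofb))
		return 0;	/* skip inodes that don't match */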

* [PATCH 08/12] xfs: deferred inode inactivation
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
defer the inactivation phase to a separate workqueue.  With this we
avoid blocking memory reclaim on filesystem metadata updates that are
necessary to free an in-core inode, such as post-eof block freeing, COW
staging extent freeing, and truncating and freeing unlinked inodes.  Now
that work is deferred to a workqueue where we can do the freeing in
batches.

We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
The first flag helps our worker find inodes needing inactivation, and
the second flag marks inodes that are in the process of being
inactivated.  A concurrent xfs_iget on the inode can still resurrect the
inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).

Unfortunately, deferring the inactivation has one huge downside --
eventual consistency.  Since all the freeing is deferred to a worker
thread, one can rm a file but the space doesn't come back immediately.
This can cause some odd side effects with quota accounting and statfs,
so we also force inactivation scans in order to maintain the existing
behaviors, at least outwardly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c      |  309 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h      |    8 +
 fs/xfs/xfs_inode.c       |   70 ++++++++++
 fs/xfs/xfs_inode.h       |   15 ++
 fs/xfs/xfs_iomap.c       |    1 
 fs/xfs/xfs_log_recover.c |    7 +
 fs/xfs/xfs_mount.c       |   16 ++
 fs/xfs/xfs_mount.h       |    5 +
 fs/xfs/xfs_qm_syscalls.c |    6 +
 fs/xfs/xfs_super.c       |   59 +++++++--
 fs/xfs/xfs_trace.h       |   11 +-
 11 files changed, 479 insertions(+), 28 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index c1b457ba1b7b..2386a2f3e1d0 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -225,6 +225,19 @@ xfs_reclaim_work_queue(
 	rcu_read_unlock();
 }
 
+/* Queue a new inode inactivation pass if there are reclaimable inodes. */
+static void
+xfs_inactive_work_queue(
+	struct xfs_mount        *mp)
+{
+	rcu_read_lock();
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG))
+		queue_delayed_work(mp->m_inactive_workqueue,
+				&mp->m_inactive_work,
+				msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
+	rcu_read_unlock();
+}
+
 /*
  * This is a fast pass over the inode cache to try to get reclaim moving on as
  * many inodes as possible in a short period of time. It kicks itself every few
@@ -243,9 +256,85 @@ xfs_reclaim_worker(
 	xfs_reclaim_work_queue(mp);
 }
 
+/*
+ * Set the per-ag "inodes awaiting inactivation" tag.  This isn't a real tag;
+ * we overload the RECLAIM tag to cover both inactive and reclaimable inodes.
+ * We maintain separate perag counters for both types, and move counts as inodes
+ * wander the state machine.
+ *
+ * When an inode hits zero refcount, we:
+ *   - Set the RECLAIMABLE inode flag
+ *   - Set the RECLAIM tag in the per-AG inode tree
+ *   - Set the RECLAIM tag in the per-fs AG tree
+ *
+ * If the inode needs inactivation, we:
+ *   - Set the NEED_INACTIVE inode flag
+ *   - Increment the per-AG inactive count
+ *   - Schedule background inode inactivation
+ *
+ * If the inode did not need inactivation, we:
+ *   - Increment the per-AG reclaim count
+ *   - Schedule background inode reclamation
+ *
+ * When it is time for background inode inactivation, we:
+ *   - Set the INACTIVATING inode flag
+ *   - Make all the on-disk updates
+ *   - Clear both INACTIVATING and NEED_INACTIVE inode flags
+ *   - Decrement the per-AG inactive count
+ *   - Increment the per-AG reclaim count
+ *   - Schedule background inode reclamation
+ *
+ * When it is time for background inode reclamation, we:
+ *   - Set the IRECLAIM inode flag
+ *   - Detach all the resources and remove the inode from the per-AG inode tree
+ *   - Clear both IRECLAIM and RECLAIMABLE inode flags
+ *   - Decrement the per-AG reclaim count
+ *   - Clear the RECLAIM tag from the per-AG inode tree
+ *   - Clear the RECLAIM tag from the per-fs AG tree if there are no more
+ *     inodes waiting for reclamation or inactivation
+ */
 static void
-xfs_perag_set_reclaim_tag(
+xfs_perag_set_inactive_tag(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+
+	lockdep_assert_held(&pag->pag_ici_lock);
+	if (pag->pag_ici_inactive++ == 0) {
+		/* propagate the reclaim tag up into the perag radix tree */
+		spin_lock(&mp->m_perag_lock);
+		radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno,
+				   XFS_ICI_RECLAIM_TAG);
+		spin_unlock(&mp->m_perag_lock);
+	}
+
+	/*
+	 * Schedule periodic background inode inactivation.  Inactivation can
+	 * take a while, so we allow the deferral of an already-scheduled
+	 * inactivation on the grounds that xfs_fs_destroy_inode has a better
+	 * idea of when it ought to force inactivation, and in the meantime
+	 * we prefer batching.
+	 */
+	xfs_inactive_work_queue(mp);
+
+	trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
+}
+
+/* Move an inode from inactive to reclaim. */
+static void
+xfs_perag_clear_inactive_tag(
 	struct xfs_perag	*pag)
+{
+	lockdep_assert_held(&pag->pag_ici_lock);
+	pag->pag_ici_inactive--;
+	pag->pag_ici_reclaimable++;
+}
+
+static void
+xfs_perag_set_reclaim_tag(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = pag->pag_mount;
 
@@ -272,7 +361,7 @@ xfs_perag_clear_reclaim_tag(
 	struct xfs_mount	*mp = pag->pag_mount;
 
 	lockdep_assert_held(&pag->pag_ici_lock);
-	if (--pag->pag_ici_reclaimable)
+	if (--pag->pag_ici_reclaimable || pag->pag_ici_inactive > 0)
 		return;
 
 	/* clear the reclaim tag from the perag radix tree */
@@ -291,10 +380,12 @@ xfs_perag_clear_reclaim_tag(
  */
 void
 xfs_inode_set_reclaim_tag(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	bool			need_inactive)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
+	unsigned long		iflags = need_inactive ? XFS_NEED_INACTIVE : 0;
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
@@ -302,8 +393,11 @@ xfs_inode_set_reclaim_tag(
 
 	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
 			   XFS_ICI_RECLAIM_TAG);
-	xfs_perag_set_reclaim_tag(pag);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+	if (need_inactive)
+		xfs_perag_set_inactive_tag(pag, ip);
+	else
+		xfs_perag_set_reclaim_tag(pag, ip);
+	__xfs_iflags_set(ip, XFS_IRECLAIMABLE | iflags);
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -382,6 +476,13 @@ xfs_iget_check_free_state(
 	struct xfs_inode	*ip,
 	int			flags)
 {
+	/*
+	 * Unlinked inodes awaiting inactivation must not be reused until we
+	 * have a chance to clear the on-disk metadata.
+	 */
+	if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE))
+		return -ENOENT;
+
 	if (flags & XFS_IGET_CREATE) {
 		/* should be a free inode */
 		if (VFS_I(ip)->i_mode != 0) {
@@ -441,14 +542,14 @@ xfs_iget_cache_hit(
 	/*
 	 * If we are racing with another cache hit that is currently
 	 * instantiating this inode or currently recycling it out of
-	 * reclaimabe state, wait for the initialisation to complete
+	 * reclaimable state, wait for the initialisation to complete
 	 * before continuing.
 	 *
 	 * XXX(hch): eventually we should do something equivalent to
 	 *	     wait_on_inode to wait for these flags to be cleared
 	 *	     instead of polling for it.
 	 */
-	if (ip->i_flags & (XFS_INEW|XFS_IRECLAIM)) {
+	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING)) {
 		trace_xfs_iget_skip(ip);
 		XFS_STATS_INC(mp, xs_ig_frecycle);
 		error = -EAGAIN;
@@ -468,6 +569,8 @@ xfs_iget_cache_hit(
 	 * Need to carefully get it back into useable state.
 	 */
 	if (ip->i_flags & XFS_IRECLAIMABLE) {
+		bool	needed_inactive;
+
 		trace_xfs_iget_reclaim(ip);
 
 		if (flags & XFS_IGET_INCORE) {
@@ -475,17 +578,34 @@ xfs_iget_cache_hit(
 			goto out_error;
 		}
 
+		/*
+		 * If we played inactivation accounting tricks with this inode
+		 * we have to undo them prior to resurrecting this inode.
+		 */
+		needed_inactive = (ip->i_flags & XFS_NEED_INACTIVE);
+
 		/*
 		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
 		 * from stomping over us while we recycle the inode.  We can't
 		 * clear the radix tree reclaimable tag yet as it requires
 		 * pag_ici_lock to be held exclusive.
+		 *
+		 * Clear NEED_INACTIVE so that the inactive worker won't
+		 * touch this inode now that we're trying to resurrect it.
 		 */
 		ip->i_flags |= XFS_IRECLAIM;
+		ip->i_flags &= ~XFS_NEED_INACTIVE;
 
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 
+		if (needed_inactive) {
+			xfs_inode_inactivation_cleanup(ip);
+			spin_lock(&pag->pag_ici_lock);
+			xfs_perag_clear_inactive_tag(pag);
+			spin_unlock(&pag->pag_ici_lock);
+		}
+
 		error = xfs_reinit_inode(mp, inode);
 		if (error) {
 			bool wake;
@@ -1079,6 +1199,7 @@ xfs_inode_ag_iterator_tag(
 STATIC bool
 xfs_reclaim_inode_grab(
 	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb,
 	int			flags)
 {
 	ASSERT(rcu_read_lock_held());
@@ -1108,7 +1229,8 @@ xfs_reclaim_inode_grab(
 	 */
 	spin_lock(&ip->i_flags_lock);
 	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
-	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
+	    __xfs_iflags_test(ip, XFS_IRECLAIM) ||
+	    __xfs_iflags_test(ip, XFS_NEED_INACTIVE)) {
 		/* not a reclaim candidate. */
 		spin_unlock(&ip->i_flags_lock);
 		return false;
@@ -1167,6 +1289,8 @@ xfs_reclaim_inode(
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 	int			error;
 
+	trace_xfs_inode_reclaiming(ip);
+
 restart:
 	error = 0;
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
@@ -1299,8 +1423,10 @@ xfs_reclaim_inode(
 STATIC int
 xfs_walk_ag_reclaim_inos(
 	struct xfs_perag	*pag,
+	struct xfs_eofblocks	*eofb,
 	int			sync_flags,
 	bool			(*grab_fn)(struct xfs_inode *ip,
+					   struct xfs_eofblocks *eofb,
 					   int sync_flags),
 	int			(*execute_fn)(struct xfs_inode *ip,
 					      struct xfs_perag *pag,
@@ -1335,7 +1461,7 @@ xfs_walk_ag_reclaim_inos(
 		for (i = 0; i < nr_found; i++) {
 			struct xfs_inode *ip = batch[i];
 
-			if (*done || !grab_fn(ip, sync_flags))
+			if (*done || !grab_fn(ip, eofb, sync_flags))
 				batch[i] = NULL;
 
 			/*
@@ -1415,7 +1541,7 @@ xfs_reclaim_inodes_ag(
 		} else
 			mutex_lock(&pag->pag_ici_reclaim_lock);
 
-		error = xfs_walk_ag_reclaim_inos(pag, flags,
+		error = xfs_walk_ag_reclaim_inos(pag, NULL, flags,
 				xfs_reclaim_inode_grab, xfs_reclaim_inode,
 				nr_to_scan, &done);
 		if (error && last_error != -EFSCORRUPTED)
@@ -1494,6 +1620,161 @@ xfs_reclaim_inodes_count(
 	return reclaimable;
 }
 
+/*
+ * Grab the inode for inactivation exclusively.
+ * Return true if we grabbed it.
+ */
+STATIC bool
+xfs_inactive_inode_grab(
+	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb,
+	int			sync_flags)
+{
+	ASSERT(rcu_read_lock_held());
+
+	/* quick check for stale RCU freed inode */
+	if (!ip->i_ino)
+		return false;
+
+	/*
+	 * The radix tree lock here protects a thread in xfs_iget from racing
+	 * with us starting reclaim on the inode.
+	 *
+	 * Due to RCU lookup, we may find inodes that have been freed and only
+	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
+	 * aren't candidates for reclaim at all, so we must check that
+	 * XFS_IRECLAIMABLE is set before proceeding.
+	 * Obviously if XFS_NEED_INACTIVE isn't set then we ignore this inode.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	if (!(ip->i_flags & XFS_IRECLAIMABLE) ||
+	    !(ip->i_flags & XFS_NEED_INACTIVE) ||
+	    (ip->i_flags & XFS_INACTIVATING)) {
+		/* not an inactivation candidate. */
+		spin_unlock(&ip->i_flags_lock);
+		return false;
+	}
+
+	if (!xfs_inode_matches_eofb(ip, eofb)) {
+		spin_unlock(&ip->i_flags_lock);
+		return false;
+	}
+
+	ip->i_flags |= XFS_INACTIVATING;
+	spin_unlock(&ip->i_flags_lock);
+	return true;
+}
+
+/* Inactivate this inode. */
+STATIC int
+xfs_inactive_inode(
+	struct xfs_inode	*ip,
+	struct xfs_perag	*pag,
+	int			sync_flags)
+{
+	ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS);
+
+	trace_xfs_inode_inactivating(ip);
+
+	/* Update metadata prior to freeing inode. */
+	xfs_inode_inactivation_cleanup(ip);
+	xfs_inactive(ip);
+	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
+	spin_lock(&pag->pag_ici_lock);
+	xfs_perag_clear_inactive_tag(pag);
+	spin_unlock(&pag->pag_ici_lock);
+
+	/* Kick the inactive inode to reclaim now that we've made updates. */
+	spin_lock(&ip->i_flags_lock);
+	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
+	ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
+	xfs_reclaim_work_queue(ip->i_mount);
+	spin_unlock(&ip->i_flags_lock);
+	return 0;
+}
+
+/*
+ * Walk the AGs and inactivate the inodes in them. Even if the filesystem is
+ * corrupted, we still need to clear the INACTIVE iflag so that we can move
+ * on to reclaiming the inode.
+ */
+int
+xfs_inactive_inodes(
+	struct xfs_mount	*mp,
+	struct xfs_eofblocks	*eofb)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	int			last_error = 0;
+	int			error;
+
+	/*
+	 * We want to skip inode inactivation while the filesystem is frozen
+	 * because we don't want the inactivation thread to block while taking
+	 * sb_intwrite.  Therefore, we try to take sb_write for the duration
+	 * of the inactive scan -- a freeze attempt will block until we're
+	 * done here, and if the fs is past stage 1 freeze we'll bounce out
+	 * until things unfreeze.  If the fs goes down while frozen we'll
+	 * still have log recovery to clean up after us.
+	 */
+	if (!sb_start_write_trylock(mp->m_super))
+		return -EAGAIN;
+
+	agno = 0;
+	while ((pag = xfs_perag_get_tag(mp, agno, XFS_ICI_RECLAIM_TAG))) {
+		int		nr_to_scan = INT_MAX;
+		bool		done = false;
+
+		agno = pag->pag_agno + 1;
+		error = xfs_walk_ag_reclaim_inos(pag, eofb, 0,
+				xfs_inactive_inode_grab, xfs_inactive_inode,
+				&nr_to_scan, &done);
+		if (error && last_error != -EFSCORRUPTED)
+			last_error = error;
+		xfs_perag_put(pag);
+	}
+
+	sb_end_write(mp->m_super);
+	return last_error;
+}
+
+/* Try to get inode inactivation moving. */
+void
+xfs_inactive_worker(
+	struct work_struct	*work)
+{
+	struct xfs_mount	*mp = container_of(to_delayed_work(work),
+					struct xfs_mount, m_inactive_work);
+	int			error;
+
+	error = xfs_inactive_inodes(mp, NULL);
+	if (error && error != -EAGAIN)
+		xfs_err(mp, "inode inactivation failed, error %d", error);
+	xfs_inactive_work_queue(mp);
+}
+
+/* Flush all inode inactivation work that might be queued. */
+void
+xfs_inactive_force(
+	struct xfs_mount	*mp)
+{
+	queue_delayed_work(mp->m_inactive_workqueue, &mp->m_inactive_work, 0);
+	flush_delayed_work(&mp->m_inactive_work);
+}
+
+/*
+ * Flush all inode inactivation work that might be queued and make sure the
+ * delayed work item is not queued.
+ */
+void
+xfs_inactive_deactivate(
+	struct xfs_mount	*mp)
+{
+	cancel_delayed_work_sync(&mp->m_inactive_work);
+	flush_workqueue(mp->m_inactive_workqueue);
+	xfs_inactive_inodes(mp, NULL);
+}
+
 STATIC int
 xfs_inode_free_eofblocks(
 	struct xfs_inode	*ip,
@@ -1608,6 +1889,14 @@ __xfs_inode_free_quota_eofblocks(
 	return scan;
 }
 
+/* Flush any inode with the same quota as this inode. */
+int
+xfs_inactive_free_quota(
+	struct xfs_inode	*ip)
+{
+	return __xfs_inode_free_quota_eofblocks(ip, xfs_inactive_inodes);
+}
+
 int
 xfs_inode_free_quota_eofblocks(
 	struct xfs_inode *ip)
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 26c0626f1f75..fd4073debd6e 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -55,7 +55,7 @@ int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
-void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
+void xfs_inode_set_reclaim_tag(struct xfs_inode *ip, bool need_inactive);
 
 void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip);
 void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip);
@@ -122,4 +122,10 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_icache_disable_reclaim(struct xfs_mount *mp);
 void xfs_icache_enable_reclaim(struct xfs_mount *mp);
 
+void xfs_inactive_worker(struct work_struct *work);
+int xfs_inactive_inodes(struct xfs_mount *mp, struct xfs_eofblocks *eofb);
+void xfs_inactive_force(struct xfs_mount *mp);
+void xfs_inactive_deactivate(struct xfs_mount *mp);
+int xfs_inactive_free_quota(struct xfs_inode *ip);
+
 #endif
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 78c88a8c947a..26c8181f5760 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1842,6 +1842,62 @@ xfs_inode_iadjust(
 	xfs_qm_iadjust(ip, direction, inodes, dblocks, rblocks);
 }
 
+/* Clean up inode inactivation. */
+void
+xfs_inode_inactivation_cleanup(
+	struct xfs_inode	*ip)
+{
+	int			ret;
+
+	/*
+	 * Undo the pending-inactivation counter updates since we're bringing
+	 * this inode back to life.
+	 */
+	ret = xfs_qm_dqattach(ip);
+	if (ret)
+		xfs_err(ip->i_mount, "error %d reactivating inode quota", ret);
+
+	xfs_inode_iadjust(ip, -1);
+}
+
+/* Prepare inode for inactivation. */
+void
+xfs_inode_inactivation_prep(
+	struct xfs_inode	*ip)
+{
+	int			ret;
+
+	/*
+	 * If this inode is unlinked (and now unreferenced) we need to dispose
+	 * of it in the on-disk metadata.
+	 *
+	 * Bump generation so that the inode can't be opened by handle now that
+	 * the last external reference has dropped.  Bulkstat won't return
+	 * inodes with zero nlink so nobody will ever find this inode again.
+	 * Then add this inode & blocks to the counts of things that will be
+	 * freed during the next inactivation run.
+	 */
+	if (VFS_I(ip)->i_nlink == 0)
+		VFS_I(ip)->i_generation++;
+
+	/*
+	 * Increase the pending-inactivation counters so that the fs looks like
+	 * it's free.
+	 */
+	ret = xfs_qm_dqattach(ip);
+	if (ret)
+		xfs_err(ip->i_mount, "error %d inactivating inode quota", ret);
+
+	xfs_inode_iadjust(ip, 1);
+
+	/*
+	 * Detach dquots just in case someone tries a quotaoff while
+	 * the inode is waiting on the inactive list.  We'll reattach
+	 * them (if needed) when inactivating the inode.
+	 */
+	xfs_qm_dqdetach(ip);
+}
+
 /*
  * Returns true if we need to update the on-disk metadata before we can free
  * the memory used by this inode.  Updates include freeing post-eof
@@ -1926,6 +1982,16 @@ xfs_inactive(
 	if (mp->m_flags & XFS_MOUNT_RDONLY)
 		return;
 
+	/*
+	 * Re-attach dquots prior to freeing EOF blocks or CoW staging extents.
+	 * We dropped the dquot prior to inactivation (because quotaoff can't
+	 * resurrect inactive inodes to force-drop the dquot) so we /must/
+	 * do this before touching any block mappings.
+	 */
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return;
+
 	/* Try to clean out the cow blocks if there are any. */
 	if (xfs_inode_has_cow_data(ip))
 		xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true);
@@ -1951,10 +2017,6 @@ xfs_inactive(
 	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0))
 		truncate = 1;
 
-	error = xfs_qm_dqattach(ip);
-	if (error)
-		return;
-
 	if (S_ISLNK(VFS_I(ip)->i_mode))
 		error = xfs_inactive_symlink(ip);
 	else if (truncate)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c7d638aaec1b..e95e30a94243 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -214,6 +214,7 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
 #define XFS_IRECLAIMABLE	(1 << 2) /* inode can be reclaimed */
 #define __XFS_INEW_BIT		3	 /* inode has just been allocated */
 #define XFS_INEW		(1 << __XFS_INEW_BIT)
+#define XFS_NEED_INACTIVE	(1 << 4) /* see XFS_INACTIVATING below */
 #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
 #define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
 #define __XFS_IFLOCK_BIT	7	 /* inode is being flushed right now */
@@ -230,6 +231,15 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
 #define XFS_IRECOVERY		(1 << 11)
 #define XFS_ICOWBLOCKS		(1 << 12)/* has the cowblocks tag set */
 
+/*
+ * If we need to update on-disk metadata before this IRECLAIMABLE inode can be
+ * freed, then NEED_INACTIVE will be set.  Once we start the updates, the
+ * INACTIVATING bit will be set to keep iget away from this inode.  After the
+ * inactivation completes, both flags will be cleared and the inode is a
+ * plain old IRECLAIMABLE inode.
+ */
+#define XFS_INACTIVATING	(1 << 13)
+
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
  * inode lookup. This prevents unintended behaviour on the new inode from
@@ -237,7 +247,8 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
-	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
+	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
+	 XFS_INACTIVATING)
 
 /*
  * Synchronize processes attempting to flush the in-core inode back to disk.
@@ -501,5 +512,7 @@ extern struct kmem_zone	*xfs_inode_zone;
 bool xfs_inode_verify_forks(struct xfs_inode *ip);
 bool xfs_inode_has_posteof_blocks(struct xfs_inode *ip);
 bool xfs_inode_needs_inactivation(struct xfs_inode *ip);
+void xfs_inode_inactivation_prep(struct xfs_inode *ip);
+void xfs_inode_inactivation_cleanup(struct xfs_inode *ip);
 
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 27c93b5f029d..59772a7c7586 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -449,6 +449,7 @@ xfs_iomap_prealloc_size(
 				       alloc_blocks);
 
 	freesp = percpu_counter_read_positive(&mp->m_fdblocks);
+	freesp += percpu_counter_read_positive(&mp->m_dinactive);
 	if (freesp < mp->m_low_space[XFS_LOWSP_5_PCNT]) {
 		shift = 2;
 		if (freesp < mp->m_low_space[XFS_LOWSP_4_PCNT])
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index ff9a27834c50..0f432bc33826 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -5162,6 +5162,13 @@ xlog_recover_process_iunlinks(
 		}
 		xfs_buf_rele(agibp);
 	}
+
+	/*
+	 * Now that we've put all the iunlink inodes on the LRU, let's make
+	 * sure that we perform all the on-disk metadata updates to actually
+	 * free those inodes.
+	 */
+	xfs_inactive_force(mp);
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 10be706ec72e..6d629e1379a0 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1066,6 +1066,7 @@ xfs_mountfs(
 	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
 	 * quota inodes.
 	 */
+	xfs_inactive_deactivate(mp);
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp, SYNC_WAIT);
  out_log_dealloc:
@@ -1104,7 +1105,15 @@ xfs_unmountfs(
 	uint64_t		resblks;
 	int			error;
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes.
+	 * Since this can involve finobt updates, do it now before we lose the
+	 * per-AG space reservations.
+	 */
+	xfs_inactive_force(mp);
+
 	xfs_icache_disable_reclaim(mp);
+
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
 	xfs_rtunmount_inodes(mp);
@@ -1153,6 +1162,13 @@ xfs_unmountfs(
 
 	xfs_qm_unmount(mp);
 
+	/*
+	 * Kick off inode inactivation again to push the metadata inodes past
+	 * INACTIVE into RECLAIM.  We also have to deactivate the inactivation
+	 * worker.
+	 */
+	xfs_inactive_deactivate(mp);
+
 	/*
 	 * Unreserve any blocks we have so that when we unmount we don't account
 	 * the reserved free space as used. This is really only necessary for
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index ff1a94b633cf..91391fd43e87 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -153,6 +153,7 @@ typedef struct xfs_mount {
 						     trimming */
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
 						     trimming */
+	struct delayed_work	m_inactive_work; /* background inode inactive */
 	bool			m_update_sb;	/* sb needs update in mount */
 	int64_t			m_low_space[XFS_LOWSP_MAX];
 						/* low free space thresholds */
@@ -167,6 +168,7 @@ typedef struct xfs_mount {
 	struct workqueue_struct	*m_unwritten_workqueue;
 	struct workqueue_struct	*m_cil_workqueue;
 	struct workqueue_struct	*m_reclaim_workqueue;
+	struct workqueue_struct	*m_inactive_workqueue;
 	struct workqueue_struct	*m_log_workqueue;
 	struct workqueue_struct *m_eofblocks_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
@@ -372,7 +374,8 @@ typedef struct xfs_perag {
 
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
-	int		pag_ici_reclaimable;	/* reclaimable inodes */
+	unsigned int	pag_ici_reclaimable;	/* reclaimable inodes */
+	unsigned int	pag_ici_inactive;	/* inactive inodes */
 	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index ba058b8da8a8..d96f146e3fe6 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -47,6 +47,12 @@ xfs_qm_scall_quotaoff(
 	uint			inactivate_flags;
 	xfs_qoff_logitem_t	*qoffstart;
 
+	/*
+	 * Clean up the inactive list before we turn quota off, to reduce the
+	 * amount of quotaoff work we have to do with the mutex held.
+	 */
+	xfs_inactive_force(mp);
+
 	/*
 	 * No file system can have quotas enabled on disk but not in core.
 	 * Note that quota utilities (like quotaoff) _expect_
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 24eb35b8fe5a..aa10df744a2a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -874,8 +874,15 @@ xfs_init_mount_workqueues(
 	if (!mp->m_sync_workqueue)
 		goto out_destroy_eofb;
 
+	mp->m_inactive_workqueue = alloc_workqueue("xfs-inactive/%s",
+			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
+	if (!mp->m_inactive_workqueue)
+		goto out_destroy_sync;
+
 	return 0;
 
+out_destroy_sync:
+	destroy_workqueue(mp->m_sync_workqueue);
 out_destroy_eofb:
 	destroy_workqueue(mp->m_eofblocks_workqueue);
 out_destroy_log:
@@ -898,6 +905,7 @@ STATIC void
 xfs_destroy_mount_workqueues(
 	struct xfs_mount	*mp)
 {
+	destroy_workqueue(mp->m_inactive_workqueue);
 	destroy_workqueue(mp->m_sync_workqueue);
 	destroy_workqueue(mp->m_eofblocks_workqueue);
 	destroy_workqueue(mp->m_log_workqueue);
@@ -970,28 +978,34 @@ xfs_fs_destroy_inode(
 	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	bool			need_inactive;
 
 	trace_xfs_destroy_inode(ip);
 
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-	XFS_STATS_INC(ip->i_mount, vn_rele);
-	XFS_STATS_INC(ip->i_mount, vn_remove);
-
-	xfs_inactive(ip);
-
-	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
+	XFS_STATS_INC(mp, vn_rele);
+	XFS_STATS_INC(mp, vn_remove);
+
+	need_inactive = xfs_inode_needs_inactivation(ip);
+	if (need_inactive) {
+		trace_xfs_inode_set_need_inactive(ip);
+		xfs_inode_inactivation_prep(ip);
+	} else if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
 		xfs_check_delalloc(ip, XFS_COW_FORK);
 		ASSERT(0);
 	}
-
-	XFS_STATS_INC(ip->i_mount, vn_reclaim);
+	XFS_STATS_INC(mp, vn_reclaim);
+	trace_xfs_inode_set_reclaimable(ip);
 
 	/*
 	 * We should never get here with one of the reclaim flags already set.
 	 */
 	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
 	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_NEED_INACTIVE));
+	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_INACTIVATING));
 
 	/*
 	 * We always use background reclaim here because even if the
@@ -1000,7 +1014,7 @@ xfs_fs_destroy_inode(
 	 * this more efficiently than we can here, so simply let background
 	 * reclaim tear down all inodes.
 	 */
-	xfs_inode_set_reclaim_tag(ip);
+	xfs_inode_set_reclaim_tag(ip, need_inactive);
 }
 
 static void
@@ -1252,6 +1266,12 @@ xfs_quiesce_attr(
 	while (atomic_read(&mp->m_active_trans) > 0)
 		delay(100);
 
+	/*
+	 * Perform all on-disk metadata updates required to inactivate inodes.
+	 * This has to be done before we force the log out.
+	 */
+	xfs_inactive_deactivate(mp);
+
 	/* force the log to unpin objects from the now complete transactions */
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
@@ -1425,6 +1445,13 @@ xfs_fs_remount(
 			return error;
 		}
 
+		/*
+		 * Perform all on-disk metadata updates required to inactivate
+		 * inodes.  Since this can involve finobt updates, do it now
+		 * before we lose the per-AG space reservations.
+		 */
+		xfs_inactive_force(mp);
+
 		/* Free the per-AG metadata reservation pool. */
 		error = xfs_fs_unreserve_ag_blocks(mp);
 		if (error) {
@@ -1478,6 +1505,18 @@ xfs_fs_unfreeze(
 	return 0;
 }
 
+/*
+ * Before we get to stage 1 of a freeze, force all the inactivation work so
+ * that there's less work to do if we crash during the freeze.
+ */
+STATIC int
+xfs_fs_freeze_super(
+	struct super_block	*sb)
+{
+	xfs_inactive_force(XFS_M(sb));
+	return freeze_super(sb);
+}
+
 STATIC int
 xfs_fs_show_options(
 	struct seq_file		*m,
@@ -1640,6 +1679,7 @@ xfs_mount_alloc(
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
 	INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
+	INIT_DELAYED_WORK(&mp->m_inactive_work, xfs_inactive_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	return mp;
 }
@@ -1900,6 +1940,7 @@ static const struct super_operations xfs_super_operations = {
 	.show_options		= xfs_fs_show_options,
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
+	.freeze_super		= xfs_fs_freeze_super,
 };
 
 static struct file_system_type xfs_fs_type = {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 5127f91ce3fd..d2e5e6a794b5 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -588,14 +588,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
+		__field(unsigned long, iflags)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
+		__entry->iflags = ip->i_flags;
 	),
-	TP_printk("dev %d:%d ino 0x%llx",
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino)
+		  __entry->ino,
+		  __entry->iflags)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -639,6 +642,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
+DEFINE_INODE_EVENT(xfs_inode_reclaiming);
+DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
+DEFINE_INODE_EVENT(xfs_inode_inactivating);
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the


* [PATCH 09/12] xfs: retry fs writes when there isn't space
  2019-01-01  2:16 [PATCH 00/12] xfs: deferred inode inactivation Darrick J. Wong
                   ` (7 preceding siblings ...)
  2019-01-01  2:17 ` [PATCH 08/12] xfs: deferred inode inactivation Darrick J. Wong
@ 2019-01-01  2:17 ` Darrick J. Wong
  2019-01-01  2:17 ` [PATCH 10/12] xfs: force inactivation before fallocate when space is low Darrick J. Wong
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Any time we try a file write that fails due to ENOSPC or EDQUOT, force
inactivation work to free up some resources and try one more time.
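
(Schematically, each retry site below follows the same one-shot shape --
a condensed sketch rather than the literal diff; xfs_inactive_force() is
the flush helper added earlier in the series, and do_write_or_alloc() is
an illustrative placeholder:)

	bool		retried = false;
	int		error;

retry:
	error = do_write_or_alloc(...);
	if ((error == -ENOSPC || error == -EDQUOT) && !retried) {
		retried = true;
		xfs_inactive_force(mp);	/* finish deferred freeing */
		goto retry;
	}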

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c  |   11 +++++++++++
 fs/xfs/xfs_inode.c |    1 +
 fs/xfs/xfs_iomap.c |   10 +++++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..7ba3b799702d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -652,6 +652,9 @@ xfs_file_buffered_aio_write(
 		if (enospc)
 			goto write_retry;
 		enospc = xfs_inode_free_quota_cowblocks(ip);
+		if (enospc)
+			goto write_retry;
+		enospc = xfs_inactive_free_quota(ip);
 		if (enospc)
 			goto write_retry;
 		iolock = 0;
@@ -665,6 +668,7 @@ xfs_file_buffered_aio_write(
 		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
 		xfs_icache_free_cowblocks(ip->i_mount, &eofb);
+		xfs_inactive_force(ip->i_mount);
 		goto write_retry;
 	}
 
@@ -936,6 +940,7 @@ xfs_file_remap_range(
 	struct xfs_mount	*mp = src->i_mount;
 	loff_t			remapped = 0;
 	xfs_extlen_t		cowextsize;
+	bool			can_retry = true;
 	int			ret;
 
 	if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
@@ -955,8 +960,14 @@ xfs_file_remap_range(
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
 
+retry:
 	ret = xfs_reflink_remap_blocks(src, pos_in, dest, pos_out, len,
 			&remapped);
+	if ((ret == -EDQUOT || ret == -ENOSPC) && can_retry) {
+		can_retry = false;
+		xfs_inactive_force(mp);
+		goto retry;
+	}
 	if (ret)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 26c8181f5760..2c7f8aeffa92 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1181,6 +1181,7 @@ xfs_create(
 	if (error == -ENOSPC) {
 		/* flush outstanding delalloc blocks and retry */
 		xfs_flush_inodes(mp);
+		xfs_inactive_force(mp);
 		error = xfs_trans_alloc(mp, tres, resblks, 0, 0, &tp);
 	}
 	if (error)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 59772a7c7586..302ba859dd32 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -532,12 +532,13 @@ xfs_file_iomap_begin_delay(
 	struct xfs_bmbt_irec	got;
 	struct xfs_iext_cursor	icur;
 	xfs_fsblock_t		prealloc_blocks = 0;
+	bool			flush_inactive = true;
 
 	ASSERT(!XFS_IS_REALTIME_INODE(ip));
 	ASSERT(!xfs_get_extsz_hint(ip));
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-
+start_over:
 	if (unlikely(XFS_TEST_ERROR(
 	    (XFS_IFORK_FORMAT(ip, XFS_DATA_FORK) != XFS_DINODE_FMT_EXTENTS &&
 	     XFS_IFORK_FORMAT(ip, XFS_DATA_FORK) != XFS_DINODE_FMT_BTREE),
@@ -632,6 +633,13 @@ xfs_file_iomap_begin_delay(
 		break;
 	case -ENOSPC:
 	case -EDQUOT:
+		if (flush_inactive) {
+			flush_inactive = false;
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+			xfs_inactive_force(mp);
+			xfs_ilock(ip, XFS_ILOCK_EXCL);
+			goto start_over;
+		}
 		/* retry without any preallocation */
 		trace_xfs_delalloc_enospc(ip, offset, count);
 		if (prealloc_blocks) {


* [PATCH 10/12] xfs: force inactivation before fallocate when space is low
  2019-01-01  2:16 [PATCH 00/12] xfs: deferred inode inactivation Darrick J. Wong
                   ` (8 preceding siblings ...)
  2019-01-01  2:17 ` [PATCH 09/12] xfs: retry fs writes when there isn't space Darrick J. Wong
@ 2019-01-01  2:17 ` Darrick J. Wong
  2019-01-01  2:17 ` [PATCH 11/12] xfs: parallelize inode inactivation Darrick J. Wong
  2019-01-01  2:18 ` [PATCH 12/12] xfs: wait for deferred inactivation when destroying unlinked inodes Darrick J. Wong
  11 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

If we think that inactivation will free enough blocks to make it easier
to satisfy an fallocate request, force inactivation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index aa7f7fe241ec..aa7a678d4874 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -33,6 +33,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 #include "xfs_refcount.h"
+#include "xfs_sb.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -839,6 +840,38 @@ xfs_free_eofblocks(
 	return error;
 }
 
+/*
+ * If we suspect that the target device isn't going to be able to satisfy the
+ * entire request, try forcing inode inactivation to free up space.
+ */
+static void
+xfs_alloc_reclaim_inactive_space(
+	struct xfs_mount	*mp,
+	bool			is_rt,
+	xfs_filblks_t		allocatesize_fsb)
+{
+	struct xfs_perag	*pag;
+	struct xfs_sb		*sbp = &mp->m_sb;
+	xfs_extlen_t		free;
+	xfs_agnumber_t		agno;
+
+	if (is_rt) {
+		if (sbp->sb_frextents * sbp->sb_rextsize >= allocatesize_fsb)
+			return;
+	} else {
+		for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+			pag = xfs_perag_get(mp, agno);
+			free = pag->pagf_freeblks;
+			xfs_perag_put(pag);
+
+			if (free >= allocatesize_fsb)
+				return;
+		}
+	}
+
+	xfs_inactive_force(mp);
+}
+
 int
 xfs_alloc_file_space(
 	struct xfs_inode	*ip,
@@ -925,6 +958,8 @@ xfs_alloc_file_space(
 			quota_flag = XFS_QMOPT_RES_REGBLKS;
 		}
 
+		xfs_alloc_reclaim_inactive_space(mp, rt, allocatesize_fsb);
+
 		/*
 		 * Allocate and setup the transaction.
 		 */


* [PATCH 11/12] xfs: parallelize inode inactivation
  2019-01-01  2:16 [PATCH 00/12] xfs: deferred inode inactivation Darrick J. Wong
                   ` (9 preceding siblings ...)
  2019-01-01  2:17 ` [PATCH 10/12] xfs: force inactivation before fallocate when space is low Darrick J. Wong
@ 2019-01-01  2:17 ` Darrick J. Wong
  2019-01-01  2:18 ` [PATCH 12/12] xfs: wait for deferred inactivation when destroying unlinked inodes Darrick J. Wong
  11 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-01  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Split the inode inactivation work into per-AG work items so that we can
take advantage of parallelization.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c |  108 ++++++++++++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_mount.c  |    3 +
 fs/xfs/xfs_mount.h  |    4 +-
 fs/xfs/xfs_super.c  |    3 -
 4 files changed, 95 insertions(+), 23 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2386a2f3e1d0..e1210beb9d0b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -228,12 +228,12 @@ xfs_reclaim_work_queue(
 /* Queue a new inode inactivation pass if there are reclaimable inodes. */
 static void
 xfs_inactive_work_queue(
-	struct xfs_mount        *mp)
+	struct xfs_perag	*pag)
 {
 	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG))
-		queue_delayed_work(mp->m_inactive_workqueue,
-				&mp->m_inactive_work,
+	if (pag->pag_ici_inactive)
+		queue_delayed_work(pag->pag_mount->m_inactive_workqueue,
+				&pag->pag_inactive_work,
 				msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
 	rcu_read_unlock();
 }
@@ -316,7 +316,7 @@ xfs_perag_set_inactive_tag(
 	 * idea of when it ought to force inactivation, and in the mean time
 	 * we prefer batching.
 	 */
-	xfs_inactive_work_queue(mp);
+	xfs_inactive_work_queue(pag);
 
 	trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
 }
@@ -1693,6 +1693,37 @@ xfs_inactive_inode(
 	return 0;
 }
 
+/*
+ * Inactivate the inodes in an AG. Even if the filesystem is corrupted, we
+ * still need to clear the INACTIVE iflag so that we can move on to reclaiming
+ * the inode.
+ */
+int
+xfs_inactive_inodes_ag(
+	struct xfs_perag	*pag,
+	struct xfs_eofblocks	*eofb)
+{
+	int			nr_to_scan = INT_MAX;
+	bool			done = false;
+
+	return xfs_walk_ag_reclaim_inos(pag, eofb, 0, xfs_inactive_inode_grab,
+			xfs_inactive_inode, &nr_to_scan, &done);
+}
+
+/* Does this pag have inactive inodes? */
+static inline bool
+xfs_pag_has_inactive(
+	struct xfs_perag	*pag)
+{
+	unsigned int		inactive;
+
+	spin_lock(&pag->pag_ici_lock);
+	inactive = pag->pag_ici_inactive;
+	spin_unlock(&pag->pag_ici_lock);
+
+	return inactive > 0;
+}
+
 /*
 * Walk the AGs and inactivate the inodes in them. Even if the filesystem is
  * corrupted, we still need to clear the INACTIVE iflag so that we can move
@@ -1722,15 +1753,12 @@ xfs_inactive_inodes(
 
 	agno = 0;
 	while ((pag = xfs_perag_get_tag(mp, agno, XFS_ICI_RECLAIM_TAG))) {
-		int		nr_to_scan = INT_MAX;
-		bool		done = false;
-
 		agno = pag->pag_agno + 1;
-		error = xfs_walk_ag_reclaim_inos(pag, eofb, 0,
-				xfs_inactive_inode_grab, xfs_inactive_inode,
-				&nr_to_scan, &done);
-		if (error && last_error != -EFSCORRUPTED)
-			last_error = error;
+		if (xfs_pag_has_inactive(pag)) {
+			error = xfs_inactive_inodes_ag(pag, eofb);
+			if (error && last_error != -EFSCORRUPTED)
+				last_error = error;
+		}
 		xfs_perag_put(pag);
 	}
 
@@ -1743,14 +1771,29 @@ void
 xfs_inactive_worker(
 	struct work_struct	*work)
 {
-	struct xfs_mount	*mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_inactive_work);
+	struct xfs_perag	*pag = container_of(to_delayed_work(work),
+					struct xfs_perag, pag_inactive_work);
+	struct xfs_mount	*mp = pag->pag_mount;
 	int			error;
 
-	error = xfs_inactive_inodes(mp, NULL);
+	/*
+	 * We want to skip inode inactivation while the filesystem is frozen
+	 * because we don't want the inactivation thread to block while taking
+	 * sb_intwrite.  Therefore, we try to take sb_write for the duration
+	 * of the inactive scan -- a freeze attempt will block until we're
+	 * done here, and if the fs is past stage 1 freeze we'll bounce out
+	 * until things unfreeze.  If the fs goes down while frozen we'll
+	 * still have log recovery to clean up after us.
+	 */
+	if (!sb_start_write_trylock(mp->m_super))
+		return;
+
+	error = xfs_inactive_inodes_ag(pag, NULL);
 	if (error && error != -EAGAIN)
 		xfs_err(mp, "inode inactivation failed, error %d", error);
-	xfs_inactive_work_queue(mp);
+
+	sb_end_write(mp->m_super);
+	xfs_inactive_work_queue(pag);
 }
 
 /* Flush all inode inactivation work that might be queued. */
@@ -1758,8 +1801,25 @@ void
 xfs_inactive_force(
 	struct xfs_mount	*mp)
 {
-	queue_delayed_work(mp->m_inactive_workqueue, &mp->m_inactive_work, 0);
-	flush_delayed_work(&mp->m_inactive_work);
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	agno = 0;
+	while ((pag = xfs_perag_get_tag(mp, agno, XFS_ICI_RECLAIM_TAG))) {
+		agno = pag->pag_agno + 1;
+		if (xfs_pag_has_inactive(pag))
+			queue_delayed_work(mp->m_inactive_workqueue,
+					&pag->pag_inactive_work, 0);
+		xfs_perag_put(pag);
+	}
+
+	agno = 0;
+	while ((pag = xfs_perag_get_tag(mp, agno, XFS_ICI_RECLAIM_TAG))) {
+		agno = pag->pag_agno + 1;
+		if (xfs_pag_has_inactive(pag))
+			flush_delayed_work(&pag->pag_inactive_work);
+		xfs_perag_put(pag);
+	}
 }
 
 /*
@@ -1770,7 +1830,15 @@ void
 xfs_inactive_deactivate(
 	struct xfs_mount	*mp)
 {
-	cancel_delayed_work_sync(&mp->m_inactive_work);
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno = 0;
+
+	while ((pag = xfs_perag_get_tag(mp, agno, XFS_ICI_RECLAIM_TAG))) {
+		agno = pag->pag_agno + 1;
+		cancel_delayed_work_sync(&pag->pag_inactive_work);
+		xfs_perag_put(pag);
+	}
+
 	flush_workqueue(mp->m_inactive_workqueue);
 	xfs_inactive_inodes(mp, NULL);
 }
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 6d629e1379a0..0bcab017b12b 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -129,6 +129,7 @@ __xfs_free_perag(
 {
 	struct xfs_perag *pag = container_of(head, struct xfs_perag, rcu_head);
 
+	ASSERT(!delayed_work_pending(&pag->pag_inactive_work));
 	ASSERT(atomic_read(&pag->pag_ref) == 0);
 	kmem_free(pag);
 }
@@ -149,6 +150,7 @@ xfs_free_perag(
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
+		cancel_delayed_work_sync(&pag->pag_inactive_work);
 		xfs_buf_hash_destroy(pag);
 		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
@@ -203,6 +205,7 @@ xfs_initialize_perag(
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
 		mutex_init(&pag->pag_ici_reclaim_lock);
+		INIT_DELAYED_WORK(&pag->pag_inactive_work, xfs_inactive_worker);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 91391fd43e87..1096ea61a427 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -153,7 +153,6 @@ typedef struct xfs_mount {
 						     trimming */
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
 						     trimming */
-	struct delayed_work	m_inactive_work; /* background inode inactive */
 	bool			m_update_sb;	/* sb needs update in mount */
 	int64_t			m_low_space[XFS_LOWSP_MAX];
 						/* low free space thresholds */
@@ -392,6 +391,9 @@ typedef struct xfs_perag {
 	/* Blocks reserved for the reverse mapping btree. */
 	struct xfs_ag_resv	pag_rmapbt_resv;
 
+	/* background inode inactivation */
+	struct delayed_work	pag_inactive_work;
+
 	/* reference count */
 	uint8_t			pagf_refcount_level;
 } xfs_perag_t;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index aa10df744a2a..b7f37a87f187 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -875,7 +875,7 @@ xfs_init_mount_workqueues(
 		goto out_destroy_eofb;
 
 	mp->m_inactive_workqueue = alloc_workqueue("xfs-inactive/%s",
-			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
+			WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
 	if (!mp->m_inactive_workqueue)
 		goto out_destroy_sync;
 
@@ -1679,7 +1679,6 @@ xfs_mount_alloc(
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
 	INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
-	INIT_DELAYED_WORK(&mp->m_inactive_work, xfs_inactive_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	return mp;
 }


* [PATCH 12/12] xfs: wait for deferred inactivation when destroying unlinked inodes
  2019-01-01  2:16 [PATCH 00/12] xfs: deferred inode inactivation Darrick J. Wong
                   ` (10 preceding siblings ...)
  2019-01-01  2:17 ` [PATCH 11/12] xfs: parallelize inode inactivation Darrick J. Wong
@ 2019-01-01  2:18 ` Darrick J. Wong
  2019-01-03 12:46   ` Dave Chinner
  11 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-01  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we've constructed a mechanism to batch background inode
inactivation work, in some cases we actually want to throttle the amount
of backlog work that the frontend can generate.  We do this by making
destroy_inode wait for inactivation when we're deleting things, assuming
that deleted inodes are dropped and destroyed in process context and not
from fs reclaim.
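
(Condensed, the hook sits at the end of destroy_inode -- this is just a
sketch of the control flow in the diff below, not a separate mechanism:)

	/* xfs_fs_destroy_inode(), abridged: */
	enum xfs_iwait	caniwait = XFS_IWAIT_NEVER;

	need_inactive = xfs_inode_needs_inactivation(ip);
	if (need_inactive) {
		xfs_inode_inactivation_prep(ip);
		/* decide before flagging; the inode is untouchable after */
		caniwait = xfs_iwait_check(ip);
	}
	xfs_inode_set_reclaim_tag(ip, need_inactive);
	/* possibly wait here for the background worker to catch up */
	xfs_inactive_wait(mp, caniwait, agno);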

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c |  155 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.h |   11 ++++
 fs/xfs/xfs_super.c  |   12 ++++
 fs/xfs/xfs_trace.h  |    2 +
 4 files changed, 180 insertions(+)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e1210beb9d0b..064c5de9dce3 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1822,6 +1822,23 @@ xfs_inactive_force(
 	}
 }
 
+/* Flush all inode inactivation work that might be queued for this AG. */
+static void
+xfs_inactive_force_ag(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_perag	*pag;
+
+	pag = xfs_perag_get(mp, agno);
+	if (xfs_pag_has_inactive(pag)) {
+		queue_delayed_work(mp->m_inactive_workqueue,
+				&pag->pag_inactive_work, 0);
+		flush_delayed_work(&pag->pag_inactive_work);
+	}
+	xfs_perag_put(pag);
+}
+
 /*
  * Flush all inode inactivation work that might be queued and make sure the
  * delayed work item is not queued.
@@ -1843,6 +1860,144 @@ xfs_inactive_deactivate(
 	xfs_inactive_inodes(mp, NULL);
 }
 
+/*
+ * Decide if this inode is a candidate for unlinked inactivation throttling.
+ * We have to decide this prior to setting the NEED_INACTIVE iflag because
+ * once we flag the inode for inactivation we can't access it any more.
+ */
+enum xfs_iwait
+xfs_iwait_check(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	unsigned long long	x;
+	unsigned long long	y;
+	bool			rt = XFS_IS_REALTIME_INODE(ip);
+
+	/*
+	 * Don't wait unless we're doing a deletion inactivation.  We assume
+	 * that unlinked inodes that lose all their refcount are dropped,
+	 * evicted, and destroyed immediately in the context of the unlink()ing
+	 * process.
+	 */
+	if (VFS_I(ip)->i_nlink > 0)
+		return XFS_IWAIT_NEVER;
+
+	/*
+	 * If we're being called from kswapd we're in background memory reclaim
+	 * context.  There's no point in making it wait for ondisk metadata
+	 * updates, which themselves require memory allocations.
+	 */
+	if (current->flags & PF_KSWAPD)
+		return XFS_IWAIT_NEVER;
+
+	/*
+	 * Always wait for directory removal so we clean up any files that
+	 * were in that directory.
+	 */
+	if (S_ISDIR(VFS_I(ip)->i_mode)) {
+		trace_xfs_inactive_iwait_all(ip);
+		return XFS_IWAIT_ALL;
+	}
+
+	/* Heavily fragmented files take a while to delete. */
+	x = XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK) +
+	    XFS_IFORK_NEXTENTS(ip, XFS_ATTR_FORK) +
+	    XFS_IFORK_NEXTENTS(ip, XFS_COW_FORK);
+	y = rt ? 256 : 32 * mp->m_sb.sb_agcount;
+	if (x >= y) {
+		trace_xfs_inactive_iwait_inode(ip);
+		return XFS_IWAIT_INODE;
+	}
+
+	return XFS_IWAIT_UNDECIDED;
+}
+
+/*
+ * Wait for deferred inode inactivation of an unlinked inode being destroyed.
+ *
+ * The deferred inode inactivation mechanism provides for background batching
+ * of whatever on-disk metadata updates are necessary to free an inode and all
+ * the resources it holds.  In theory this should speed up deletion by enabling
+ * us to inactivate in inode number order.
+ *
+ * However, there are a few situations where we actually /want/ to throttle
+ * unlinking.  Specifically, if we're unlinking fragmented files or removing
+ * entire directory trees, we should wait instead of allowing an enormous
+ * processing backlog that causes update storms later.
+ *
+ * We will wait for inactivation to finish under the following circumstances:
+ *  - Removing a directory
+ *  - Removing a heavily fragmented file
+ *  - A large number of blocks could be freed by inactivation
+ *  - A large number of inodes could be freed by inactivation
+ */
+void
+xfs_inactive_wait(
+	struct xfs_mount	*mp,
+	enum xfs_iwait		iwait,
+	xfs_agnumber_t		agno)
+{
+	unsigned long long	x;
+	unsigned long long	y;
+
+	switch (iwait) {
+	case XFS_IWAIT_NEVER:
+		return;
+	case XFS_IWAIT_ALL:
+	case XFS_IWAIT_INODE:
+		goto wait;
+	default:
+		break;
+	}
+
+	iwait = XFS_IWAIT_ALL;
+
+	/* More than 1/4 of an AG space could be freed by inactivation. */
+	x = percpu_counter_read_positive(&mp->m_dinactive);
+	y = mp->m_sb.sb_agblocks / 4;
+	if (x >= y)
+		goto wait;
+
+	/* Less than 1/16 of the datadev is free. */
+	x = percpu_counter_read_positive(&mp->m_fdblocks);
+	y = mp->m_sb.sb_dblocks / 16;
+	if (x <= y)
+		goto wait;
+
+	/* More than 1/4 of the rtdev could be freed by inactivation. */
+	y = mp->m_sb.sb_rblocks;
+	if (y > 0) {
+		x = percpu_counter_read_positive(&mp->m_rinactive);
+		if (x >= y / 4)
+			goto wait;
+
+		/* Less than 1/16 of the rtdev is free. */
+		x = mp->m_sb.sb_frextents * mp->m_sb.sb_rextsize;
+		if (x <= y / 16)
+			goto wait;
+	}
+
+	/* A lot of inodes could be freed by inactivation. */
+	x = percpu_counter_read_positive(&mp->m_iinactive);
+	y = XFS_INODES_PER_CHUNK * 4 * (unsigned long long)mp->m_sb.sb_agcount;
+	if (x >= y)
+		goto wait;
+
+	return;
+wait:
+	switch (iwait) {
+	case XFS_IWAIT_ALL:
+		xfs_inactive_force(mp);
+		break;
+	case XFS_IWAIT_INODE:
+		xfs_inactive_force_ag(mp, agno);
+		break;
+	default:
+		ASSERT(0);
+	}
+}
+
 STATIC int
 xfs_inode_free_eofblocks(
 	struct xfs_inode	*ip,
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index fd4073debd6e..f9c917700ea5 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -128,4 +128,15 @@ void xfs_inactive_force(struct xfs_mount *mp);
 void xfs_inactive_deactivate(struct xfs_mount *mp);
 int xfs_inactive_free_quota(struct xfs_inode *ip);
 
+enum xfs_iwait {
+	XFS_IWAIT_NEVER = -1,
+	XFS_IWAIT_UNDECIDED,
+	XFS_IWAIT_ALL,
+	XFS_IWAIT_INODE,
+};
+
+enum xfs_iwait xfs_iwait_check(struct xfs_inode *ip);
+void xfs_inactive_wait(struct xfs_mount *mp, enum xfs_iwait decision,
+		       xfs_agnumber_t agno);
+
 #endif
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b7f37a87f187..1141413c53c0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -979,6 +979,8 @@ xfs_fs_destroy_inode(
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	enum xfs_iwait		caniwait = XFS_IWAIT_NEVER;
 	bool			need_inactive;
 
 	trace_xfs_destroy_inode(ip);
@@ -991,6 +993,7 @@ xfs_fs_destroy_inode(
 	if (need_inactive) {
 		trace_xfs_inode_set_need_inactive(ip);
 		xfs_inode_inactivation_prep(ip);
+		caniwait = xfs_iwait_check(ip);
 	} else if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
 		xfs_check_delalloc(ip, XFS_COW_FORK);
@@ -1015,6 +1018,15 @@ xfs_fs_destroy_inode(
 	 * reclaim tear down all inodes.
 	 */
 	xfs_inode_set_reclaim_tag(ip, need_inactive);
+
+	/*
+	 * Wait for inactivation of this inode if the inode has zero nlink.
+	 * This cannot be done in fs reclaim context, which means we assume
+	 * that unlinked inodes that lose all their refcount are dropped,
+	 * evicted, and destroyed immediately in the context of the unlink()ing
+	 * process and are never fed to the LRU for reclaim.
+	 */
+	xfs_inactive_wait(mp, caniwait, agno);
 }
 
 static void
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d2e5e6a794b5..02683ec06164 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -646,6 +646,8 @@ DEFINE_INODE_EVENT(xfs_inode_set_reclaimable);
 DEFINE_INODE_EVENT(xfs_inode_reclaiming);
 DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
 DEFINE_INODE_EVENT(xfs_inode_inactivating);
+DEFINE_INODE_EVENT(xfs_inactive_iwait_all);
+DEFINE_INODE_EVENT(xfs_inactive_iwait_inode);
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the


* Re: [PATCH 07/12] xfs: refactor eofblocks inode match code
  2019-01-01  2:17 ` [PATCH 07/12] xfs: refactor eofblocks inode match code Darrick J. Wong
@ 2019-01-02  9:50   ` Nikolay Borisov
  2019-01-17 18:05     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Nikolay Borisov @ 2019-01-02  9:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs



On 1.01.19 at 4:17, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor the code that determines if an inode matches an eofblocks
> structure into a helper, since we already use it twice and we're about
> to use it a third time.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_icache.c |  146 ++++++++++++++++++++++++++-------------------------
>  1 file changed, 74 insertions(+), 72 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 7e031eb6f048..c1b457ba1b7b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -27,6 +27,76 @@
>  #include <linux/freezer.h>
>  #include <linux/iversion.h>
>  
> +STATIC int
> +xfs_inode_match_id(

This is a predicate so make it bool

> +	struct xfs_inode	*ip,
> +	struct xfs_eofblocks	*eofb)
> +{
> +	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
> +	    !uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
> +		return 0;
> +
> +	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
> +	    !gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
> +		return 0;
> +
> +	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
> +	    xfs_get_projid(ip) != eofb->eof_prid)
> +		return 0;
> +
> +	return 1;
> +}
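
(i.e. something like this sketch -- same logic, just bool returns:)

STATIC bool
xfs_inode_match_id(
	struct xfs_inode	*ip,
	struct xfs_eofblocks	*eofb)
{
	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
	    !uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
		return false;

	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
	    !gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
		return false;

	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
	    xfs_get_projid(ip) != eofb->eof_prid)
		return false;

	return true;
}
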
> +
> +/*
> + * A union-based inode filtering algorithm. Process the inode if any of the
> + * criteria match. This is for global/internal scans only.
> + */
> +STATIC int

ditto

> +xfs_inode_match_id_union(
> +	struct xfs_inode	*ip,
> +	struct xfs_eofblocks	*eofb)
> +{
> +	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
> +	    uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
> +		return 1;
> +
> +	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
> +	    gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
> +		return 1;
> +
> +	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
> +	    xfs_get_projid(ip) == eofb->eof_prid)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +/* Does this inode match the given parameters? */
> +STATIC bool
> +xfs_inode_matches_eofb(
> +	struct xfs_inode	*ip,
> +	struct xfs_eofblocks	*eofb)
> +{
> +	int			match;
> +
> +	if (!eofb)
> +		return true;
> +
> +	if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
> +		match = xfs_inode_match_id_union(ip, eofb);
> +	else
> +		match = xfs_inode_match_id(ip, eofb);
> +	if (!match)
> +		return false;
> +
> +	/* skip the inode if the file size is too small */
> +	if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
> +	    XFS_ISIZE(ip) < eofb->eof_min_file_size)
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * Allocate and initialise an xfs_inode.
>   */
> @@ -1424,50 +1494,6 @@ xfs_reclaim_inodes_count(
>  	return reclaimable;
>  }
>  
> -STATIC int
> -xfs_inode_match_id(
> -	struct xfs_inode	*ip,
> -	struct xfs_eofblocks	*eofb)
> -{
> -	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
> -	    !uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
> -		return 0;
> -
> -	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
> -	    !gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
> -		return 0;
> -
> -	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
> -	    xfs_get_projid(ip) != eofb->eof_prid)
> -		return 0;
> -
> -	return 1;
> -}
> -
> -/*
> - * A union-based inode filtering algorithm. Process the inode if any of the
> - * criteria match. This is for global/internal scans only.
> - */
> -STATIC int
> -xfs_inode_match_id_union(
> -	struct xfs_inode	*ip,
> -	struct xfs_eofblocks	*eofb)
> -{
> -	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
> -	    uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
> -		return 1;
> -
> -	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
> -	    gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
> -		return 1;
> -
> -	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
> -	    xfs_get_projid(ip) == eofb->eof_prid)
> -		return 1;
> -
> -	return 0;
> -}
> -
>  STATIC int
>  xfs_inode_free_eofblocks(
>  	struct xfs_inode	*ip,
> @@ -1476,7 +1502,6 @@ xfs_inode_free_eofblocks(
>  {
>  	int ret = 0;
>  	struct xfs_eofblocks *eofb = args;
> -	int match;
>  
>  	if (!xfs_can_free_eofblocks(ip, false)) {
>  		/* inode could be preallocated or append-only */
> @@ -1493,19 +1518,8 @@ xfs_inode_free_eofblocks(
>  	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY))
>  		return 0;
>  
> -	if (eofb) {
> -		if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
> -			match = xfs_inode_match_id_union(ip, eofb);
> -		else
> -			match = xfs_inode_match_id(ip, eofb);
> -		if (!match)
> -			return 0;
> -
> -		/* skip the inode if the file size is too small */
> -		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
> -		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
> -			return 0;
> -	}
> +	if (!xfs_inode_matches_eofb(ip, eofb))
> +		return 0;
>  
>  	/*
>  	 * If the caller is waiting, return -EAGAIN to keep the background
> @@ -1766,25 +1780,13 @@ xfs_inode_free_cowblocks(
>  {
>  	struct xfs_eofblocks	*eofb = args;
>  	uint			lock_mode = 0;
> -	int			match;
>  	int			ret = 0;
>  
>  	if (!xfs_prep_free_cowblocks(ip))
>  		return 0;
>  
> -	if (eofb) {
> -		if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
> -			match = xfs_inode_match_id_union(ip, eofb);
> -		else
> -			match = xfs_inode_match_id(ip, eofb);
> -		if (!match)
> -			return 0;
> -
> -		/* skip the inode if the file size is too small */
> -		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
> -		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
> -			return 0;
> -	}
> +	if (!xfs_inode_matches_eofb(ip, eofb))
> +		return 0;
>  
>  	/*
>  	 * Free the CoW blocks.  We don't need to lock the inode if we're in
> 
> 


* Re: [PATCH 12/12] xfs: wait for deferred inactivation when destroying unlinked inodes
  2019-01-01  2:18 ` [PATCH 12/12] xfs: wait for deferred inactivation when destroying unlinked inodes Darrick J. Wong
@ 2019-01-03 12:46   ` Dave Chinner
  2019-01-17 18:41     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2019-01-03 12:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Dec 31, 2018 at 06:18:04PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Now that we've constructed a mechanism to batch background inode
> inactivation work, we actually want in some cases to throttle the amount
> of backlog work that the frontend can generate.  We do this by making
> destroy_inode wait for inactivation when we're deleting things, assuming
> that deleted inodes are dropped and destroyed in process context and not
> from fs reclaim.

This would kill performance of highly concurrent unlink
workloads.

That said, the current unlink code needs severe throttling because
the unlinked inode list management does not scale at all well - get
more than a a couple of hundred inodes into the AGI unlinked lists,
and xfs_iunlink_remove burns huge amounts of CPU.

So if it isn't throttled, it'll be just as slow, but burn huge
amounts of extra CPU walking the unlinked lists.....
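
(For reference, removing an inode from the middle of an on-disk bucket
means walking the chain one inode buffer at a time -- roughly this
shape; read_unlinked_dinode() here is an illustrative stand-in for the
xfs_imap()/xfs_imap_to_bp() pair, and every hop costs a buffer lookup:)

	next = be32_to_cpu(agi->agi_unlinked[bucket_index]);
	while (next != agino) {
		error = read_unlinked_dinode(mp, tp, next, &dip, &ibp);
		if (error)
			return error;
		prev = next;
		next = be32_to_cpu(dip->di_next_unlinked);
		xfs_trans_brelse(tp, ibp);
	}
	/* ...then re-read prev's buffer to repoint di_next_unlinked. */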

> +/*
> + * Decide if this inode is a candidate for unlinked inactivation throttling.
> + * We have to decide this prior to setting the NEED_INACTIVE iflag because
> + * once we flag the inode for inactivation we can't access it any more.
> + */
> +enum xfs_iwait
> +xfs_iwait_check(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	unsigned long long	x;
> +	unsigned long long	y;
> +	bool			rt = XFS_IS_REALTIME_INODE(ip);
> +
> +	/*
> +	 * Don't wait unless we're doing a deletion inactivation.  We assume
> +	 * that unlinked inodes that lose all their refcount are dropped,
> +	 * evicted, and destroyed immediately in the context of the unlink()ing
> +	 * process.
> +	 */

I think this is wrong - we want to push unlinked inode processing
into the background so we don't have to wait on it, not force
inactivation of unlinked inodes to wait for other inodes to be
inactivated.

> +	if (VFS_I(ip)->i_nlink > 0)
> +		return XFS_IWAIT_NEVER;
> +
> +	/*
> +	 * If we're being called from kswapd we're in background memory reclaim
> +	 * context.  There's no point in making it wait for ondisk metadata
> +	 * updates, which themselves require memory allocations.
> +	 */
> +	if (current->flags & PF_KSWAPD)
> +		return XFS_IWAIT_NEVER;
> +
> +	/*
> +	 * Always wait for directory removal so we clean up any files that
> +	 * were in that directory.
> +	 */
> +	if (S_ISDIR(VFS_I(ip)->i_mode)) {
> +		trace_xfs_inactive_iwait_all(ip);
> +		return XFS_IWAIT_ALL;
> +	}

Why do we need to wait for directories? It's already unlinked, so if
we crash, it and everything in it are not going to reappear....

> +	/* Heavily fragmented files take a while to delete. */
> +	x = XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK) +
> +	    XFS_IFORK_NEXTENTS(ip, XFS_ATTR_FORK) +
> +	    XFS_IFORK_NEXTENTS(ip, XFS_COW_FORK);
> +	y = rt ? 256 : 32 * mp->m_sb.sb_agcount;
> +	if (x >= y) {
> +		trace_xfs_inactive_iwait_inode(ip);
> +		return XFS_IWAIT_INODE;
> +	}

Why does the number of extents in the current inode determine
whether we wait for completion of all other inode inactivations in
the same AG?

> +
> +	return XFS_IWAIT_UNDECIDED;
> +}
> +
> +/*
> + * Wait for deferred inode inactivation of an unlinked inode being destroyed.
> + *
> + * The deferred inode inactivation mechanism provides for background batching
> + * of whatever on-disk metadata updates are necessary to free an inode and all
> + * the resources it holds.  In theory this should speed up deletion by enabling
> + * us to inactivate in inode number order.
> + *
> + * However, there are a few situations where we actually /want/ to throttle
> + * unlinking.  Specifically, if we're unlinking fragmented files or removing
> + * entire directory trees, we should wait instead of allowing an enormous
> + * processing backlog that causes update storms later.
> + *
> + * We will wait for inactivation to finish under the following circumstances:
> + *  - Removing a directory
> + *  - Removing a heavily fragmented file
> + *  - A large number of blocks could be freed by inactivation
> + *  - A large number of inodes could be freed by inactivation
> + */
> +void
> +xfs_inactive_wait(
> +	struct xfs_mount	*mp,
> +	enum xfs_iwait		iwait,
> +	xfs_agnumber_t		agno)
> +{
> +	unsigned long long	x;
> +	unsigned long long	y;
> +
> +	switch (iwait) {
> +	case XFS_IWAIT_NEVER:
> +		return;
> +	case XFS_IWAIT_ALL:
> +	case XFS_IWAIT_INODE:
> +		goto wait;
> +	default:
> +		break;
> +	}
> +
> +	iwait = XFS_IWAIT_ALL;
> +
> +	/* More than 1/4 of an AG space could be freed by inactivation. */
> +	x = percpu_counter_read_positive(&mp->m_dinactive);
> +	y = mp->m_sb.sb_agblocks / 4;
> +	if (x >= y)
> +		goto wait;

But that's a global counter - how does that relate to individual AG
space at all?

> +
> +	/* Less than 1/16 of the datadev is free. */
> +	x = percpu_counter_read_positive(&mp->m_fdblocks);
> +	y = mp->m_sb.sb_dblocks / 16;
> +	if (x <= y)
> +		goto wait;

urk. If I have a 100TB filesystem, that means we always throttle with 6-7TB
of free space available. That's not right.

> +	/* More than 1/4 of the rtdev could be freed by inactivation. */
> +	y = mp->m_sb.sb_rblocks;
> +	if (y > 0) {
> +		x = percpu_counter_read_positive(&mp->m_rinactive);
> +		if (x >= y / 4)
> +			goto wait;

That's even worse - this might not even be a rt inode... :/

> +
> +		/* Less than 1/16 of the rtdev is free. */
> +		x = mp->m_sb.sb_frextents * mp->m_sb.sb_rextsize;
> +		if (x <= y / 16)
> +			goto wait;
> +	}
> +
> +	/* A lot of inodes could be freed by inactivation. */
> +	x = percpu_counter_read_positive(&mp->m_iinactive);
> +	y = XFS_INODES_PER_CHUNK * 4 * (unsigned long long)mp->m_sb.sb_agcount;
> +	if (x >= y)
> +		goto wait;

And, yeah, that's just wrong once we fix the unlinked inode
scalability problem, as the patch below does.


> +	return;
> +wait:
> +	switch (iwait) {
> +	case XFS_IWAIT_ALL:
> +		xfs_inactive_force(mp);
> +		break;
> +	case XFS_IWAIT_INODE:
> +		xfs_inactive_force_ag(mp, agno);
> +		break;
> +	default:
> +		ASSERT(0);

So we assert here if we get XFS_IWAIT_UNDECIDED?

>  	 * reclaim tear down all inodes.
>  	 */
>  	xfs_inode_set_reclaim_tag(ip, need_inactive);
> +
> +	/*
> +	 * Wait for inactivation of this inode if the inode has zero nlink.
> +	 * This cannot be done in fs reclaim context, which means we assume
> +	 * that unlinked inodes that lose all their refcount are dropped,
> +	 * evicted, and destroyed immediately in the context of the unlink()ing
> +	 * process and are never fed to the LRU for reclaim.
> +	 */

Apparently overlay breaks this assumption - AFAIA it can drop the last
inode reference on unlinked inodes from its dcache shrinker.

> +	xfs_inactive_wait(mp, caniwait, agno);

So, prototype incore unlinked inode list patch is below. It needs
log recovery help - the incore list needs to be rebuilt in log
recovery so xfs_iunlink_remove works correctly. It also
short-circuits some of the above "should wait for inactivation"
checks and so some of the performance regressions are mostly
removed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: incore unlinked inode list

From: Dave Chinner <dchinner@redhat.com>

Introduce an incore per-ag unlinked inode list to avoid needing to
walk the buffer based unlinked lists to find the inodes to modify
when removing an inode from the unlinked list.  When the unlinked
lists grow long, finding an inode on the unlinked list requires
mapping the inode number to a buffer, reading the buffer, extracting
the inode from the buffer and reading the next inode number on the
list. This is a lot of overhead, especially the buffer lookups. If
we keep an in-core copy of the unlinked list, we don't need to do
the buffer traversal to find the previous buffer in the list for
unlinking; it's just the previous inode in the incore list.

The incore list is rooted in the xfs_perag, and mirrors the same
hash structure in the AGI. The order of the lists is identical to
the ordering on disk. Hence we can look up the incore list and then
find the buffer we need to modify to remove inodes from the unlinked
list.
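
(In struct terms, roughly -- these are the fields the patch adds to
struct xfs_perag, plus a list_head in struct xfs_inode (name not shown
here) so each inode can sit on its bucket's incore list:)

	/* incore mirror of the AGI unlinked hash buckets */
	struct list_head	pagi_unlinked_list[XFS_AGI_UNLINKED_BUCKETS];
	unsigned int		pagi_unlinked_count;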

Discussion: in this implementation, the incore list nests inside the
AGI buffer locks, so the AGI buffer lock effectively serialises all
unlinked list operations on the AGI whether it is needed or not. If
we change the order of the locking and use the incore list as the
canonical source of unlinked inodes, then we only need to get the
AGI lock when we are modifying the AGI, not all modifications to the
unlinked list. This could allow some level of parallelisation of
unlinked list operations even within the same AG....


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c |   9 ++
 fs/xfs/xfs_inode.c  | 318 +++++++++++++++++++++++++---------------------------
 fs/xfs/xfs_inode.h  |   2 +
 fs/xfs/xfs_mount.c  |   7 ++
 fs/xfs/xfs_mount.h  |   5 +
 5 files changed, 177 insertions(+), 164 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a34f6101526b..5a28173b5ecd 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1952,6 +1952,7 @@ xfs_inactive_wait(
 	enum xfs_iwait		iwait,
 	xfs_agnumber_t		agno)
 {
+	struct xfs_perag	*pag;
 	unsigned long long	x;
 	unsigned long long	y;
 
@@ -1993,10 +1994,18 @@ xfs_inactive_wait(
 	}
 
 	/* A lot of inodes could be freed by inactivation. */
+#if 0
 	x = percpu_counter_read_positive(&mp->m_iinactive);
 	y = XFS_INODES_PER_CHUNK * 4 * (unsigned long long)mp->m_sb.sb_agcount;
 	if (x >= y)
 		goto wait;
+	pag = xfs_perag_get(mp, agno);
+	if (pag->pagi_unlinked_count > 10000) {
+		xfs_perag_put(pag);
+		goto wait;
+	}
+	xfs_perag_put(pag);
+#endif
 
 	return;
 wait:
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2c7f8aeffa92..a4b6f3e597f8 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2053,6 +2053,48 @@ xfs_inactive(
 	xfs_qm_dqdetach(ip);
 }
 
+/*
+ * Modify the next unlinked inode pointer on disk.  Returns the old pointer
+ * value via @old_agino and zero on success, or a negative error.
+ *
+ * XXX: really want a new logical inode transaction flag for this so we don't
+ * ever need to touch buffers again here. That will change locking requirements,
+ * though.
+ */
+static int
+xfs_iunlink_mod(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	xfs_agino_t		next_agino,
+	xfs_agino_t		*old_agino)
+{
+	struct xfs_mount *mp = tp->t_mountp;
+	struct xfs_dinode *dip;
+	struct xfs_buf	*ibp;
+	int		offset;
+	int		error;
+
+	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0, 0);
+	if (error) {
+		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
+			__func__, error);
+		return error;
+	}
+	if (old_agino)
+		*old_agino = be32_to_cpu(dip->di_next_unlinked);
+	dip->di_next_unlinked = cpu_to_be32(next_agino);
+	offset = ip->i_imap.im_boffset +
+				offsetof(struct xfs_dinode, di_next_unlinked);
+
+	/* need to recalc the inode CRC if appropriate */
+	xfs_dinode_calc_crc(mp, dip);
+
+	xfs_trans_inode_buf(tp, ibp);
+	xfs_trans_log_buf(tp, ibp, offset, (offset + sizeof(xfs_agino_t) - 1));
+	xfs_inobp_check(mp, ibp);
+	return 0;
+}
+
 /*
  * This is called when the inode's link count goes to 0 or we are creating a
  * tmpfile via O_TMPFILE. In the case of a tmpfile, @ignore_linkcount will be
@@ -2068,23 +2110,32 @@ xfs_iunlink(
 	struct xfs_trans *tp,
 	struct xfs_inode *ip)
 {
-	xfs_mount_t	*mp = tp->t_mountp;
-	xfs_agi_t	*agi;
-	xfs_dinode_t	*dip;
-	xfs_buf_t	*agibp;
-	xfs_buf_t	*ibp;
+	struct xfs_mount *mp = tp->t_mountp;
+	struct xfs_perag *pag;
+	struct xfs_agi	*agi;
+	struct xfs_buf	*agibp;
 	xfs_agino_t	agino;
+	xfs_agino_t	next_agino;
+	xfs_agnumber_t	agno;
+	struct list_head *unlinked_list;
 	short		bucket_index;
 	int		offset;
 	int		error;
 
 	ASSERT(VFS_I(ip)->i_mode != 0);
 
+	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	ASSERT(agino != 0);
+	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
+
+	agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	pag = xfs_perag_get(mp, agno);
+	unlinked_list = &pag->pagi_unlinked_list[bucket_index];
+
 	/*
-	 * Get the agi buffer first.  It ensures lock ordering
-	 * on the list.
+	 * Get the agi buffer first.  It ensures lock ordering on the list.
 	 */
-	error = xfs_read_agi(mp, tp, XFS_INO_TO_AGNO(mp, ip->i_ino), &agibp);
+	error = xfs_read_agi(mp, tp, agno, &agibp);
 	if (error)
 		return error;
 	agi = XFS_BUF_TO_AGI(agibp);
@@ -2093,48 +2144,57 @@ xfs_iunlink(
 	 * Get the index into the agi hash table for the
 	 * list this inode will go on.
 	 */
-	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	ASSERT(agino != 0);
-	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
-	ASSERT(agi->agi_unlinked[bucket_index]);
-	ASSERT(be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino);
+	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	ASSERT(next_agino != 0);
+	ASSERT(next_agino != agino);
+
+	mutex_lock(&pag->pagi_unlink_lock);
+	if (next_agino != NULLAGINO) {
+		xfs_agino_t	old_agino;
+#ifdef DEBUG
+		struct xfs_inode *nip;
+
+		ASSERT(!list_empty(unlinked_list));
+		nip = list_first_entry(unlinked_list, struct xfs_inode,
+					i_unlinked_list);
+		ASSERT(XFS_INO_TO_AGINO(mp, nip->i_ino) == next_agino);
+#endif
 
-	if (agi->agi_unlinked[bucket_index] != cpu_to_be32(NULLAGINO)) {
 		/*
 		 * There is already another inode in the bucket we need
 		 * to add ourselves to.  Add us at the front of the list.
 		 * Here we put the head pointer into our next pointer,
 		 * and then we fall through to point the head at us.
 		 */
-		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
-				       0, 0);
-		if (error)
-			return error;
-
-		ASSERT(dip->di_next_unlinked == cpu_to_be32(NULLAGINO));
-		dip->di_next_unlinked = agi->agi_unlinked[bucket_index];
-		offset = ip->i_imap.im_boffset +
-			offsetof(xfs_dinode_t, di_next_unlinked);
-
-		/* need to recalc the inode CRC if appropriate */
-		xfs_dinode_calc_crc(mp, dip);
-
-		xfs_trans_inode_buf(tp, ibp);
-		xfs_trans_log_buf(tp, ibp, offset,
-				  (offset + sizeof(xfs_agino_t) - 1));
-		xfs_inobp_check(mp, ibp);
+		error = xfs_iunlink_mod(tp, ip, next_agino, &old_agino);
+		if (error) {
+			mutex_unlock(&pag->pagi_unlink_lock);
+			goto out;
+		}
+		ASSERT(old_agino == NULLAGINO);
+	} else {
+		ASSERT(list_empty(unlinked_list));
 	}
 
+	/*
+	 * As we always add the inode at the head of the on-disk AGI list,
+	 * add it at the head of the incore list here too.
+	 */
+	list_add(&ip->i_unlinked_list, unlinked_list);
+	mutex_unlock(&pag->pagi_unlink_lock);
+
 	/*
 	 * Point the bucket head pointer at the inode being inserted.
 	 */
-	ASSERT(agino != 0);
 	agi->agi_unlinked[bucket_index] = cpu_to_be32(agino);
 	offset = offsetof(xfs_agi_t, agi_unlinked) +
 		(sizeof(xfs_agino_t) * bucket_index);
 	xfs_trans_log_buf(tp, agibp, offset,
 			  (offset + sizeof(xfs_agino_t) - 1));
-	return 0;
+	pag->pagi_unlinked_count++;
+out:
+	xfs_perag_put(pag);
+	return error;
 }
 
 /*
@@ -2145,81 +2205,72 @@ xfs_iunlink_remove(
 	xfs_trans_t	*tp,
 	xfs_inode_t	*ip)
 {
-	xfs_ino_t	next_ino;
-	xfs_mount_t	*mp;
-	xfs_agi_t	*agi;
-	xfs_dinode_t	*dip;
-	xfs_buf_t	*agibp;
-	xfs_buf_t	*ibp;
+	struct xfs_mount *mp = tp->t_mountp;
+	struct xfs_perag *pag;
+	struct xfs_agi	*agi;
+	struct xfs_buf	*agibp;
 	xfs_agnumber_t	agno;
 	xfs_agino_t	agino;
+	xfs_agino_t	old_agino;
 	xfs_agino_t	next_agino;
-	xfs_buf_t	*last_ibp;
-	xfs_dinode_t	*last_dip = NULL;
+	struct list_head *unlinked_list;
 	short		bucket_index;
-	int		offset, last_offset = 0;
 	int		error;
 
-	mp = tp->t_mountp;
 	agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	if (!xfs_verify_agino(mp, agno, agino))
+		return -EFSCORRUPTED;
+
+	pag = xfs_perag_get(mp, agno);
+	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
+	unlinked_list = &pag->pagi_unlinked_list[bucket_index];
 
 	/*
-	 * Get the agi buffer first.  It ensures lock ordering
-	 * on the list.
+	 * Get the agi buffer first.  It ensures lock ordering on the list.
 	 */
 	error = xfs_read_agi(mp, tp, agno, &agibp);
 	if (error)
 		return error;
-
 	agi = XFS_BUF_TO_AGI(agibp);
 
-	/*
-	 * Get the index into the agi hash table for the
-	 * list this inode will go on.
-	 */
-	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	if (!xfs_verify_agino(mp, agno, agino))
-		return -EFSCORRUPTED;
-	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
-	if (!xfs_verify_agino(mp, agno,
-			be32_to_cpu(agi->agi_unlinked[bucket_index]))) {
+	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	if (!xfs_verify_agino(mp, agno, next_agino)) {
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 				agi, sizeof(*agi));
 		return -EFSCORRUPTED;
 	}
 
-	if (be32_to_cpu(agi->agi_unlinked[bucket_index]) == agino) {
+	mutex_lock(&pag->pagi_unlink_lock);
+	if (unlinked_list->next == &ip->i_unlinked_list) {
+		int	offset;
+
+		ASSERT(next_agino == agino);
 		/*
-		 * We're at the head of the list.  Get the inode's on-disk
-		 * buffer to see if there is anyone after us on the list.
-		 * Only modify our next pointer if it is not already NULLAGINO.
-		 * This saves us the overhead of dealing with the buffer when
-		 * there is no need to change it.
+		 * We're at the head of the list. Check if there is anyone
+		 * after us on the list, and if so modify the on disk pointers
+		 * to remove us from the list.
 		 */
-		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
-				       0, 0);
-		if (error) {
-			xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
-				__func__, error);
-			return error;
-		}
-		next_agino = be32_to_cpu(dip->di_next_unlinked);
-		ASSERT(next_agino != 0);
-		if (next_agino != NULLAGINO) {
-			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
-			offset = ip->i_imap.im_boffset +
-				offsetof(xfs_dinode_t, di_next_unlinked);
-
-			/* need to recalc the inode CRC if appropriate */
-			xfs_dinode_calc_crc(mp, dip);
-
-			xfs_trans_inode_buf(tp, ibp);
-			xfs_trans_log_buf(tp, ibp, offset,
-					  (offset + sizeof(xfs_agino_t) - 1));
-			xfs_inobp_check(mp, ibp);
-		} else {
-			xfs_trans_brelse(tp, ibp);
+		next_agino = NULLAGINO;
+		list_del_init(&ip->i_unlinked_list);
+		if (!list_empty(unlinked_list)) {
+			struct xfs_inode *nip;
+
+			/*
+			 * If there are more entries on the list, clear the on-disk
+			 * unlinked list pointer in the inode, then get the
+			 * pointer to the new list head.
+			 */
+			error = xfs_iunlink_mod(tp, ip, NULLAGINO, &old_agino);
+			if (error)
+				goto out_unlock;
+
+			nip = list_first_entry(unlinked_list, struct xfs_inode,
+						i_unlinked_list);
+			next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
+			ASSERT(old_agino == next_agino);
 		}
+
 		/*
 		 * Point the bucket head pointer at the next inode.
 		 */
@@ -2230,92 +2281,31 @@ xfs_iunlink_remove(
 			(sizeof(xfs_agino_t) * bucket_index);
 		xfs_trans_log_buf(tp, agibp, offset,
 				  (offset + sizeof(xfs_agino_t) - 1));
-	} else {
-		/*
-		 * We need to search the list for the inode being freed.
-		 */
-		next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
-		last_ibp = NULL;
-		while (next_agino != agino) {
-			struct xfs_imap	imap;
 
-			if (last_ibp)
-				xfs_trans_brelse(tp, last_ibp);
 
-			imap.im_blkno = 0;
-			next_ino = XFS_AGINO_TO_INO(mp, agno, next_agino);
-
-			error = xfs_imap(mp, tp, next_ino, &imap, 0);
-			if (error) {
-				xfs_warn(mp,
-	"%s: xfs_imap returned error %d.",
-					 __func__, error);
-				return error;
-			}
-
-			error = xfs_imap_to_bp(mp, tp, &imap, &last_dip,
-					       &last_ibp, 0, 0);
-			if (error) {
-				xfs_warn(mp,
-	"%s: xfs_imap_to_bp returned error %d.",
-					__func__, error);
-				return error;
-			}
-
-			last_offset = imap.im_boffset;
-			next_agino = be32_to_cpu(last_dip->di_next_unlinked);
-			if (!xfs_verify_agino(mp, agno, next_agino)) {
-				XFS_CORRUPTION_ERROR(__func__,
-						XFS_ERRLEVEL_LOW, mp,
-						last_dip, sizeof(*last_dip));
-				return -EFSCORRUPTED;
-			}
-		}
-
-		/*
-		 * Now last_ibp points to the buffer previous to us on the
-		 * unlinked list.  Pull us from the list.
-		 */
-		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
-				       0, 0);
-		if (error) {
-			xfs_warn(mp, "%s: xfs_imap_to_bp(2) returned error %d.",
-				__func__, error);
-			return error;
-		}
-		next_agino = be32_to_cpu(dip->di_next_unlinked);
-		ASSERT(next_agino != 0);
-		ASSERT(next_agino != agino);
-		if (next_agino != NULLAGINO) {
-			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
-			offset = ip->i_imap.im_boffset +
-				offsetof(xfs_dinode_t, di_next_unlinked);
-
-			/* need to recalc the inode CRC if appropriate */
-			xfs_dinode_calc_crc(mp, dip);
-
-			xfs_trans_inode_buf(tp, ibp);
-			xfs_trans_log_buf(tp, ibp, offset,
-					  (offset + sizeof(xfs_agino_t) - 1));
-			xfs_inobp_check(mp, ibp);
-		} else {
-			xfs_trans_brelse(tp, ibp);
-		}
+	} else {
 		/*
-		 * Point the previous inode on the list to the next inode.
+		 * Get the previous inode in the list, point it at the next
+		 * one in the list.
 		 */
-		last_dip->di_next_unlinked = cpu_to_be32(next_agino);
-		ASSERT(next_agino != 0);
-		offset = last_offset + offsetof(xfs_dinode_t, di_next_unlinked);
+		struct xfs_inode *lip = list_entry(ip->i_unlinked_list.prev,
+					struct xfs_inode, i_unlinked_list);
 
-		/* need to recalc the inode CRC if appropriate */
-		xfs_dinode_calc_crc(mp, last_dip);
+		list_del_init(&ip->i_unlinked_list);
+		error = xfs_iunlink_mod(tp, ip, NULLAGINO, &next_agino);
+		if (error)
+			goto out_unlock;
+
+		error = xfs_iunlink_mod(tp, lip, next_agino, &old_agino);
+		if (error)
+			goto out_unlock;
+		ASSERT(old_agino == agino);
 
-		xfs_trans_inode_buf(tp, last_ibp);
-		xfs_trans_log_buf(tp, last_ibp, offset,
-				  (offset + sizeof(xfs_agino_t) - 1));
-		xfs_inobp_check(mp, last_ibp);
 	}
+	pag->pagi_unlinked_count--;
+out_unlock:
+	mutex_unlock(&pag->pagi_unlink_lock);
+	xfs_perag_put(pag);
-	return 0;
+	return error;
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e95e30a94243..d00c6e2b806b 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -47,6 +47,7 @@ typedef struct xfs_inode {
 	atomic_t		i_pincount;	/* inode pin count */
 	spinlock_t		i_flags_lock;	/* inode i_flags lock */
 	/* Miscellaneous state. */
+	struct list_head	i_unlinked_list;
 	unsigned long		i_flags;	/* see defined flags below */
 	unsigned int		i_delayed_blks;	/* count of delay alloc blks */
 
@@ -55,6 +56,7 @@ typedef struct xfs_inode {
 	xfs_extnum_t		i_cnextents;	/* # of extents in cow fork */
 	unsigned int		i_cformat;	/* format of cow fork */
 
+
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
 } xfs_inode_t;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index aa953d2018dd..591e52e93236 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -185,6 +185,7 @@ xfs_initialize_perag(
 	xfs_agnumber_t	first_initialised = NULLAGNUMBER;
 	xfs_perag_t	*pag;
 	int		error = -ENOMEM;
+	int		i;
 
 	/*
 	 * Walk the current per-ag tree so we don't try to initialise AGs
@@ -230,6 +231,12 @@ xfs_initialize_perag(
 		/* first new pag is fully initialized */
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
+
+		/* in-core inode unlinked lists */
+		mutex_init(&pag->pagi_unlink_lock);
+		for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++)
+			INIT_LIST_HEAD(&pag->pagi_unlinked_list[i]);
+
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 34f2cf96ec27..9c4e9c9ceb94 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -402,6 +402,11 @@ typedef struct xfs_perag {
 
 	/* reference count */
 	uint8_t			pagf_refcount_level;
+
+	/* inode unlinked lists */
+	struct mutex		pagi_unlink_lock;
+	struct list_head	pagi_unlinked_list[XFS_AGI_UNLINKED_BUCKETS];
+	uint32_t		pagi_unlinked_count;
 } xfs_perag_t;
 
 static inline struct xfs_ag_resv *

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/12] xfs: free COW staging extents when freezing filesystem
  2019-01-01  2:16 ` [PATCH 01/12] xfs: free COW staging extents when freezing filesystem Darrick J. Wong
@ 2019-01-11 16:28   ` Brian Foster
  2019-01-17 17:24     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Foster @ 2019-01-11 16:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Dec 31, 2018 at 06:16:57PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we're freezing the filesystem, free all the COW staging extents
> before we shut the log down so that we can minimize the amount of
> recovery work that will be necessary during the next mount.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_icache.c |   17 ++++++++++++-----
>  fs/xfs/xfs_super.c  |   18 ++++++++++++++++++
>  2 files changed, 30 insertions(+), 5 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 245483cc282b..36d986087abb 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1741,6 +1741,7 @@ xfs_inode_free_cowblocks(
>  	void			*args)
>  {
>  	struct xfs_eofblocks	*eofb = args;
> +	uint			lock_mode = 0;
>  	int			match;
>  	int			ret = 0;
>  
> @@ -1761,9 +1762,15 @@ xfs_inode_free_cowblocks(
>  			return 0;
>  	}
>  
> -	/* Free the CoW blocks */
> -	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +	/*
> +	 * Free the CoW blocks.  We don't need to lock the inode if we're in
> +	 * the process of freezing the filesystem because we've already locked
> +	 * out writes and page faults.
> +	 */
> +	if (ip->i_mount->m_super->s_writers.frozen == SB_UNFROZEN)
> +		lock_mode = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;

Ok, but is there a problem with actually locking in this context? If so,
I'd expect the comment to say something like: "Free the COW blocks.
Don't lock the inode if we're in the process of freezing the filesystem
because <some bad thing happens>. This is safe because we've already
locked out writes and page faults."

> +	if (lock_mode)
> +		xfs_ilock(ip, lock_mode);
>  
>  	/*
>  	 * Check again, nobody else should be able to dirty blocks or change
> @@ -1772,8 +1779,8 @@ xfs_inode_free_cowblocks(
>  	if (xfs_prep_free_cowblocks(ip))
>  		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
>  
> -	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> -	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> +	if (lock_mode)
> +		xfs_iunlock(ip, lock_mode);
>  
>  	return ret;
>  }
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 6903b36eac5d..bb4953cfd683 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1102,6 +1102,24 @@ xfs_fs_sync_fs(
>  	if (!wait)
>  		return 0;
>  
> +	/*
> +	 * If we're in the middle of freezing the filesystem, this is our last
> +	 * chance to run regular transactions.  Clear all the COW staging
> +	 * extents so that we can freeze the filesystem with as little recovery
> +	 * work to do at the next mount as possible.  It's safe to do this
> +	 * without locking inodes because the freezer code flushed all the
> +	 * dirty data from the page cache and locked out writes and page faults.
> +	 */

Well, we have XFS_TRANS_NO_WRITECOUNT. It seems we could use an eofb
freeze flag or some such to notify the scan that we're in a special
freeze context and in turn use that to tweak the locking (instead of the
SB_FREEZE_* checks) and avoid blocking on transaction allocation. Would
that allow us to run this in xfs_fs_freeze() context?
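
Something like this untested sketch is what I have in mind
(XFS_EOF_FLAGS_FREEZE is a made up flag name):

	/* in xfs_fs_freeze(), rather than hooking sync_fs */
	struct xfs_eofblocks	eofb = {
		.eof_flags	= XFS_EOF_FLAGS_FREEZE,
	};

	error = xfs_icache_free_cowblocks(mp, &eofb);

... and then in xfs_inode_free_cowblocks():

	/* the freezer has already locked out writes and page faults */
	if (!(eofb && (eofb->eof_flags & XFS_EOF_FLAGS_FREEZE)))
		lock_mode = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;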

I'm also a little curious about the high-level tradeoff we're making.
This patch means that a freeze, and thus something like an LVM snapshot,
would clear/reset all COW blocks in the fs, right? If so, ISTM that
could be annoying if the user had a bunch of reflinked vdisk images or
something on the fs.

Is the tradeoff just that if the user freezes and then doesn't unfreeze
before the next mount that log recovery will have to deal with COW
blocks, or are there considerations for a typical freeze/unfreeze cycle
as well? Is this more expensive at recovery time than at freeze time?

Brian

> +	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> +		int		error;
> +
> +		error = xfs_icache_free_cowblocks(mp, NULL);
> +		if (error) {
> +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +			return error;
> +		}
> +	}
> +
>  	xfs_log_force(mp, XFS_LOG_SYNC);
>  	if (laptop_mode) {
>  		/*
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/12] xfs: refactor the predicate part of xfs_free_eofblocks
  2019-01-01  2:17 ` [PATCH 02/12] xfs: refactor the predicate part of xfs_free_eofblocks Darrick J. Wong
@ 2019-01-11 19:05   ` Brian Foster
  2019-01-17 17:33     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Foster @ 2019-01-11 19:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Dec 31, 2018 at 06:17:03PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor the part of _free_eofblocks that decides if it's really going
> to truncate post-EOF blocks into a separate helper function.  The
> upcoming deferred inode inactivation patch requires us to be able to
> decide this prior to actual inactivation.  No functionality changes.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_bmap_util.c |  101 ++++++++++++++++++++----------------------------
>  fs/xfs/xfs_inode.c     |   32 +++++++++++++++
>  fs/xfs/xfs_inode.h     |    1 
>  3 files changed, 75 insertions(+), 59 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c8bf02be0003..662ee537ffb5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3568,3 +3568,35 @@ xfs_irele(
>  	trace_xfs_irele(ip, _RET_IP_);
>  	iput(VFS_I(ip));
>  }
> +
> +/*
> > + * Decide if this inode has post-EOF blocks.  The caller is responsible
> + * for knowing / caring about the PREALLOC/APPEND flags.
> + */
> +bool
> +xfs_inode_has_posteof_blocks(

I think something like xfs_has_eofblocks() is more consistent with the
other related function names (e.g., xfs_free_eofblocks(),
xfs_can_free_eofblocks(), etc.).

> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_bmbt_irec	imap;
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_fileoff_t		end_fsb;
> +	xfs_fileoff_t		last_fsb;
> +	xfs_filblks_t		map_len;
> +	int			nimaps;
> +	int			error;
> +
> +	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
> +	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
> +	if (last_fsb <= end_fsb)
> +		return false;
> +	map_len = last_fsb - end_fsb;
> +
> +	nimaps = 1;
> +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> +	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
> +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +
> +	return !error && (nimaps != 0) &&
> +	       (imap.br_startblock != HOLESTARTBLOCK ||
> +	        ip->i_delayed_blks);

I don't think we should be suppressing this error. It's a divergence
from the current code at least.
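
One way to keep it (sketch only, with the function gaining an int
*error out parameter) would be:

	nimaps = 1;
	xfs_ilock(ip, XFS_ILOCK_SHARED);
	*error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
	xfs_iunlock(ip, XFS_ILOCK_SHARED);
	if (*error)
		return false;

	return nimaps != 0 &&
	       (imap.br_startblock != HOLESTARTBLOCK ||
	        ip->i_delayed_blks);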

Brian

> +}
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index be2014520155..02a938661ba8 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -499,5 +499,6 @@ extern struct kmem_zone	*xfs_inode_zone;
>  #define XFS_DEFAULT_COWEXTSZ_HINT 32
>  
>  bool xfs_inode_verify_forks(struct xfs_inode *ip);
> +bool xfs_inode_has_posteof_blocks(struct xfs_inode *ip);
>  
>  #endif	/* __XFS_INODE_H__ */
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 06/12] xfs: refactor walking of per-AG RECLAIM inodes
  2019-01-01  2:17 ` [PATCH 06/12] xfs: refactor walking of per-AG RECLAIM inodes Darrick J. Wong
@ 2019-01-11 19:06   ` Brian Foster
  2019-01-17 17:43     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Foster @ 2019-01-11 19:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Dec 31, 2018 at 06:17:27PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor the code that walks reclaim-tagged inodes so that we can reuse
> the same loop in a subsequent patch.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_icache.c |  170 +++++++++++++++++++++++++++++----------------------
>  1 file changed, 97 insertions(+), 73 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 36d986087abb..7e031eb6f048 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
...
> @@ -1223,6 +1223,90 @@ xfs_reclaim_inode(
>  	return 0;
>  }
>  
> +/*
> + * Walk the RECLAIM tagged inodes in this AG looking for inodes to inactivate.
> + */
> +STATIC int
> +xfs_walk_ag_reclaim_inos(

Nit: how about xfs_reclaim_inodes_pag()? Hm, maybe we can rename
xfs_reclaim_inodes_ag() to something like __xfs_reclaim_inodes() as
well.

Brian

> +	struct xfs_perag	*pag,
> +	int			sync_flags,
> +	bool			(*grab_fn)(struct xfs_inode *ip,
> +					   int sync_flags),
> +	int			(*execute_fn)(struct xfs_inode *ip,
> +					      struct xfs_perag *pag,
> +					      int sync_flags),
> +	int			*nr_to_scan,
> +	bool			*done)
> +{
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	unsigned long		first_index = 0;
> +	int			nr_found = 0;
> +	int			last_error = 0;
> +	int			error;
> +
> +	do {
> +		struct xfs_inode *batch[XFS_LOOKUP_BATCH];
> +		int	i;
> +
> +		rcu_read_lock();
> +		nr_found = radix_tree_gang_lookup_tag(&pag->pag_ici_root,
> +				(void **)batch, first_index, XFS_LOOKUP_BATCH,
> +				XFS_ICI_RECLAIM_TAG);
> +		if (!nr_found) {
> +			*done = true;
> +			rcu_read_unlock();
> +			break;
> +		}
> +
> +		/*
> +		 * Grab the inodes before we drop the lock. if we found
> +		 * nothing, nr == 0 and the loop will be skipped.
> +		 */
> +		for (i = 0; i < nr_found; i++) {
> +			struct xfs_inode *ip = batch[i];
> +
> +			if (*done || !grab_fn(ip, sync_flags))
> +				batch[i] = NULL;
> +
> +			/*
> +			 * Update the index for the next lookup. Catch
> +			 * overflows into the next AG range which can occur if
> +			 * we have inodes in the last block of the AG and we
> +			 * are currently pointing to the last inode.
> +			 *
> +			 * Because we may see inodes that are from the wrong AG
> +			 * due to RCU freeing and reallocation, only update the
> +			 * index if it lies in this AG. It was a race that lead
> +			 * us to see this inode, so another lookup from the
> +			 * same index will not find it again.
> +			 */
> +			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
> +				continue;
> +			first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
> +			if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
> +				*done = true;
> +		}
> +
> +		/* unlock now we've grabbed the inodes. */
> +		rcu_read_unlock();
> +
> +		for (i = 0; i < nr_found; i++) {
> +			if (!batch[i])
> +				continue;
> +			error = execute_fn(batch[i], pag, sync_flags);
> +			if (error && last_error != -EFSCORRUPTED)
> +				last_error = error;
> +		}
> +
> +		*nr_to_scan -= XFS_LOOKUP_BATCH;
> +
> +		cond_resched();
> +
> +	} while (nr_found && !*done && *nr_to_scan > 0);
> +
> +	return last_error;
> +}
> +
>  /*
>   * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
>   * corrupted, we still want to try to reclaim all the inodes. If we don't,
> @@ -1236,8 +1320,8 @@ xfs_reclaim_inodes_ag(
>  	int			*nr_to_scan)
>  {
>  	struct xfs_perag	*pag;
> -	int			error = 0;
>  	int			last_error = 0;
> +	int			error;
>  	xfs_agnumber_t		ag;
>  	int			trylock = flags & SYNC_TRYLOCK;
>  	int			skipped;
> @@ -1247,8 +1331,7 @@ xfs_reclaim_inodes_ag(
>  	skipped = 0;
>  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
>  		unsigned long	first_index = 0;
> -		int		done = 0;
> -		int		nr_found = 0;
> +		bool		done = false;
>  
>  		ag = pag->pag_agno + 1;
>  
> @@ -1262,70 +1345,11 @@ xfs_reclaim_inodes_ag(
>  		} else
>  			mutex_lock(&pag->pag_ici_reclaim_lock);
>  
> -		do {
> -			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
> -			int	i;
> -
> -			rcu_read_lock();
> -			nr_found = radix_tree_gang_lookup_tag(
> -					&pag->pag_ici_root,
> -					(void **)batch, first_index,
> -					XFS_LOOKUP_BATCH,
> -					XFS_ICI_RECLAIM_TAG);
> -			if (!nr_found) {
> -				done = 1;
> -				rcu_read_unlock();
> -				break;
> -			}
> -
> -			/*
> -			 * Grab the inodes before we drop the lock. if we found
> -			 * nothing, nr == 0 and the loop will be skipped.
> -			 */
> -			for (i = 0; i < nr_found; i++) {
> -				struct xfs_inode *ip = batch[i];
> -
> -				if (done || xfs_reclaim_inode_grab(ip, flags))
> -					batch[i] = NULL;
> -
> -				/*
> -				 * Update the index for the next lookup. Catch
> -				 * overflows into the next AG range which can
> -				 * occur if we have inodes in the last block of
> -				 * the AG and we are currently pointing to the
> -				 * last inode.
> -				 *
> -				 * Because we may see inodes that are from the
> -				 * wrong AG due to RCU freeing and
> -				 * reallocation, only update the index if it
> -				 * lies in this AG. It was a race that lead us
> -				 * to see this inode, so another lookup from
> -				 * the same index will not find it again.
> -				 */
> -				if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
> -								pag->pag_agno)
> -					continue;
> -				first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
> -				if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
> -					done = 1;
> -			}
> -
> -			/* unlock now we've grabbed the inodes. */
> -			rcu_read_unlock();
> -
> -			for (i = 0; i < nr_found; i++) {
> -				if (!batch[i])
> -					continue;
> -				error = xfs_reclaim_inode(batch[i], pag, flags);
> -				if (error && last_error != -EFSCORRUPTED)
> -					last_error = error;
> -			}
> -
> -			*nr_to_scan -= XFS_LOOKUP_BATCH;
> -
> -			cond_resched();
> -
> -		} while (nr_found && !done && *nr_to_scan > 0);
> +		error = xfs_walk_ag_reclaim_inos(pag, flags,
> +				xfs_reclaim_inode_grab, xfs_reclaim_inode,
> +				nr_to_scan, &done);
> +		if (error && last_error != -EFSCORRUPTED)
> +			last_error = error;
>  
>  		if (trylock && !done)
>  			pag->pag_ici_reclaim_cursor = first_index;
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/12] xfs: free COW staging extents when freezing filesystem
  2019-01-11 16:28   ` Brian Foster
@ 2019-01-17 17:24     ` Darrick J. Wong
  2019-01-17 18:14       ` Brian Foster
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-17 17:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 11, 2019 at 11:28:38AM -0500, Brian Foster wrote:
> On Mon, Dec 31, 2018 at 06:16:57PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we're freezing the filesystem, free all the COW staging extents
> > before we shut the log down so that we can minimize the amount of
> > recovery work that will be necessary during the next mount.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_icache.c |   17 ++++++++++++-----
> >  fs/xfs/xfs_super.c  |   18 ++++++++++++++++++
> >  2 files changed, 30 insertions(+), 5 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 245483cc282b..36d986087abb 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1741,6 +1741,7 @@ xfs_inode_free_cowblocks(
> >  	void			*args)
> >  {
> >  	struct xfs_eofblocks	*eofb = args;
> > +	uint			lock_mode = 0;
> >  	int			match;
> >  	int			ret = 0;
> >  
> > @@ -1761,9 +1762,15 @@ xfs_inode_free_cowblocks(
> >  			return 0;
> >  	}
> >  
> > -	/* Free the CoW blocks */
> > -	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> > -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > +	/*
> > +	 * Free the CoW blocks.  We don't need to lock the inode if we're in
> > +	 * the process of freezing the filesystem because we've already locked
> > +	 * out writes and page faults.
> > +	 */
> > +	if (ip->i_mount->m_super->s_writers.frozen == SB_UNFROZEN)
> > +		lock_mode = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> 
> Ok, but is there a problem with actually locking in this context? If so,
> I'd expect the comment to say something like: "Free the COW blocks.
> Don't lock the inode if we're in the process of freezing the filesystem
> because <some bad thing happens>. This is safe because we've already
> locked out writes and page faults."

Hmmm, it's now been so long since I wrote this patch I don't remember
why I did this.  Wait something's coming back...

> > +	if (lock_mode)
> > +		xfs_ilock(ip, lock_mode);
> >  
> >  	/*
> >  	 * Check again, nobody else should be able to dirty blocks or change
> > @@ -1772,8 +1779,8 @@ xfs_inode_free_cowblocks(
> >  	if (xfs_prep_free_cowblocks(ip))
> >  		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> >  
> > -	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > -	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> > +	if (lock_mode)
> > +		xfs_iunlock(ip, lock_mode);
> >  
> >  	return ret;
> >  }
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 6903b36eac5d..bb4953cfd683 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1102,6 +1102,24 @@ xfs_fs_sync_fs(
> >  	if (!wait)
> >  		return 0;
> >  
> > +	/*
> > +	 * If we're in the middle of freezing the filesystem, this is our last
> > +	 * chance to run regular transactions.  Clear all the COW staging
> > +	 * extents so that we can freeze the filesystem with as little recovery
> > +	 * work to do at the next mount as possible.  It's safe to do this
> > +	 * without locking inodes because the freezer code flushed all the
> > +	 * dirty data from the page cache and locked out writes and page faults.
> > +	 */
> 
> Well, we have XFS_TRANS_NO_WRITECOUNT. It seems we could use an eofb
> freeze flag or some such to notify the scan that we're in a special
> freeze context and in turn use that to tweak the locking (instead of the
> SB_FREEZE_* checks) and avoid blocking on transaction allocation. Would
> that allow us to run this in xfs_fs_freeze() context?

...oh, right.  Another thread could have taken the io/mmap locks (say
for fallocate) before the fs entered SB_FROZEN, and is now blocked
in xfs_trans_alloc waiting for the fs to unfreeze.  I can't remember the
exact circumstances though.

> I'm also a little curious about the high-level tradeoff we're making.
> This patch means that a freeze, and thus something like an LVM snapshot,
> would clear/reset all COW blocks in the fs, right? If so, ISTM that
> could be annoying if the user had a bunch of reflinked vdisk images or
> something on the fs.

Yeah, it's sort of painful to drop all those speculative preallocations.

> Is the tradeoff just that if the user freezes and then doesn't unfreeze
> before the next mount that log recovery will have to deal with COW
> blocks, or are there considerations for a typical freeze/unfreeze cycle
> as well? Is this more expensive at recovery time than at freeze time?

It's entirely to reduce the recovery work necessary after freezing the
fs, snapshotting the fs (via lvm or whatever) and then mounting the
snapshot separately.  TBH this patch has languished for years now
because it's only necessary to optimize this one usage.

Maybe it's easier to drop this one, unless anyone feels particularly
passionate about clearing out cow staging extents before a freeze?

--D

> Brian
> 
> > +	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > +		int		error;
> > +
> > +		error = xfs_icache_free_cowblocks(mp, NULL);
> > +		if (error) {
> > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +			return error;
> > +		}
> > +	}
> > +
> >  	xfs_log_force(mp, XFS_LOG_SYNC);
> >  	if (laptop_mode) {
> >  		/*
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/12] xfs: refactor the predicate part of xfs_free_eofblocks
  2019-01-11 19:05   ` Brian Foster
@ 2019-01-17 17:33     ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-17 17:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 11, 2019 at 02:05:47PM -0500, Brian Foster wrote:
> On Mon, Dec 31, 2018 at 06:17:03PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Refactor the part of _free_eofblocks that decides if it's really going
> > to truncate post-EOF blocks into a separate helper function.  The
> > upcoming deferred inode inactivation patch requires us to be able to
> > decide this prior to actual inactivation.  No functionality changes.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c |  101 ++++++++++++++++++++----------------------------
> >  fs/xfs/xfs_inode.c     |   32 +++++++++++++++
> >  fs/xfs/xfs_inode.h     |    1 
> >  3 files changed, 75 insertions(+), 59 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index c8bf02be0003..662ee537ffb5 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3568,3 +3568,35 @@ xfs_irele(
> >  	trace_xfs_irele(ip, _RET_IP_);
> >  	iput(VFS_I(ip));
> >  }
> > +
> > +/*
> > > + * Decide if this inode has post-EOF blocks.  The caller is responsible
> > + * for knowing / caring about the PREALLOC/APPEND flags.
> > + */
> > +bool
> > +xfs_inode_has_posteof_blocks(
> 
> I think something like xfs_has_eofblocks() is more consistent with the
> other related function names (i.e., xfs_free_eofblocks(),
> xfs_can_free_eofblocks(), etc.).

<nod>

> > +	struct xfs_inode	*ip)
> > +{
> > +	struct xfs_bmbt_irec	imap;
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	xfs_fileoff_t		end_fsb;
> > +	xfs_fileoff_t		last_fsb;
> > +	xfs_filblks_t		map_len;
> > +	int			nimaps;
> > +	int			error;
> > +
> > +	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
> > +	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
> > +	if (last_fsb <= end_fsb)
> > +		return false;
> > +	map_len = last_fsb - end_fsb;
> > +
> > +	nimaps = 1;
> > +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> > +	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
> > +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > +
> > +	return !error && (nimaps != 0) &&
> > +	       (imap.br_startblock != HOLESTARTBLOCK ||
> > +	        ip->i_delayed_blks);
> 
> I don't think we should be suppressing this error. It's a divergence
> from the current code at least.

Ooh, yeah, good catch.  I'll fix that up.

--D

> Brian
> 
> > +}
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index be2014520155..02a938661ba8 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -499,5 +499,6 @@ extern struct kmem_zone	*xfs_inode_zone;
> >  #define XFS_DEFAULT_COWEXTSZ_HINT 32
> >  
> >  bool xfs_inode_verify_forks(struct xfs_inode *ip);
> > +bool xfs_inode_has_posteof_blocks(struct xfs_inode *ip);
> >  
> >  #endif	/* __XFS_INODE_H__ */
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 06/12] xfs: refactor walking of per-AG RECLAIM inodes
  2019-01-11 19:06   ` Brian Foster
@ 2019-01-17 17:43     ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-17 17:43 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 11, 2019 at 02:06:06PM -0500, Brian Foster wrote:
> On Mon, Dec 31, 2018 at 06:17:27PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Refactor the code that walks reclaim-tagged inodes so that we can reuse
> > the same loop in a subsequent patch.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_icache.c |  170 +++++++++++++++++++++++++++++----------------------
> >  1 file changed, 97 insertions(+), 73 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 36d986087abb..7e031eb6f048 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> ...
> > @@ -1223,6 +1223,90 @@ xfs_reclaim_inode(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Walk the RECLAIM tagged inodes in this AG looking for inodes to inactivate.
> > + */
> > +STATIC int
> > +xfs_walk_ag_reclaim_inos(
> 
> Nit: how about xfs_reclaim_inodes_pag()? Hm, maybe we can rename
> xfs_reclaim_inodes_ag() to something like __xfs_reclaim_inodes() as
> well.

Ok, will do.  This whole section might change dramatically if the incore
inode radix tree becomes an xarray, but we'll see.

--D

> Brian
> 
> > +	struct xfs_perag	*pag,
> > +	int			sync_flags,
> > +	bool			(*grab_fn)(struct xfs_inode *ip,
> > +					   int sync_flags),
> > +	int			(*execute_fn)(struct xfs_inode *ip,
> > +					      struct xfs_perag *pag,
> > +					      int sync_flags),
> > +	int			*nr_to_scan,
> > +	bool			*done)
> > +{
> > +	struct xfs_mount	*mp = pag->pag_mount;
> > +	unsigned long		first_index = 0;
> > +	int			nr_found = 0;
> > +	int			last_error = 0;
> > +	int			error;
> > +
> > +	do {
> > +		struct xfs_inode *batch[XFS_LOOKUP_BATCH];
> > +		int	i;
> > +
> > +		rcu_read_lock();
> > +		nr_found = radix_tree_gang_lookup_tag(&pag->pag_ici_root,
> > +				(void **)batch, first_index, XFS_LOOKUP_BATCH,
> > +				XFS_ICI_RECLAIM_TAG);
> > +		if (!nr_found) {
> > +			*done = true;
> > +			rcu_read_unlock();
> > +			break;
> > +		}
> > +
> > +		/*
> > +		 * Grab the inodes before we drop the lock. if we found
> > +		 * nothing, nr == 0 and the loop will be skipped.
> > +		 */
> > +		for (i = 0; i < nr_found; i++) {
> > +			struct xfs_inode *ip = batch[i];
> > +
> > +			if (*done || !grab_fn(ip, sync_flags))
> > +				batch[i] = NULL;
> > +
> > +			/*
> > +			 * Update the index for the next lookup. Catch
> > +			 * overflows into the next AG range which can occur if
> > +			 * we have inodes in the last block of the AG and we
> > +			 * are currently pointing to the last inode.
> > +			 *
> > +			 * Because we may see inodes that are from the wrong AG
> > +			 * due to RCU freeing and reallocation, only update the
> > +			 * index if it lies in this AG. It was a race that lead
> > +			 * us to see this inode, so another lookup from the
> > +			 * same index will not find it again.
> > +			 */
> > +			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
> > +				continue;
> > +			first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
> > +			if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
> > +				*done = true;
> > +		}
> > +
> > +		/* unlock now we've grabbed the inodes. */
> > +		rcu_read_unlock();
> > +
> > +		for (i = 0; i < nr_found; i++) {
> > +			if (!batch[i])
> > +				continue;
> > +			error = execute_fn(batch[i], pag, sync_flags);
> > +			if (error && last_error != -EFSCORRUPTED)
> > +				last_error = error;
> > +		}
> > +
> > +		*nr_to_scan -= XFS_LOOKUP_BATCH;
> > +
> > +		cond_resched();
> > +
> > +	} while (nr_found && !*done && *nr_to_scan > 0);
> > +
> > +	return last_error;
> > +}
> > +
> >  /*
> >   * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
> >   * corrupted, we still want to try to reclaim all the inodes. If we don't,
> > @@ -1236,8 +1320,8 @@ xfs_reclaim_inodes_ag(
> >  	int			*nr_to_scan)
> >  {
> >  	struct xfs_perag	*pag;
> > -	int			error = 0;
> >  	int			last_error = 0;
> > +	int			error;
> >  	xfs_agnumber_t		ag;
> >  	int			trylock = flags & SYNC_TRYLOCK;
> >  	int			skipped;
> > @@ -1247,8 +1331,7 @@ xfs_reclaim_inodes_ag(
> >  	skipped = 0;
> >  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
> >  		unsigned long	first_index = 0;
> > -		int		done = 0;
> > -		int		nr_found = 0;
> > +		bool		done = false;
> >  
> >  		ag = pag->pag_agno + 1;
> >  
> > @@ -1262,70 +1345,11 @@ xfs_reclaim_inodes_ag(
> >  		} else
> >  			mutex_lock(&pag->pag_ici_reclaim_lock);
> >  
> > -		do {
> > -			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
> > -			int	i;
> > -
> > -			rcu_read_lock();
> > -			nr_found = radix_tree_gang_lookup_tag(
> > -					&pag->pag_ici_root,
> > -					(void **)batch, first_index,
> > -					XFS_LOOKUP_BATCH,
> > -					XFS_ICI_RECLAIM_TAG);
> > -			if (!nr_found) {
> > -				done = 1;
> > -				rcu_read_unlock();
> > -				break;
> > -			}
> > -
> > -			/*
> > -			 * Grab the inodes before we drop the lock. if we found
> > -			 * nothing, nr == 0 and the loop will be skipped.
> > -			 */
> > -			for (i = 0; i < nr_found; i++) {
> > -				struct xfs_inode *ip = batch[i];
> > -
> > -				if (done || xfs_reclaim_inode_grab(ip, flags))
> > -					batch[i] = NULL;
> > -
> > -				/*
> > -				 * Update the index for the next lookup. Catch
> > -				 * overflows into the next AG range which can
> > -				 * occur if we have inodes in the last block of
> > -				 * the AG and we are currently pointing to the
> > -				 * last inode.
> > -				 *
> > -				 * Because we may see inodes that are from the
> > -				 * wrong AG due to RCU freeing and
> > -				 * reallocation, only update the index if it
> > -				 * lies in this AG. It was a race that lead us
> > -				 * to see this inode, so another lookup from
> > -				 * the same index will not find it again.
> > -				 */
> > -				if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
> > -								pag->pag_agno)
> > -					continue;
> > -				first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
> > -				if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
> > -					done = 1;
> > -			}
> > -
> > -			/* unlock now we've grabbed the inodes. */
> > -			rcu_read_unlock();
> > -
> > -			for (i = 0; i < nr_found; i++) {
> > -				if (!batch[i])
> > -					continue;
> > -				error = xfs_reclaim_inode(batch[i], pag, flags);
> > -				if (error && last_error != -EFSCORRUPTED)
> > -					last_error = error;
> > -			}
> > -
> > -			*nr_to_scan -= XFS_LOOKUP_BATCH;
> > -
> > -			cond_resched();
> > -
> > -		} while (nr_found && !done && *nr_to_scan > 0);
> > +		error = xfs_walk_ag_reclaim_inos(pag, flags,
> > +				xfs_reclaim_inode_grab, xfs_reclaim_inode,
> > +				nr_to_scan, &done);
> > +		if (error && last_error != -EFSCORRUPTED)
> > +			last_error = error;
> >  
> >  		if (trylock && !done)
> >  			pag->pag_ici_reclaim_cursor = first_index;
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 07/12] xfs: refactor eofblocks inode match code
  2019-01-02  9:50   ` Nikolay Borisov
@ 2019-01-17 18:05     ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-17 18:05 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-xfs

On Wed, Jan 02, 2019 at 11:50:58AM +0200, Nikolay Borisov wrote:
> 
> 
> On 1.01.19 г. 4:17 ч., Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Refactor the code that determines if an inode matches an eofblocks
> > structure into a helper, since we already use it twice and we're about
> > to use it a third time.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_icache.c |  146 ++++++++++++++++++++++++++-------------------------
> >  1 file changed, 74 insertions(+), 72 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 7e031eb6f048..c1b457ba1b7b 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -27,6 +27,76 @@
> >  #include <linux/freezer.h>
> >  #include <linux/iversion.h>
> >  
> > +STATIC int
> > +xfs_inode_match_id(
> 
> This is a predicate so make it bool
> 
> > +	struct xfs_inode	*ip,
> > +	struct xfs_eofblocks	*eofb)
> > +{
> > +	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
> > +	    !uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
> > +		return 0;
> > +
> > +	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
> > +	    !gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
> > +		return 0;
> > +
> > +	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
> > +	    xfs_get_projid(ip) != eofb->eof_prid)
> > +		return 0;
> > +
> > +	return 1;
> > +}
> > +
> > +/*
> > + * A union-based inode filtering algorithm. Process the inode if any of the
> > + * criteria match. This is for global/internal scans only.
> > + */
> > +STATIC int
> 
> ditto

The originals returned int, but sure, I'll convert them to bool.

--D

> > +xfs_inode_match_id_union(
> > +	struct xfs_inode	*ip,
> > +	struct xfs_eofblocks	*eofb)
> > +{
> > +	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
> > +	    uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
> > +		return 1;
> > +
> > +	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
> > +	    gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
> > +		return 1;
> > +
> > +	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
> > +	    xfs_get_projid(ip) == eofb->eof_prid)
> > +		return 1;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Does this inode match the given parameters? */
> > +STATIC bool
> > +xfs_inode_matches_eofb(
> > +	struct xfs_inode	*ip,
> > +	struct xfs_eofblocks	*eofb)
> > +{
> > +	int			match;
> > +
> > +	if (!eofb)
> > +		return true;
> > +
> > +	if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
> > +		match = xfs_inode_match_id_union(ip, eofb);
> > +	else
> > +		match = xfs_inode_match_id(ip, eofb);
> > +	if (!match)
> > +		return false;
> > +
> > +	/* skip the inode if the file size is too small */
> > +	if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
> > +	    XFS_ISIZE(ip) < eofb->eof_min_file_size)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  /*
> >   * Allocate and initialise an xfs_inode.
> >   */
> > @@ -1424,50 +1494,6 @@ xfs_reclaim_inodes_count(
> >  	return reclaimable;
> >  }
> >  
> > -STATIC int
> > -xfs_inode_match_id(
> > -	struct xfs_inode	*ip,
> > -	struct xfs_eofblocks	*eofb)
> > -{
> > -	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
> > -	    !uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
> > -		return 0;
> > -
> > -	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
> > -	    !gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
> > -		return 0;
> > -
> > -	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
> > -	    xfs_get_projid(ip) != eofb->eof_prid)
> > -		return 0;
> > -
> > -	return 1;
> > -}
> > -
> > -/*
> > - * A union-based inode filtering algorithm. Process the inode if any of the
> > - * criteria match. This is for global/internal scans only.
> > - */
> > -STATIC int
> > -xfs_inode_match_id_union(
> > -	struct xfs_inode	*ip,
> > -	struct xfs_eofblocks	*eofb)
> > -{
> > -	if ((eofb->eof_flags & XFS_EOF_FLAGS_UID) &&
> > -	    uid_eq(VFS_I(ip)->i_uid, eofb->eof_uid))
> > -		return 1;
> > -
> > -	if ((eofb->eof_flags & XFS_EOF_FLAGS_GID) &&
> > -	    gid_eq(VFS_I(ip)->i_gid, eofb->eof_gid))
> > -		return 1;
> > -
> > -	if ((eofb->eof_flags & XFS_EOF_FLAGS_PRID) &&
> > -	    xfs_get_projid(ip) == eofb->eof_prid)
> > -		return 1;
> > -
> > -	return 0;
> > -}
> > -
> >  STATIC int
> >  xfs_inode_free_eofblocks(
> >  	struct xfs_inode	*ip,
> > @@ -1476,7 +1502,6 @@ xfs_inode_free_eofblocks(
> >  {
> >  	int ret = 0;
> >  	struct xfs_eofblocks *eofb = args;
> > -	int match;
> >  
> >  	if (!xfs_can_free_eofblocks(ip, false)) {
> >  		/* inode could be preallocated or append-only */
> > @@ -1493,19 +1518,8 @@ xfs_inode_free_eofblocks(
> >  	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY))
> >  		return 0;
> >  
> > -	if (eofb) {
> > -		if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
> > -			match = xfs_inode_match_id_union(ip, eofb);
> > -		else
> > -			match = xfs_inode_match_id(ip, eofb);
> > -		if (!match)
> > -			return 0;
> > -
> > -		/* skip the inode if the file size is too small */
> > -		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
> > -		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
> > -			return 0;
> > -	}
> > +	if (!xfs_inode_matches_eofb(ip, eofb))
> > +		return 0;
> >  
> >  	/*
> >  	 * If the caller is waiting, return -EAGAIN to keep the background
> > @@ -1766,25 +1780,13 @@ xfs_inode_free_cowblocks(
> >  {
> >  	struct xfs_eofblocks	*eofb = args;
> >  	uint			lock_mode = 0;
> > -	int			match;
> >  	int			ret = 0;
> >  
> >  	if (!xfs_prep_free_cowblocks(ip))
> >  		return 0;
> >  
> > -	if (eofb) {
> > -		if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
> > -			match = xfs_inode_match_id_union(ip, eofb);
> > -		else
> > -			match = xfs_inode_match_id(ip, eofb);
> > -		if (!match)
> > -			return 0;
> > -
> > -		/* skip the inode if the file size is too small */
> > -		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
> > -		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
> > -			return 0;
> > -	}
> > +	if (!xfs_inode_matches_eofb(ip, eofb))
> > +		return 0;
> >  
> >  	/*
> >  	 * Free the CoW blocks.  We don't need to lock the inode if we're in
> > 
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/12] xfs: free COW staging extents when freezing filesystem
  2019-01-17 17:24     ` Darrick J. Wong
@ 2019-01-17 18:14       ` Brian Foster
  2019-01-17 20:20         ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Foster @ 2019-01-17 18:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jan 17, 2019 at 09:24:28AM -0800, Darrick J. Wong wrote:
> On Fri, Jan 11, 2019 at 11:28:38AM -0500, Brian Foster wrote:
> > On Mon, Dec 31, 2018 at 06:16:57PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > When we're freezing the filesystem, free all the COW staging extents
> > > before we shut the log down so that we can minimize the amount of
> > > recovery work that will be necessary during the next mount.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_icache.c |   17 ++++++++++++-----
> > >  fs/xfs/xfs_super.c  |   18 ++++++++++++++++++
> > >  2 files changed, 30 insertions(+), 5 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 245483cc282b..36d986087abb 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -1741,6 +1741,7 @@ xfs_inode_free_cowblocks(
> > >  	void			*args)
> > >  {
> > >  	struct xfs_eofblocks	*eofb = args;
> > > +	uint			lock_mode = 0;
> > >  	int			match;
> > >  	int			ret = 0;
> > >  
> > > @@ -1761,9 +1762,15 @@ xfs_inode_free_cowblocks(
> > >  			return 0;
> > >  	}
> > >  
> > > -	/* Free the CoW blocks */
> > > -	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> > > -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > > +	/*
> > > +	 * Free the CoW blocks.  We don't need to lock the inode if we're in
> > > +	 * the process of freezing the filesystem because we've already locked
> > > +	 * out writes and page faults.
> > > +	 */
> > > +	if (ip->i_mount->m_super->s_writers.frozen == SB_UNFROZEN)
> > > +		lock_mode = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > 
> > Ok, but is there a problem with actually locking in this context? If so,
> > I'd expect the comment to say something like: "Free the COW blocks.
> > Don't lock the inode if we're in the process of freezing the filesystem
> > because <some bad thing happens>. This is safe because we've already
> > locked out writes and page faults."
> 
> Hmmm, it's now been so long since I wrote this patch I don't remember
> why I did this.  Wait something's coming back...
> 
> > > +	if (lock_mode)
> > > +		xfs_ilock(ip, lock_mode);
> > >  
> > >  	/*
> > >  	 * Check again, nobody else should be able to dirty blocks or change
> > > @@ -1772,8 +1779,8 @@ xfs_inode_free_cowblocks(
> > >  	if (xfs_prep_free_cowblocks(ip))
> > >  		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> > >  
> > > -	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > -	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> > > +	if (lock_mode)
> > > +		xfs_iunlock(ip, lock_mode);
> > >  
> > >  	return ret;
> > >  }
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 6903b36eac5d..bb4953cfd683 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -1102,6 +1102,24 @@ xfs_fs_sync_fs(
> > >  	if (!wait)
> > >  		return 0;
> > >  
> > > +	/*
> > > +	 * If we're in the middle of freezing the filesystem, this is our last
> > > +	 * chance to run regular transactions.  Clear all the COW staging
> > > +	 * extents so that we can freeze the filesystem with as little recovery
> > > +	 * work to do at the next mount as possible.  It's safe to do this
> > > +	 * without locking inodes because the freezer code flushed all the
> > > +	 * dirty data from the page cache and locked out writes and page faults.
> > > +	 */
> > 
> > Well, we have XFS_TRANS_NO_WRITECOUNT. It seems we could use an eofb
> > freeze flag or some such to notify the scan that we're in a special
> > freeze context and in turn use that to tweak the locking (instead of the
> > SB_FREEZE_* checks) and avoid blocking on transaction allocation. Would
> > that allow us to run this in xfs_fs_freeze() context?
> 
> ...oh, right.  Another thread could have taken the io/mmap locks (say
> for fallocate) before the fs entered SB_FROZEN, and is now blocked
> in xfs_trans_alloc waiting for the fs to unfreeze.  I can't remember the
> exact circumstances though.
> 

Ok, so we'd basically deadlock on the freeze IIUC. That's exactly what
we'd want to document in the comment. :P
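
Something like this, perhaps (wording is mine, based on your
recollection above):

	/*
	 * Don't take the iolock/mmaplock during a freeze: another task
	 * may already hold them while blocked in xfs_trans_alloc()
	 * waiting for the fs to unfreeze, so blocking here would
	 * deadlock the freeze.  This is safe because the freezer has
	 * already locked out writes and page faults.
	 */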

> > I'm also a little curious about the high-level tradeoff we're making.
> > This patch means that a freeze, and thus something like an LVM snapshot,
> > would clear/reset all COW blocks in the fs, right? If so, ISTM that
> > could be annoying if the user had a bunch of reflinked vdisk images or
> > something on the fs.
> 
> Yeah, it's sort of painful to drop all those speculative preallocations.
> 
> > Is the tradeoff just that if the user freezes and then doesn't unfreeze
> > before the next mount that log recovery will have to deal with COW
> > blocks, or are there considerations for a typical freeze/unfreeze cycle
> > as well? Is this more expensive at recovery time than at freeze time?
> 
> It's entirely to reduce the recovery work necessary after freezing the
> fs, snapshotting the fs (via lvm or whatever) and then mounting the
> snapshot separately.  TBH this patch has languished for years now
> because it's only necessary to optimize this one usage.
> 

So the snapshot will run log recovery and end up freeing all of these
extents "the hard way," right?

> Maybe it's easier to drop this one, unless anyone feels particularly
> passionate about clearing out cow staging extents before a freeze?
> 

Not really.

On principle, it seems more appropriate to me to put the work onto the
snapshot rather than disrupt the primary fs. I could see doing something
like this without much thought if the blocks were already freed or
something and the change was thus invisible to a user, but ISTM that
losing the preallocation can have a significant effect on the primary
workload.

Then again, this doesn't affect me much and I'm not terribly familiar
with the use case or whether there have been complaints or anything. I
don't feel strongly about it so long as the tradeoffs are documented
somewhere so it's clear what needs fixing if this behavior became
problematic down the road.

Brian

> --D
> 
> > Brian
> > 
> > > +	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > +		int		error;
> > > +
> > > +		error = xfs_icache_free_cowblocks(mp, NULL);
> > > +		if (error) {
> > > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > +			return error;
> > > +		}
> > > +	}
> > > +
> > >  	xfs_log_force(mp, XFS_LOG_SYNC);
> > >  	if (laptop_mode) {
> > >  		/*
> > > 

* Re: [PATCH 12/12] xfs: wait for deferred inactivation when destroying unlinked inodes
  2019-01-03 12:46   ` Dave Chinner
@ 2019-01-17 18:41     ` Darrick J. Wong
  2019-01-17 22:21       ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-17 18:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jan 03, 2019 at 11:46:44PM +1100, Dave Chinner wrote:
> On Mon, Dec 31, 2018 at 06:18:04PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Now that we've constructed a mechanism to batch background inode
> > inactivation work, we actually want in some cases to throttle the amount
> > of backlog work that the frontend can generate.  We do this by making
> > destroy_inode wait for inactivation when we're deleting things, assuming
> > that deleted inodes are dropped and destroyed in process context and not
> > from fs reclaim.
> 
> This would kill performance of highly concurrent unlink
> workloads.
> 
> That said, the current unlink code needs severe throttling because
> the unlinked inode list management does not scale at all well - get
> more than a couple of hundred inodes into the AGI unlinked lists,
> and xfs_iunlink_remove burns huge amounts of CPU.
> 
> So if it isn't throttled, it'll be just as slow, but burn huge
> amounts of extra CPU walking the unlinked lists.....

...so I've refactored the iunlink code to record "who points to this
inode?" references, which speeds up iunlink_remove substantially.  I'll
put that series out for review after letting it run on some real
workloads.
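
The rough shape of it, as a userspace toy (the hash table, function
names, and agino handling below are all invented for illustration; the
real series keeps this state per-AG and still logs the on-disk
di_next_unlinked updates through transactions):

/*
 * Toy model of the backref idea: for every inode on an AGI unlinked
 * bucket, remember which inode's next pointer references it.  Removal
 * then patches one known predecessor instead of walking the whole
 * on-disk chain buffer by buffer.
 */
#include <stdio.h>
#include <stdlib.h>

#define HASH_SIZE	64
#define NULLAGINO	0xffffffffU

struct backref {
	unsigned int	agino;		/* inode on the unlinked list */
	unsigned int	prev_agino;	/* who points at it; NULLAGINO
					 * means the AGI bucket head */
	struct backref	*next;		/* hash chain */
};

static struct backref *hash_table[HASH_SIZE];

static struct backref *backref_find(unsigned int agino)
{
	struct backref	*b;

	for (b = hash_table[agino % HASH_SIZE]; b != NULL; b = b->next)
		if (b->agino == agino)
			return b;
	return NULL;
}

/* Push @agino onto the front of a bucket whose current head is @old_head. */
static void backref_insert_head(unsigned int agino, unsigned int old_head)
{
	struct backref	*b = calloc(1, sizeof(*b));

	b->agino = agino;
	b->prev_agino = NULLAGINO;	/* bucket head points at us now */
	b->next = hash_table[agino % HASH_SIZE];
	hash_table[agino % HASH_SIZE] = b;
	if (old_head != NULLAGINO)
		backref_find(old_head)->prev_agino = agino;
}

int main(void)
{
	/* Build the bucket list 3 -> 2 -> 1 by inserting at the head. */
	backref_insert_head(1, NULLAGINO);
	backref_insert_head(2, 1);
	backref_insert_head(3, 2);

	/* One lookup finds the inode whose next pointer must change. */
	printf("predecessor of 1: %u\n", backref_find(1)->prev_agino);
	return 0;
}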

I've also dropped this patch from the series entirely, just to see what
happens.  The tradeoff here is that allocations see increased latency
upon hitting ENOSPC because we now force inactivation to see if we can
clean things out, but OTOH if we never hit ENOSPC then the rest of the
fs runs considerably faster.

> > +/*
> > + * Decide if this inode is a candidate for unlinked inactivation throttling.
> > + * We have to decide this prior to setting the NEED_INACTIVE iflag because
> > + * once we flag the inode for inactivation we can't access it any more.
> > + */
> > +enum xfs_iwait
> > +xfs_iwait_check(
> > +	struct xfs_inode	*ip)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	unsigned long long	x;
> > +	unsigned long long	y;
> > +	bool			rt = XFS_IS_REALTIME_INODE(ip);
> > +
> > +	/*
> > +	 * Don't wait unless we're doing a deletion inactivation.  We assume
> > +	 * that unlinked inodes that lose all their refcount are dropped,
> > +	 * evicted, and destroyed immediately in the context of the unlink()ing
> > +	 * process.
> > +	 */
> 
> I think this is wrong - we want to push unlinked inode processing
> into the background so we don't have to wait on it, not force
> inactivation of unlinked inodes to wait for other inodes to be
> inactivated.

The original idea behind most of this was that we'd try to slow down a
rm -rf so that the fs doesn't find itself facing a gigantic flood of
inactivation work, particularly if there were a lot of extents to free
or a lot of inodes to free.  Under this scheme we don't ever wait for
inactivation if we're just closing a linked file, but we could do so for
deletions.

However, it is difficult to quantify what "gigantic" means here.  The
situation I was trying to avoid is where the system gets bogged down
with background processing work for a long time after the userspace
process terminates, but perhaps it's better not to bother. :)

> > +	if (VFS_I(ip)->i_nlink > 0)
> > +		return XFS_IWAIT_NEVER;
> > +
> > +	/*
> > +	 * If we're being called from kswapd we're in background memory reclaim
> > +	 * context.  There's no point in making it wait for ondisk metadata
> > +	 * updates, which themselves require memory allocations.
> > +	 */
> > +	if (current->flags & PF_KSWAPD)
> > +		return XFS_IWAIT_NEVER;
> > +
> > +	/*
> > +	 * Always wait for directory removal so we clean up any files that
> > +	 * were in that directory.
> > +	 */
> > +	if (S_ISDIR(VFS_I(ip)->i_mode)) {
> > +		trace_xfs_inactive_iwait_all(ip);
> > +		return XFS_IWAIT_ALL;
> > +	}
> 
> why do we need to wait for directories? it's already unlinked, so if
> we crash it and everything in it are not going to reappear....

Same reasoning as above.

> > +	/* Heavily fragmented files take a while to delete. */
> > +	x = XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK) +
> > +	    XFS_IFORK_NEXTENTS(ip, XFS_ATTR_FORK) +
> > +	    XFS_IFORK_NEXTENTS(ip, XFS_COW_FORK);
> > +	y = rt ? 256 : 32 * mp->m_sb.sb_agcount;
> > +	if (x >= y) {
> > +		trace_xfs_inactive_iwait_inode(ip);
> > +		return XFS_IWAIT_INODE;
> > +	}
> 
> Why does the number of extents in the current inode determine
> whether we wait for completion of all other inode inactivations in
> the same AG?

<urk> This might've grown into a way to force inactivation of a single
inode, but for now it's dropped.

> > +
> > +	return XFS_IWAIT_UNDECIDED;
> > +}
> > +
> > +/*
> > + * Wait for deferred inode inactivation of an unlinked inode being destroyed.
> > + *
> > + * The deferred inode inactivation mechanism provides for background batching
> > + * of whatever on-disk metadata updates are necessary to free an inode and all
> > + * the resources it holds.  In theory this should speed up deletion by enabling
> > + * us to inactivate in inode number order.
> > + *
> > + * However, there are a few situations where we actually /want/ to throttle
> > + * unlinking.  Specifically, if we're unlinking fragmented files or removing
> > + * entire directory trees, we should wait instead of allowing an enormous
> > + * processing backlog that causes update storms later.
> > + *
> > + * We will wait for inactivation to finish under the following circumstances:
> > + *  - Removing a directory
> > + *  - Removing a heavily fragmented file
> > + *  - A large number of blocks could be freed by inactivation
> > + *  - A large number of inodes could be freed by inactivation
> > + */
> > +void
> > +xfs_inactive_wait(
> > +	struct xfs_mount	*mp,
> > +	enum xfs_iwait		iwait,
> > +	xfs_agnumber_t		agno)
> > +{
> > +	unsigned long long	x;
> > +	unsigned long long	y;
> > +
> > +	switch (iwait) {
> > +	case XFS_IWAIT_NEVER:
> > +		return;
> > +	case XFS_IWAIT_ALL:
> > +	case XFS_IWAIT_INODE:
> > +		goto wait;
> > +	default:
> > +		break;
> > +	}
> > +
> > +	iwait = XFS_IWAIT_ALL;
> > +
> > +	/* More than 1/4 of an AG space could be freed by inactivation. */
> > +	x = percpu_counter_read_positive(&mp->m_dinactive);
> > +	y = mp->m_sb.sb_agblocks / 4;
> > +	if (x >= y)
> > +		goto wait;
> 
> But that's a global counter - how does that relate to individual AG
> space at all?
> 
> > +
> > +	/* Less than 1/16 of the datadev is free. */
> > +	x = percpu_counter_read_positive(&mp->m_fdblocks);
> > +	y = mp->m_sb.sb_dblocks / 16;
> > +	if (x <= y)
> > +		goto wait;
> 
> urk. If I have a 100TB filesystem, that means we always throttle with 6-7TB
> of free space available. That's not right.
> 
> > +	/* More than 1/4 of the rtdev could be freed by inactivation. */
> > +	y = mp->m_sb.sb_rblocks;
> > +	if (y > 0) {
> > +		x = percpu_counter_read_positive(&mp->m_rinactive);
> > +		if (x >= y / 4)
> > +			goto wait;
> 
> That's even worse - this might not even be a rt inode... :/

<nod> I think I've managed to flush out all the places where we bail out
to userspace with ENOSPC and amend them to force inactivation and try
once more, so these shouldn't be necessary.
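
The pattern, reduced to a standalone sketch (try_allocate() and
force_inactivation() are stand-ins I made up; the real series calls
into the deferred inactivation machinery before retrying):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static bool inactivation_backlog = true;

static int try_allocate(void)
{
	/* Pretend the allocator fails until queued work is drained. */
	return inactivation_backlog ? -ENOSPC : 0;
}

static void force_inactivation(void)
{
	/*
	 * Stand-in for draining the deferred inactivation backlog so
	 * the blocks it frees land back in the free space counters.
	 */
	inactivation_backlog = false;
}

static int allocate_with_retry(void)
{
	int	error = try_allocate();

	if (error != -ENOSPC)
		return error;

	/*
	 * Drain the backlog, then retry exactly once so we cannot
	 * loop forever when the fs really is out of space.
	 */
	force_inactivation();
	return try_allocate();
}

int main(void)
{
	printf("allocation returned %d\n", allocate_with_retry());
	return 0;
}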

> > +
> > +		/* Less than 1/16 of the rtdev is free. */
> > +		x = mp->m_sb.sb_frextents * mp->m_sb.sb_rextsize;
> > +		if (x <= y / 16)
> > +			goto wait;
> > +	}
> > +
> > +	/* A lot of inodes could be freed by inactivation. */
> > +	x = percpu_counter_read_positive(&mp->m_iinactive);
> > +	y = XFS_INODES_PER_CHUNK * 4 * (unsigned long long)mp->m_sb.sb_agcount;
> > +	if (x >= y)
> > +		goto wait;
> 
> And, yeah, that's just wrong once we fix the unlinked inode
> scalability problem, as the patch below does.
> 
> 
> > +	return;
> > +wait:
> > +	switch (iwait) {
> > +	case XFS_IWAIT_ALL:
> > +		xfs_inactive_force(mp);
> > +		break;
> > +	case XFS_IWAIT_INODE:
> > +		xfs_inactive_force_ag(mp, agno);
> > +		break;
> > +	default:
> > +		ASSERT(0);
> 
> So we assert here if we get XFS_IWAIT_UNDECIDED?
> 
> >  	 * reclaim tear down all inodes.
> >  	 */
> >  	xfs_inode_set_reclaim_tag(ip, need_inactive);
> > +
> > +	/*
> > +	 * Wait for inactivation of this inode if the inode has zero nlink.
> > +	 * This cannot be done in fs reclaim context, which means we assume
> > +	 * that unlinked inodes that lose all their refcount are dropped,
> > +	 * evicted, and destroyed immediately in the context of the unlink()ing
> > +	 * process and are never fed to the LRU for reclaim.
> > +	 */
> 
> Apparently overlay breaks this assumption - AFAIA it can drop the last
> inode reference on unlinked inodes from its dcache shrinker.

Heh.  Ok, maybe we simply never wait for inactivation then.

> > +	xfs_inactive_wait(mp, caniwait, agno);
> 
> So, prototype incore unlinked inode list patch is below. It needs
> log recovery help - the incore list needs to be rebuilt in log
> recovery so xfs_iunlink_remove works correctly. It also
> short-circuits some of the above "should wait for inactivation"
> checks and so some of the performance regressions are mostly
> removed...

As noted earlier, I took over this from Dave.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: incore unlinked inode list
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Introduce an incore per-ag unlinked inode list to avoid needing to
> walk the buffer based unlinked lists to find the inodes to modify
> when removing an inode from the unlinked list.  When the unlinked
> lists grow long, finding an inode on the unlinked list requires
> mapping the inode number to a buffer, reading the buffer, extracting
> the inode from the buffer and reading the next inode number on the
> list. This is a lot of overhead, especially the buffer lookups. If
> we keep an in-core copy of the unlinked list, we don't need to do
> the buffer traversal to find the previous buffer in the list for
> unlinking; it's just the previous inode in the incore list.
> 
> The incore list is rooted in the xfs_perag, and mirrors the same
> hash structure in the AGI. The order of the lists is identical to
> the ordering on disk. Hence we can look up the incore list and then
> find the buffer we need to modify to remove inodes from the unlinked
> list.
> 
> Discussion: in this implementation, the incore list nests inside the
> AGI buffer locks, so the AGI buffer lock effectively serialises all
> unlinked list operations on the AGI whether it is needed or not. If
> we change the order of the locking and use the incore list as the
> canonical source of unlinked inodes, then we only need to get the
> AGI lock when we are modifying the AGI, not all modifications to the
> unlinked list. This could allow some level of parallelisation of
> unlinked list operations even within the same AG....
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c |   9 ++
>  fs/xfs/xfs_inode.c  | 318 +++++++++++++++++++++++++---------------------------
>  fs/xfs/xfs_inode.h  |   2 +
>  fs/xfs/xfs_mount.c  |   7 ++
>  fs/xfs/xfs_mount.h  |   5 +
>  5 files changed, 177 insertions(+), 164 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index a34f6101526b..5a28173b5ecd 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1952,6 +1952,7 @@ xfs_inactive_wait(
>  	enum xfs_iwait		iwait,
>  	xfs_agnumber_t		agno)
>  {
> +	struct xfs_perag	*pag;
>  	unsigned long long	x;
>  	unsigned long long	y;
>  
> @@ -1993,10 +1994,18 @@ xfs_inactive_wait(
>  	}
>  
>  	/* A lot of inodes could be freed by inactivation. */
> +#if 0
>  	x = percpu_counter_read_positive(&mp->m_iinactive);
>  	y = XFS_INODES_PER_CHUNK * 4 * (unsigned long long)mp->m_sb.sb_agcount;
>  	if (x >= y)
>  		goto wait;
> +	pag = xfs_perag_get(mp, agno);
> +	if (pag->pagi_unlinked_count > 10000) {
> +		xfs_perag_put(pag);
> +		goto wait;
> +	}
> +	xfs_perag_put(pag);
> +#endif
>  
>  	return;
>  wait:
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 2c7f8aeffa92..a4b6f3e597f8 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2053,6 +2053,48 @@ xfs_inactive(
>  	xfs_qm_dqdetach(ip);
>  }
>  
> +/*
> + * Modify the next unlinked inode pointer on disk. Returns the old pointer value
> + * or a negative error.
> + *
> + * XXX: really want a new logical inode transaction flag for this so we don't
> + * ever need to touch buffers again here. That will change locking requirements,
> + * though.
> + */
> +static int
> +xfs_iunlink_mod(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	xfs_agino_t		next_agino,
> +	xfs_agino_t		*old_agino)
> +{
> +	struct xfs_mount *mp = tp->t_mountp;
> +	struct xfs_dinode *dip;
> +	struct xfs_buf	*ibp;
> +	int		offset;
> +	int		error;
> +
> +	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0, 0);
> +	if (error) {
> +		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
> +			__func__, error);
> +		return error;
> +	}
> +	if (old_agino)
> +		*old_agino = be32_to_cpu(dip->di_next_unlinked);
> +	dip->di_next_unlinked = cpu_to_be32(next_agino);
> +	offset = ip->i_imap.im_boffset +
> +				offsetof(struct xfs_dinode, di_next_unlinked);
> +
> +	/* need to recalc the inode CRC if appropriate */
> +	xfs_dinode_calc_crc(mp, dip);
> +
> +	xfs_trans_inode_buf(tp, ibp);
> +	xfs_trans_log_buf(tp, ibp, offset, (offset + sizeof(xfs_agino_t) - 1));
> +	xfs_inobp_check(mp, ibp);
> +	return 0;
> +}
> +
>  /*
>   * This is called when the inode's link count goes to 0 or we are creating a
>   * tmpfile via O_TMPFILE. In the case of a tmpfile, @ignore_linkcount will be
> @@ -2068,23 +2110,32 @@ xfs_iunlink(
>  	struct xfs_trans *tp,
>  	struct xfs_inode *ip)
>  {
> -	xfs_mount_t	*mp = tp->t_mountp;
> -	xfs_agi_t	*agi;
> -	xfs_dinode_t	*dip;
> -	xfs_buf_t	*agibp;
> -	xfs_buf_t	*ibp;
> +	struct xfs_mount *mp = tp->t_mountp;
> +	struct xfs_perag *pag;
> +	struct xfs_agi	*agi;
> +	struct xfs_buf	*agibp;
>  	xfs_agino_t	agino;
> +	xfs_agino_t	next_agino;
> +	xfs_agnumber_t	agno;
> +	struct list_head *unlinked_list;
>  	short		bucket_index;
>  	int		offset;
>  	int		error;
>  
>  	ASSERT(VFS_I(ip)->i_mode != 0);
>  
> +	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> +	ASSERT(agino != 0);
> +	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
> +
> +	agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	pag = xfs_perag_get(mp, agno);
> +	unlinked_list = &pag->pagi_unlinked_list[bucket_index];
> +
>  	/*
> -	 * Get the agi buffer first.  It ensures lock ordering
> -	 * on the list.
> +	 * Get the agi buffer first.  It ensures lock ordering on the list.
>  	 */
> -	error = xfs_read_agi(mp, tp, XFS_INO_TO_AGNO(mp, ip->i_ino), &agibp);
> +	error = xfs_read_agi(mp, tp, agno, &agibp);
>  	if (error)
> -		return error;
> +		goto out;
>  	agi = XFS_BUF_TO_AGI(agibp);
> @@ -2093,48 +2144,57 @@ xfs_iunlink(
>  	 * Get the index into the agi hash table for the
>  	 * list this inode will go on.
>  	 */
> -	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> -	ASSERT(agino != 0);
> -	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
> -	ASSERT(agi->agi_unlinked[bucket_index]);
> -	ASSERT(be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino);
> +	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	ASSERT(next_agino != 0);
> +	ASSERT(next_agino != agino);
> +
> +	mutex_lock(&pag->pagi_unlink_lock);
> +	if (next_agino != NULLAGINO) {
> +		xfs_agino_t	old_agino;
> +#ifdef DEBUG
> +		struct xfs_inode *nip;
> +
> +		ASSERT(!list_empty(unlinked_list));
> +		nip = list_first_entry(unlinked_list, struct xfs_inode,
> +					i_unlinked_list);
> +		ASSERT(XFS_INO_TO_AGINO(mp, nip->i_ino) == next_agino);
> +#endif
>  
> -	if (agi->agi_unlinked[bucket_index] != cpu_to_be32(NULLAGINO)) {
>  		/*
>  		 * There is already another inode in the bucket we need
>  		 * to add ourselves to.  Add us at the front of the list.
>  		 * Here we put the head pointer into our next pointer,
>  		 * and then we fall through to point the head at us.
>  		 */
> -		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
> -				       0, 0);
> -		if (error)
> -			return error;
> -
> -		ASSERT(dip->di_next_unlinked == cpu_to_be32(NULLAGINO));
> -		dip->di_next_unlinked = agi->agi_unlinked[bucket_index];
> -		offset = ip->i_imap.im_boffset +
> -			offsetof(xfs_dinode_t, di_next_unlinked);
> -
> -		/* need to recalc the inode CRC if appropriate */
> -		xfs_dinode_calc_crc(mp, dip);
> -
> -		xfs_trans_inode_buf(tp, ibp);
> -		xfs_trans_log_buf(tp, ibp, offset,
> -				  (offset + sizeof(xfs_agino_t) - 1));
> -		xfs_inobp_check(mp, ibp);
> +		error = xfs_iunlink_mod(tp, ip, next_agino, &old_agino);
> +		if (error) {
> +			mutex_unlock(&pag->pagi_unlink_lock);
> +			goto out;
> +		}
> +		ASSERT(old_agino == NULLAGINO);
> +	} else {
> +		ASSERT(list_empty(unlinked_list));
>  	}
>  
> +	/*
> +	 * As we are always adding the inode at the head of the AGI list,
> +	 * it always gets added to the head here.
> +	 */
> +	list_add(&ip->i_unlinked_list, unlinked_list);
> +	mutex_unlock(&pag->pagi_unlink_lock);
> +
>  	/*
>  	 * Point the bucket head pointer at the inode being inserted.
>  	 */
> -	ASSERT(agino != 0);
>  	agi->agi_unlinked[bucket_index] = cpu_to_be32(agino);
>  	offset = offsetof(xfs_agi_t, agi_unlinked) +
>  		(sizeof(xfs_agino_t) * bucket_index);
>  	xfs_trans_log_buf(tp, agibp, offset,
>  			  (offset + sizeof(xfs_agino_t) - 1));
> -	return 0;
> +	pag->pagi_unlinked_count++;
> +out:
> +	xfs_perag_put(pag);
> +	return error;
>  }
>  
>  /*
> @@ -2145,81 +2205,72 @@ xfs_iunlink_remove(
>  	xfs_trans_t	*tp,
>  	xfs_inode_t	*ip)
>  {
> -	xfs_ino_t	next_ino;
> -	xfs_mount_t	*mp;
> -	xfs_agi_t	*agi;
> -	xfs_dinode_t	*dip;
> -	xfs_buf_t	*agibp;
> -	xfs_buf_t	*ibp;
> +	struct xfs_mount *mp = tp->t_mountp;
> +	struct xfs_perag *pag;
> +	struct xfs_agi	*agi;
> +	struct xfs_buf	*agibp;
>  	xfs_agnumber_t	agno;
>  	xfs_agino_t	agino;
> +	xfs_agino_t	old_agino;
>  	xfs_agino_t	next_agino;
> -	xfs_buf_t	*last_ibp;
> -	xfs_dinode_t	*last_dip = NULL;
> +	struct list_head *unlinked_list;
>  	short		bucket_index;
> -	int		offset, last_offset = 0;
>  	int		error;
>  
> -	mp = tp->t_mountp;
>  	agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> +	if (!xfs_verify_agino(mp, agno, agino))
> +		return -EFSCORRUPTED;
> +
> +	pag = xfs_perag_get(mp, agno);
> +	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
> +	unlinked_list = &pag->pagi_unlinked_list[bucket_index];
>  
>  	/*
> -	 * Get the agi buffer first.  It ensures lock ordering
> -	 * on the list.
> +	 * Get the agi buffer first.  It ensures lock ordering on the list.
>  	 */
>  	error = xfs_read_agi(mp, tp, agno, &agibp);
>  	if (error)
>  		return error;
> -
>  	agi = XFS_BUF_TO_AGI(agibp);
>  
> -	/*
> -	 * Get the index into the agi hash table for the
> -	 * list this inode will go on.
> -	 */
> -	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> -	if (!xfs_verify_agino(mp, agno, agino))
> -		return -EFSCORRUPTED;
> -	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
> -	if (!xfs_verify_agino(mp, agno,
> -			be32_to_cpu(agi->agi_unlinked[bucket_index]))) {
> +	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	if (!xfs_verify_agino(mp, agno, next_agino)) {
>  		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>  				agi, sizeof(*agi));
>  		return -EFSCORRUPTED;
>  	}
>  
> -	if (be32_to_cpu(agi->agi_unlinked[bucket_index]) == agino) {
> +	mutex_lock(&pag->pagi_unlink_lock);
> +	if (unlinked_list->next == &ip->i_unlinked_list) {
> +		int	offset;
> +
> +		ASSERT(next_agino == agino);
>  		/*
> -		 * We're at the head of the list.  Get the inode's on-disk
> -		 * buffer to see if there is anyone after us on the list.
> -		 * Only modify our next pointer if it is not already NULLAGINO.
> -		 * This saves us the overhead of dealing with the buffer when
> -		 * there is no need to change it.
> +		 * We're at the head of the list. Check if there is anyone
> +		 * after us on the list, and if so modify the on disk pointers
> +		 * to remove us from the list.
>  		 */
> -		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
> -				       0, 0);
> -		if (error) {
> -			xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
> -				__func__, error);
> -			return error;
> -		}
> -		next_agino = be32_to_cpu(dip->di_next_unlinked);
> -		ASSERT(next_agino != 0);
> -		if (next_agino != NULLAGINO) {
> -			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
> -			offset = ip->i_imap.im_boffset +
> -				offsetof(xfs_dinode_t, di_next_unlinked);
> -
> -			/* need to recalc the inode CRC if appropriate */
> -			xfs_dinode_calc_crc(mp, dip);
> -
> -			xfs_trans_inode_buf(tp, ibp);
> -			xfs_trans_log_buf(tp, ibp, offset,
> -					  (offset + sizeof(xfs_agino_t) - 1));
> -			xfs_inobp_check(mp, ibp);
> -		} else {
> -			xfs_trans_brelse(tp, ibp);
> +		next_agino = NULLAGINO;
> +		list_del_init(&ip->i_unlinked_list);
> +		if (!list_empty(unlinked_list)) {
> +			struct xfs_inode *nip;
> +
> +			/*
> +			 * If there's more entries on the list, clear the on-disk
> +			 * unlinked list pointer in the inode, then get the
> +			 * pointer to the new list head.
> +			 */
> +			error = xfs_iunlink_mod(tp, ip, NULLAGINO, &old_agino);
> +			if (error)
> +				goto out_unlock;
> +
> +			nip = list_first_entry(unlinked_list, struct xfs_inode,
> +						i_unlinked_list);
> +			next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
> +			ASSERT(old_agino == next_agino);
>  		}
> +
>  		/*
>  		 * Point the bucket head pointer at the next inode.
>  		 */
> @@ -2230,92 +2281,31 @@ xfs_iunlink_remove(
>  			(sizeof(xfs_agino_t) * bucket_index);
>  		xfs_trans_log_buf(tp, agibp, offset,
>  				  (offset + sizeof(xfs_agino_t) - 1));
> -	} else {
> -		/*
> -		 * We need to search the list for the inode being freed.
> -		 */
> -		next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> -		last_ibp = NULL;
> -		while (next_agino != agino) {
> -			struct xfs_imap	imap;
>  
> -			if (last_ibp)
> -				xfs_trans_brelse(tp, last_ibp);
>  
> -			imap.im_blkno = 0;
> -			next_ino = XFS_AGINO_TO_INO(mp, agno, next_agino);
> -
> -			error = xfs_imap(mp, tp, next_ino, &imap, 0);
> -			if (error) {
> -				xfs_warn(mp,
> -	"%s: xfs_imap returned error %d.",
> -					 __func__, error);
> -				return error;
> -			}
> -
> -			error = xfs_imap_to_bp(mp, tp, &imap, &last_dip,
> -					       &last_ibp, 0, 0);
> -			if (error) {
> -				xfs_warn(mp,
> -	"%s: xfs_imap_to_bp returned error %d.",
> -					__func__, error);
> -				return error;
> -			}
> -
> -			last_offset = imap.im_boffset;
> -			next_agino = be32_to_cpu(last_dip->di_next_unlinked);
> -			if (!xfs_verify_agino(mp, agno, next_agino)) {
> -				XFS_CORRUPTION_ERROR(__func__,
> -						XFS_ERRLEVEL_LOW, mp,
> -						last_dip, sizeof(*last_dip));
> -				return -EFSCORRUPTED;
> -			}
> -		}
> -
> -		/*
> -		 * Now last_ibp points to the buffer previous to us on the
> -		 * unlinked list.  Pull us from the list.
> -		 */
> -		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
> -				       0, 0);
> -		if (error) {
> -			xfs_warn(mp, "%s: xfs_imap_to_bp(2) returned error %d.",
> -				__func__, error);
> -			return error;
> -		}
> -		next_agino = be32_to_cpu(dip->di_next_unlinked);
> -		ASSERT(next_agino != 0);
> -		ASSERT(next_agino != agino);
> -		if (next_agino != NULLAGINO) {
> -			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
> -			offset = ip->i_imap.im_boffset +
> -				offsetof(xfs_dinode_t, di_next_unlinked);
> -
> -			/* need to recalc the inode CRC if appropriate */
> -			xfs_dinode_calc_crc(mp, dip);
> -
> -			xfs_trans_inode_buf(tp, ibp);
> -			xfs_trans_log_buf(tp, ibp, offset,
> -					  (offset + sizeof(xfs_agino_t) - 1));
> -			xfs_inobp_check(mp, ibp);
> -		} else {
> -			xfs_trans_brelse(tp, ibp);
> -		}
> +	} else {
>  		/*
> -		 * Point the previous inode on the list to the next inode.
> +		 * Get the previous inode in the list, point it at the next
> +		 * one in the list.
>  		 */
> -		last_dip->di_next_unlinked = cpu_to_be32(next_agino);
> -		ASSERT(next_agino != 0);
> -		offset = last_offset + offsetof(xfs_dinode_t, di_next_unlinked);
> +		struct xfs_inode *lip = list_entry(ip->i_unlinked_list.prev,
> +					struct xfs_inode, i_unlinked_list);
>  
> -		/* need to recalc the inode CRC if appropriate */
> -		xfs_dinode_calc_crc(mp, last_dip);
> +		list_del_init(&ip->i_unlinked_list);
> +		error = xfs_iunlink_mod(tp, ip, NULLAGINO, &next_agino);
> +		if (error)
> +			goto out_unlock;
> +
> +		error = xfs_iunlink_mod(tp, lip, next_agino, &old_agino);
> +		if (error)
> +			goto out_unlock;
> +		ASSERT(old_agino == agino);
>  
> -		xfs_trans_inode_buf(tp, last_ibp);
> -		xfs_trans_log_buf(tp, last_ibp, offset,
> -				  (offset + sizeof(xfs_agino_t) - 1));
> -		xfs_inobp_check(mp, last_ibp);
>  	}
> +	pag->pagi_unlinked_count--;
> +out_unlock:
> +	mutex_unlock(&pag->pagi_unlink_lock);
> +	xfs_perag_put(pag);
> -	return 0;
> +	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index e95e30a94243..d00c6e2b806b 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -47,6 +47,7 @@ typedef struct xfs_inode {
>  	atomic_t		i_pincount;	/* inode pin count */
>  	spinlock_t		i_flags_lock;	/* inode i_flags lock */
>  	/* Miscellaneous state. */
> +	struct list_head	i_unlinked_list;
>  	unsigned long		i_flags;	/* see defined flags below */
>  	unsigned int		i_delayed_blks;	/* count of delay alloc blks */
>  
> @@ -55,6 +56,7 @@ typedef struct xfs_inode {
>  	xfs_extnum_t		i_cnextents;	/* # of extents in cow fork */
>  	unsigned int		i_cformat;	/* format of cow fork */
>  
> +
>  	/* VFS inode */
>  	struct inode		i_vnode;	/* embedded VFS inode */
>  } xfs_inode_t;
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index aa953d2018dd..591e52e93236 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -185,6 +185,7 @@ xfs_initialize_perag(
>  	xfs_agnumber_t	first_initialised = NULLAGNUMBER;
>  	xfs_perag_t	*pag;
>  	int		error = -ENOMEM;
> +	int		i;
>  
>  	/*
>  	 * Walk the current per-ag tree so we don't try to initialise AGs
> @@ -230,6 +231,12 @@ xfs_initialize_perag(
>  		/* first new pag is fully initialized */
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
> +
> +		/* in-core inode unlinked lists */
> +		mutex_init(&pag->pagi_unlink_lock);
> +		for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++)
> +			INIT_LIST_HEAD(&pag->pagi_unlinked_list[i]);
> +
>  	}
>  
>  	index = xfs_set_inode_alloc(mp, agcount);
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 34f2cf96ec27..9c4e9c9ceb94 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -402,6 +402,11 @@ typedef struct xfs_perag {
>  
>  	/* reference count */
>  	uint8_t			pagf_refcount_level;
> +
> +	/* inode unlinked lists */
> +	struct mutex		pagi_unlink_lock;
> +	struct list_head	pagi_unlinked_list[XFS_AGI_UNLINKED_BUCKETS];
> +	uint32_t		pagi_unlinked_count;
>  } xfs_perag_t;
>  
>  static inline struct xfs_ag_resv *

* Re: [PATCH 01/12] xfs: free COW staging extents when freezing filesystem
  2019-01-17 18:14       ` Brian Foster
@ 2019-01-17 20:20         ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2019-01-17 20:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 17, 2019 at 01:14:06PM -0500, Brian Foster wrote:
> On Thu, Jan 17, 2019 at 09:24:28AM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 11, 2019 at 11:28:38AM -0500, Brian Foster wrote:
> > > On Mon, Dec 31, 2018 at 06:16:57PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > When we're freezing the filesystem, free all the COW staging extents
> > > > before we shut the log down so that we can minimize the amount of
> > > > recovery work that will be necessary during the next mount.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/xfs_icache.c |   17 ++++++++++++-----
> > > >  fs/xfs/xfs_super.c  |   18 ++++++++++++++++++
> > > >  2 files changed, 30 insertions(+), 5 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index 245483cc282b..36d986087abb 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > > @@ -1741,6 +1741,7 @@ xfs_inode_free_cowblocks(
> > > >  	void			*args)
> > > >  {
> > > >  	struct xfs_eofblocks	*eofb = args;
> > > > +	uint			lock_mode = 0;
> > > >  	int			match;
> > > >  	int			ret = 0;
> > > >  
> > > > @@ -1761,9 +1762,15 @@ xfs_inode_free_cowblocks(
> > > >  			return 0;
> > > >  	}
> > > >  
> > > > -	/* Free the CoW blocks */
> > > > -	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> > > > -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > > > +	/*
> > > > +	 * Free the CoW blocks.  We don't need to lock the inode if we're in
> > > > +	 * the process of freezing the filesystem because we've already locked
> > > > +	 * out writes and page faults.
> > > > +	 */
> > > > +	if (ip->i_mount->m_super->s_writers.frozen == SB_UNFROZEN)
> > > > +		lock_mode = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > 
> > > Ok, but is there a problem with actually locking in this context? If so,
> > > I'd expect the comment to say something like: "Free the COW blocks.
> > > Don't lock the inode if we're in the process of freezing the filesystem
> > > because <some bad thing happens>. This is safe because we've already
> > > locked out writes and page faults."
> > 
> > Hmmm, it's now been so long since I wrote this patch I don't remember
> > why I did this.  Wait, something's coming back...
> > 
> > > > +	if (lock_mode)
> > > > +		xfs_ilock(ip, lock_mode);
> > > >  
> > > >  	/*
> > > >  	 * Check again, nobody else should be able to dirty blocks or change
> > > > @@ -1772,8 +1779,8 @@ xfs_inode_free_cowblocks(
> > > >  	if (xfs_prep_free_cowblocks(ip))
> > > >  		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> > > >  
> > > > -	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > > -	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> > > > +	if (lock_mode)
> > > > +		xfs_iunlock(ip, lock_mode);
> > > >  
> > > >  	return ret;
> > > >  }
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index 6903b36eac5d..bb4953cfd683 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -1102,6 +1102,24 @@ xfs_fs_sync_fs(
> > > >  	if (!wait)
> > > >  		return 0;
> > > >  
> > > > +	/*
> > > > +	 * If we're in the middle of freezing the filesystem, this is our last
> > > > +	 * chance to run regular transactions.  Clear all the COW staging
> > > > +	 * extents so that we can freeze the filesystem with as little recovery
> > > > +	 * work to do at the next mount as possible.  It's safe to do this
> > > > +	 * without locking inodes because the freezer code flushed all the
> > > > +	 * dirty data from the page cache and locked out writes and page faults.
> > > > +	 */
> > > 
> > > Well, we have XFS_TRANS_NO_WRITECOUNT. It seems we could use an eofb
> > > freeze flag or some such to notify the scan that we're in a special
> > > freeze context and in turn use that to tweak the locking (instead of the
> > > SB_FREEZE_* checks) and avoid blocking on transaction allocation. Would
> > > that allow us to run this in xfs_fs_freeze() context?
> > 
> > ...oh, right.  Another thread could have taken the io/mmap locks (say
> > for fallocate) before the fs entered SB_FROZEN, and is now blocked
> > in xfs_trans_alloc waiting for the fs to unfreeze.  I can't remember the
> > exact circumstances though.
> > 
> 
> Ok, so we'd basically deadlock on the freeze IIUC. That's exactly what
> we'd want to document in the comment. :P
> 
> > > I'm also a little curious about the high-level tradeoff we're making.
> > > This patch means that a freeze, and thus something like an LVM snapshot,
> > > would clear/reset all COW blocks in the fs, right? If so, ISTM that
> > > could be annoying if the user had a bunch of reflinked vdisk images or
> > > something on the fs.
> > 
> > Yeah, it's sort of painful to drop all those speculative preallocations.
> > 
> > > Is the tradeoff just that if the user freezes and then doesn't unfreeze
> > > before the next mount that log recovery will have to deal with COW
> > > blocks, or are there considerations for a typical freeze/unfreeze cycle
> > > as well? Is this more expensive at recovery time than at freeze time?
> > 
> > It's entirely to reduce the recovery work necessary after freezing the
> > fs, snapshotting the fs (via lvm or whatever) and then mounting the
> > snapshot separately.  TBH this patch has languished for years now
> > because it's only necessary to optimize this one usage.
> > 
> 
> So the snapshot will run log recovery and end up freeing all of these
> extents "the hard way," right?

Yes.  I suspect it's faster to let log recovery clean it up because it
doesn't have to walk the cowblocks inodes to free cow extents piece by
piece.

> > Maybe it's easier to drop this one, unless anyone feels particularly
> > passionate about clearing out cow staging extents before a freeze?
> > 
> 
> Not really.
> 
> On principle, it seems more appropriate to me to put the work onto the
> snapshot rather than disrupt the primary fs. I could see doing something
> like this without much thought if the blocks were already freed or
> something and the change was thus invisible to a user, but ISTM that
> losing the preallocation can have a significant effect on the primary
> workload.

Agreed.

> Then again, this doesn't affect me much and I'm not terribly familiar
> with the use case or whether there have been complaints or anything. I
> don't feel strongly about it so long as the tradeoffs are documented
> somewhere so it's clear what needs fixing if this behavior became
> problematic down the road.

Welll... if nobody else yells I'm happy to set this patch aside and let
the code be preserved in mailing list lore for a while.

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > +	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > +		int		error;
> > > > +
> > > > +		error = xfs_icache_free_cowblocks(mp, NULL);
> > > > +		if (error) {
> > > > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > +			return error;
> > > > +		}
> > > > +	}
> > > > +
> > > >  	xfs_log_force(mp, XFS_LOG_SYNC);
> > > >  	if (laptop_mode) {
> > > >  		/*
> > > > 

* Re: [PATCH 12/12] xfs: wait for deferred inactivation when destroying unlinked inodes
  2019-01-17 18:41     ` Darrick J. Wong
@ 2019-01-17 22:21       ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2019-01-17 22:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jan 17, 2019 at 10:41:58AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 03, 2019 at 11:46:44PM +1100, Dave Chinner wrote:
> > On Mon, Dec 31, 2018 at 06:18:04PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Now that we've constructed a mechanism to batch background inode
> > > inactivation work, we actually want in some cases to throttle the amount
> > > of backlog work that the frontend can generate.  We do this by making
> > > destroy_inode wait for inactivation when we're deleting things, assuming
> > > that deleted inodes are dropped and destroyed in process context and not
> > > from fs reclaim.
> > 
> > This would kill performance of highly concurrent unlink
> > workloads.
> > 
> > That said, the current unlink code needs severe throttling because
> > the unlinked inode list management does not scale at all well - get
> > more than a couple of hundred inodes into the AGI unlinked lists,
> > and xfs_iunlink_remove burns huge amounts of CPU.
> > 
> > So if it isn't throttled, it'll be just as slow, but burn huge
> > amounts of extra CPU walking the unlinked lists.....
> 
> ...so I've refactored the iunlink code to record "who points to this
> inode?" references, which speeds up iunlink_remove substantially.  I'll
> put that series out for review after letting it run on some real
> workloads.
> 
> I've also dropped this patch from the series entirely, just to see what
> happens.  The tradeoff here is that allocations see increased latency
> upon hitting ENOSPC because we now force inactivation to see if we can
> clean things out, but OTOH if we never hit ENOSPC then the rest of the
> fs runs considerably faster.

I think that's a valid trade-off - we make it in several other
places, so we expect things to slow down near/at ENOSPC.

> > > +/*
> > > + * Decide if this inode is a candidate for unlinked inactivation throttling.
> > > + * We have to decide this prior to setting the NEED_INACTIVE iflag because
> > > + * once we flag the inode for inactivation we can't access it any more.
> > > + */
> > > +enum xfs_iwait
> > > +xfs_iwait_check(
> > > +	struct xfs_inode	*ip)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	unsigned long long	x;
> > > +	unsigned long long	y;
> > > +	bool			rt = XFS_IS_REALTIME_INODE(ip);
> > > +
> > > +	/*
> > > +	 * Don't wait unless we're doing a deletion inactivation.  We assume
> > > +	 * that unlinked inodes that lose all their refcount are dropped,
> > > +	 * evicted, and destroyed immediately in the context of the unlink()ing
> > > +	 * process.
> > > +	 */
> > 
> > I think this is wrong - we want to push unlinked inode processing
> > into the background so we don't have to wait on it, not force
> > inactivation of unlinked inodes to wait for other inodes to be
> > inactivated.
> 
> The original idea behind most of this was that we'd try to slow down a
> rm -rf so that the fs doesn't find itself facing a gigantic flood of
> inactivation work, particularly if there were a lot of extents to free
> or a lot of inodes to free.  Under this scheme we don't ever wait for
> inactivation if we're just closing a linked file, but we could do so for
> deletions.
> 
> However, it is difficult to quantify what "gigantic" means here.  The
> situation I was trying to avoid is where the system gets bogged down
> with background processing work for a long time after the userspace
> process terminates, but perhaps it's better not to bother. :)

Multi-processor systems are ubiquitous now. If we can move things to
the background and burn otherwise idle CPU to do the work so that
users/apps don't have to wait on it, then I think that's fine. IMO,
the more work we can push into async background processing the
better we can optimise it (e.g. start batching or using bulk
operations rather than one-at-a-time).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
