linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH 23/24] xfs: reclaim inodes from the LRU
Date: Thu,  1 Aug 2019 12:17:51 +1000	[thread overview]
Message-ID: <20190801021752.4986-24-david@fromorbit.com> (raw)
In-Reply-To: <20190801021752.4986-1-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

Replace the AG radix tree walking reclaim code with a list_lru
walker, giving us both node-aware and memcg-aware inode reclaim
at the XFS level. This requires adding an inode isolation function to
determine if the inode can be reclaim, and a list walker to
dispose of the inodes that were isolated.

We want the isolation function to be non-blocking. If we can't
grab an inode then we either skip it or rotate it. If it's clean
then we skip it, if it's dirty then we rotate to give it time to be
cleaned before it is scanned again.

This congregates the dirty inodes at the tail of the LRU, which
means that if we start hitting a majority of dirty inodes either
there are lots of unlinked inodes in the reclaim list or we've
reclaimed all the clean inodes and we're looped back on the dirty
inodes. Either way, this is an indication we should tell kswapd to
back off.

The non-blocking isolation function introduces a complexity for the
filesystem shutdown case. When the filesystem is shut down, we want
to free the inode even if it is dirty, and this may require
blocking. We already hold the locks needed to do this blocking, so
what we do is that we leave inodes locked - both the ILOCK and the
flush lock - while they are sitting on the dispose list to be freed
after the LRU walk completes.  This allows us to process the
shutdown state outside the LRU walk where we can block safely.

Keep in mind we don't have to care about inode lock order or
blocking with inode locks held here because a) we are using
trylocks, and b) once marked with XFS_IRECLAIM they can't be found
via the LRU and inode cache lookups will abort and retry. Hence
nobody will try to lock them in any other context that might also be
holding other inode locks.

Also convert xfs_reclaim_inodes() to use a LRU walk to free all
the reclaimable inodes in the filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 199 ++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_icache.h |  10 ++-
 fs/xfs/xfs_inode.h  |   8 ++
 fs/xfs/xfs_super.c  |  50 +++++++++--
 4 files changed, 232 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 610f643df9f6..891fe3795c8f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1195,7 +1195,7 @@ xfs_reclaim_inode(
  *
  * Return the number of inodes freed.
  */
-STATIC int
+int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
@@ -1297,40 +1297,185 @@ xfs_reclaim_inodes_ag(
 	return freed;
 }
 
-void
-xfs_reclaim_inodes(
-	struct xfs_mount	*mp)
+enum lru_status
+xfs_inode_reclaim_isolate(
+	struct list_head	*item,
+	struct list_lru_one	*lru,
+	spinlock_t		*lru_lock,
+	void			*arg)
 {
-	xfs_reclaim_inodes_ag(mp, SYNC_WAIT, INT_MAX);
+        struct xfs_ireclaim_args *ra = arg;
+        struct inode		*inode = container_of(item, struct inode, i_lru);
+        struct xfs_inode	*ip = XFS_I(inode);
+	enum lru_status		ret;
+	xfs_lsn_t		lsn = 0;
+
+	/* Careful: inversion of iflags_lock and everything else here */
+	if (!spin_trylock(&ip->i_flags_lock))
+		return LRU_SKIP;
+
+	ret = LRU_ROTATE;
+	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_unlock_flags;
+	}
+
+	ret = LRU_SKIP;
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+		goto out_unlock_flags;
+
+	if (!__xfs_iflock_nowait(ip)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_unlock_inode;
+	}
+
+	/* if we are in shutdown, we'll reclaim it even if dirty */
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
+		goto reclaim;
+
+	/*
+	 * Now the inode is locked, we can actually determine if it is dirty
+	 * without racing with anything.
+	 */
+	ret = LRU_ROTATE;
+	if (xfs_ipincount(ip)) {
+		ra->dirty_skipped++;
+		goto out_ifunlock;
+	}
+	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_ifunlock;
+	}
+
+reclaim:
+	/*
+	 * Once we mark the inode with XFS_IRECLAIM, no-one will grab it again.
+	 * RCU lookups will still find the inode, but they'll stop when they set
+	 * the IRECLAIM flag. Hence we can leave the inode locked as we move it
+	 * to the dispose list so we can deal with shutdown cleanup there
+	 * outside the LRU lock context.
+	 */
+	__xfs_iflags_set(ip, XFS_IRECLAIM);
+	list_lru_isolate_move(lru, &inode->i_lru, &ra->freeable);
+	spin_unlock(&ip->i_flags_lock);
+	return LRU_REMOVED;
+
+out_ifunlock:
+	xfs_ifunlock(ip);
+out_unlock_inode:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_unlock_flags:
+	spin_unlock(&ip->i_flags_lock);
+
+	if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
+		ra->lowest_lsn = lsn;
+	return ret;
 }
 
-/*
- * Scan a certain number of inodes for reclaim.
- *
- * When called we make sure that there is a background (fast) inode reclaim in
- * progress, while we will throttle the speed of reclaim via doing synchronous
- * reclaim of inodes. That means if we come across dirty inodes, we wait for
- * them to be cleaned, which we hope will not be very long due to the
- * background walker having already kicked the IO off on those dirty inodes.
- */
-long
-xfs_reclaim_inodes_nr(
-	struct xfs_mount	*mp,
-	int			nr_to_scan)
+static void
+xfs_dispose_inode(
+	struct xfs_inode	*ip)
 {
-	int			sync_mode = 0;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	xfs_ino_t		ino;
+
+	ASSERT(xfs_isiflocked(ip));
+	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE));
+	ASSERT(ip->i_ino != 0);
 
 	/*
-	 * For kswapd, we kick background inode writeback. For direct
-	 * reclaim, we issue and wait on inode writeback to throttle
-	 * reclaim rates and avoid shouty OOM-death.
+	 * Process the shutdown reclaim work we deferred from the LRU isolation
+	 * callback before we go any further.
 	 */
-	if (current_is_kswapd())
-		xfs_ail_push_all(mp->m_ail);
-	else
-		sync_mode |= SYNC_WAIT;
+	if (XFS_FORCED_SHUTDOWN(mp)) {
+		xfs_iunpin_wait(ip);
+		xfs_iflush_abort(ip, false);
+	} else {
+		xfs_ifunlock(ip);
+	}
 
-	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
+	/*
+	 * Because we use RCU freeing we need to ensure the inode always appears
+	 * to be reclaimed with an invalid inode number when in the free state.
+	 * We do this as early as possible under the ILOCK so that
+	 * xfs_iflush_cluster() and xfs_ifree_cluster() can be guaranteed to
+	 * detect races with us here. By doing this, we guarantee that once
+	 * xfs_iflush_cluster() or xfs_ifree_cluster() has locked XFS_ILOCK that
+	 * it will see either a valid inode that will serialise correctly, or it
+	 * will see an invalid inode that it can skip.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	ino = ip->i_ino; /* for radix_tree_delete */
+	ip->i_flags = XFS_IRECLAIM;
+	ip->i_ino = 0;
+	spin_unlock(&ip->i_flags_lock);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	XFS_STATS_INC(mp, xs_ig_reclaims);
+	/*
+	 * Remove the inode from the per-AG radix tree.
+	 *
+	 * Because radix_tree_delete won't complain even if the item was never
+	 * added to the tree assert that it's been there before to catch
+	 * problems with the inode life time early on.
+	 */
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino));
+	spin_lock(&pag->pag_ici_lock);
+	if (!radix_tree_delete(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ino)))
+		ASSERT(0);
+	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
+
+	/*
+	 * Here we do an (almost) spurious inode lock in order to coordinate
+	 * with inode cache radix tree lookups.  This is because the lookup
+	 * can reference the inodes in the cache without taking references.
+	 *
+	 * We make that OK here by ensuring that we wait until the inode is
+	 * unlocked after the lookup before we go ahead and free it.
+	 *
+	 * XXX: need to check this is still true. Not sure it is.
+	 */
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_qm_dqdetach(ip);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	__xfs_inode_free(ip);
+}
+
+void
+xfs_dispose_inodes(
+	struct list_head	*freeable)
+{
+	while (!list_empty(freeable)) {
+		struct inode *inode;
+
+		inode = list_first_entry(freeable, struct inode, i_lru);
+		list_del_init(&inode->i_lru);
+
+		xfs_dispose_inode(XFS_I(inode));
+		cond_resched();
+	}
+}
+void
+xfs_reclaim_inodes(
+	struct xfs_mount	*mp)
+{
+	while (list_lru_count(&mp->m_inode_lru)) {
+		struct xfs_ireclaim_args ra;
+
+		INIT_LIST_HEAD(&ra.freeable);
+		ra.lowest_lsn = NULLCOMMITLSN;
+		list_lru_walk(&mp->m_inode_lru, xfs_inode_reclaim_isolate,
+				&ra, LONG_MAX);
+		xfs_dispose_inodes(&ra.freeable);
+		if (ra.lowest_lsn != NULLCOMMITLSN)
+			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
+	}
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 0ab08b58cd45..dadc69a30f33 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -49,8 +49,16 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
+struct xfs_ireclaim_args {
+	struct list_head	freeable;
+	xfs_lsn_t		lowest_lsn;
+	unsigned long		dirty_skipped;
+};
+
+enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
+		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg);
+void xfs_dispose_inodes(struct list_head *freeable);
 void xfs_reclaim_inodes(struct xfs_mount *mp);
-long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 558173f95a03..463170dc4c02 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -263,6 +263,14 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
 
 extern void __xfs_iflock(struct xfs_inode *ip);
 
+static inline int __xfs_iflock_nowait(struct xfs_inode *ip)
+{
+	if (ip->i_flags & XFS_IFLOCK)
+		return false;
+	ip->i_flags |= XFS_IFLOCK;
+	return true;
+}
+
 static inline int xfs_iflock_nowait(struct xfs_inode *ip)
 {
 	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b5c4c1b6fd19..e3e898a2896c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -17,6 +17,7 @@
 #include "xfs_alloc.h"
 #include "xfs_fsops.h"
 #include "xfs_trans.h"
+#include "xfs_trans_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 #include "xfs_log_priv.h"
@@ -1810,23 +1811,58 @@ xfs_fs_mount(
 }
 
 static long
-xfs_fs_nr_cached_objects(
+xfs_fs_free_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
-	/* Paranoia: catch incorrect calls during mount setup or teardown */
-	if (WARN_ON_ONCE(!sb->s_fs_info))
-		return 0;
+	struct xfs_mount	*mp = XFS_M(sb);
+        struct xfs_ireclaim_args ra;
+	long freed;
 
-	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
+	INIT_LIST_HEAD(&ra.freeable);
+	ra.lowest_lsn = NULLCOMMITLSN;
+	ra.dirty_skipped = 0;
+
+	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
+					xfs_inode_reclaim_isolate, &ra);
+	xfs_dispose_inodes(&ra.freeable);
+
+	/*
+	 * Deal with dirty inodes if we skipped any. We will have the LSN of
+	 * the oldest dirty inode in our reclaim args if we skipped any.
+	 *
+	 * For direct reclaim, wait on an AIL push to clean some inodes.
+	 *
+	 * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than
+	 * we freed) then we need kswapd to back off once it's scan has been
+	 * completed. That way it will have some clean inodes once it comes back
+	 * and can make progress, but make sure we have inode cleaning in
+	 * progress....
+	 */
+	if (current_is_kswapd()) {
+		if (ra.dirty_skipped >= freed) {
+			if (current->reclaim_state)
+				current->reclaim_state->need_backoff = true;
+			if (ra.lowest_lsn != NULLCOMMITLSN)
+				xfs_ail_push(mp->m_ail, ra.lowest_lsn);
+		}
+	} else if (ra.lowest_lsn != NULLCOMMITLSN) {
+		xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
+	}
+
+	return freed;
 }
 
 static long
-xfs_fs_free_cached_objects(
+xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
-	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
+	/* Paranoia: catch incorrect calls during mount setup or teardown */
+	if (WARN_ON_ONCE(!sb->s_fs_info))
+		return 0;
+
+	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
 }
 
 static const struct super_operations xfs_super_operations = {
-- 
2.22.0


  parent reply	other threads:[~2019-08-01  2:18 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
2019-08-01  2:17 ` [PATCH 01/24] mm: directed shrinker work deferral Dave Chinner
2019-08-02 15:27   ` Brian Foster
2019-08-04  1:49     ` Dave Chinner
2019-08-05 17:42       ` Brian Foster
2019-08-05 23:43         ` Dave Chinner
2019-08-06 12:27           ` Brian Foster
2019-08-06 22:22             ` Dave Chinner
2019-08-07 11:13               ` Brian Foster
2019-08-01  2:17 ` [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers Dave Chinner
2019-08-02 15:27   ` Brian Foster
2019-08-04  1:50     ` Dave Chinner
2019-08-01  2:17 ` [PATCH 03/24] mm: factor shrinker work calculations Dave Chinner
2019-08-02 15:08   ` Nikolay Borisov
2019-08-04  2:05     ` Dave Chinner
2019-08-02 15:31   ` Brian Foster
2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
2019-08-02 15:34   ` Brian Foster
2019-08-04 16:48   ` Nikolay Borisov
2019-08-04 21:37     ` Dave Chinner
2019-08-07 16:12   ` kbuild test robot
2019-08-07 18:00   ` kbuild test robot
2019-08-01  2:17 ` [PATCH 05/24] shrinker: clean up variable types and tracepoints Dave Chinner
2019-08-01  2:17 ` [PATCH 06/24] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
2019-08-01  2:17 ` [PATCH 07/24] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
2019-08-01  2:17 ` [PATCH 08/24] mm: kswapd backoff for shrinkers Dave Chinner
2019-08-01  2:17 ` [PATCH 09/24] xfs: don't allow log IO to be throttled Dave Chinner
2019-08-01 13:39   ` Chris Mason
2019-08-01 23:58     ` Dave Chinner
2019-08-02  8:12       ` Christoph Hellwig
2019-08-02 14:11       ` Chris Mason
2019-08-02 18:34         ` Matthew Wilcox
2019-08-02 23:28         ` Dave Chinner
2019-08-05 18:32           ` Chris Mason
2019-08-05 23:09             ` Dave Chinner
2019-08-01  2:17 ` [PATCH 10/24] xfs: fix missed wakeup on l_flush_wait Dave Chinner
2019-08-01  2:17 ` [PATCH 11/24] xfs:: account for memory freed from metadata buffers Dave Chinner
2019-08-01  8:16   ` Christoph Hellwig
2019-08-01  9:21     ` Dave Chinner
2019-08-06  5:51       ` Christoph Hellwig
2019-08-01  2:17 ` [PATCH 12/24] xfs: correctly acount for reclaimable slabs Dave Chinner
2019-08-06  5:52   ` Christoph Hellwig
2019-08-06 21:05     ` Dave Chinner
2019-08-01  2:17 ` [PATCH 13/24] xfs: synchronous AIL pushing Dave Chinner
2019-08-05 17:51   ` Brian Foster
2019-08-05 23:21     ` Dave Chinner
2019-08-06 12:29       ` Brian Foster
2019-08-01  2:17 ` [PATCH 14/24] xfs: tail updates only need to occur when LSN changes Dave Chinner
2019-08-05 17:53   ` Brian Foster
2019-08-05 23:28     ` Dave Chinner
2019-08-06  5:33       ` Dave Chinner
2019-08-06 12:53         ` Brian Foster
2019-08-06 21:11           ` Dave Chinner
2019-08-01  2:17 ` [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint Dave Chinner
2019-08-05 18:03   ` Brian Foster
2019-08-05 23:33     ` Dave Chinner
2019-08-06 12:57       ` Brian Foster
2019-08-06 21:21         ` Dave Chinner
2019-08-01  2:17 ` [PATCH 16/24] xfs: Lower CIL flush limit for large logs Dave Chinner
2019-08-04 17:12   ` Nikolay Borisov
2019-08-01  2:17 ` [PATCH 17/24] xfs: don't block kswapd in inode reclaim Dave Chinner
2019-08-06 18:21   ` Brian Foster
2019-08-06 21:27     ` Dave Chinner
2019-08-07 11:14       ` Brian Foster
2019-08-01  2:17 ` [PATCH 18/24] xfs: reduce kswapd blocking on inode locking Dave Chinner
2019-08-06 18:22   ` Brian Foster
2019-08-06 21:33     ` Dave Chinner
2019-08-07 11:30       ` Brian Foster
2019-08-07 23:16         ` Dave Chinner
2019-08-01  2:17 ` [PATCH 19/24] xfs: kill background reclaim work Dave Chinner
2019-08-01  2:17 ` [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO Dave Chinner
2019-08-07 18:09   ` Brian Foster
2019-08-07 23:10     ` Dave Chinner
2019-08-08 16:20       ` Brian Foster
2019-08-01  2:17 ` [PATCH 21/24] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
2019-08-01  2:17 ` [PATCH 22/24] xfs: track reclaimable inodes using a LRU list Dave Chinner
2019-08-08 16:36   ` Brian Foster
2019-08-09  0:10     ` Dave Chinner
2019-08-01  2:17 ` Dave Chinner [this message]
2019-08-08 16:39   ` [PATCH 23/24] xfs: reclaim inodes from the LRU Brian Foster
2019-08-09  1:20     ` Dave Chinner
2019-08-09 12:36       ` Brian Foster
2019-08-11  2:17         ` Dave Chinner
2019-08-11 12:46           ` Brian Foster
2019-08-01  2:17 ` [PATCH 24/24] xfs: remove unusued old inode reclaim code Dave Chinner
2019-08-06  5:57 ` [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Christoph Hellwig
2019-08-06 21:37   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190801021752.4986-24-david@fromorbit.com \
    --to=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).