Re: [PATCH 24/28] xfs: reclaim inodes from the LRU

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 24/28] xfs: reclaim inodes from the LRU
Date: Wed, 6 Nov 2019 12:21:04 -0500	[thread overview]
Message-ID: <20191106172104.GB37080@bfoster> (raw)
In-Reply-To: <20191031234618.15403-25-david@fromorbit.com>

On Fri, Nov 01, 2019 at 10:46:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Replace the AG radix tree walking reclaim code with a list_lru
> walker, giving us both node-aware and memcg-aware inode reclaim
> at the XFS level. This requires adding an inode isolation function to
> determine if the inode can be reclaim, and a list walker to
> dispose of the inodes that were isolated.
> 
> We want the isolation function to be non-blocking. If we can't
> grab an inode then we either skip it or rotate it. If it's clean
> then we skip it, if it's dirty then we rotate to give it time to be
> cleaned before it is scanned again.
> 
> This congregates the dirty inodes at the tail of the LRU, which
> means that if we start hitting a majority of dirty inodes either
> there are lots of unlinked inodes in the reclaim list or we've
> reclaimed all the clean inodes and we're looped back on the dirty
> inodes. Either way, this is an indication we should tell kswapd to
> back off.
> 
> The non-blocking isolation function introduces a complexity for the
> filesystem shutdown case. When the filesystem is shut down, we want
> to free the inode even if it is dirty, and this may require
> blocking. We already hold the locks needed to do this blocking, so
> what we do is that we leave inodes locked - both the ILOCK and the
> flush lock - while they are sitting on the dispose list to be freed
> after the LRU walk completes.  This allows us to process the
> shutdown state outside the LRU walk where we can block safely.
> 
> Because we now are reclaiming inodes from the context that it needs
> memory in (memcg and/or node), direct reclaim throttling within the
> high level reclaim code in now much more effective. Hence we don't
> wait on IO for either kswapd or direct reclaim. However, we have to
> tell kswapd to back off if we start hitting too many dirty inodes.
> This implies we've wrapped around the LRU and don't have many clean
> inodes left to reclaim, so it needs to wait a while for the AIL
> pushing to clean some of the remaining reclaimable inodes.
> 
> Keep in mind we don't have to care about inode lock order or
> blocking with inode locks held here because a) we are using
> trylocks, and b) once marked with XFS_IRECLAIM they can't be found
> via the LRU and inode cache lookups will abort and retry. Hence
> nobody will try to lock them in any other context that might also be
> holding other inode locks.
> 
> Also convert xfs_reclaim_all_inodes() to use a LRU walk to free all
> the reclaimable inodes in the filesystem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks fundamentally sane. Some logic quibbles..

>  fs/xfs/xfs_icache.c | 404 +++++++++++++-------------------------------
>  fs/xfs/xfs_icache.h |  18 +-
>  fs/xfs/xfs_inode.h  |  18 ++
>  fs/xfs/xfs_super.c  |  46 ++++-
>  4 files changed, 190 insertions(+), 296 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 350f42e7730b..05dd292bfdb6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -968,160 +968,110 @@ xfs_inode_ag_iterator_tag(
>  	return last_error;
>  }
>  
> -/*
> - * Grab the inode for reclaim.
> - *
> - * Return false if we aren't going to reclaim it, true if it is a reclaim
> - * candidate.
> - *
> - * If the inode is clean or unreclaimable, return 0 to tell the caller it does
> - * not require flushing. Otherwise return the log item lsn of the inode so the
> - * caller can determine it's inode flush target.  If we get the clean/dirty
> - * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks
> - * held.
> - */
> -STATIC bool
> -xfs_reclaim_inode_grab(
> -	struct xfs_inode	*ip,
> -	int			flags,
> -	xfs_lsn_t		*lsn)
> +enum lru_status
> +xfs_inode_reclaim_isolate(
> +	struct list_head	*item,
> +	struct list_lru_one	*lru,
> +	spinlock_t		*lru_lock,

Did we ever establish whether we should cycle the lru_lock during long
running scans?

> +	void			*arg)
>  {
> -	ASSERT(rcu_read_lock_held());
> -	*lsn = 0;
> +        struct xfs_ireclaim_args *ra = arg;
> +        struct inode		*inode = container_of(item, struct inode,
> +						      i_lru);
> +        struct xfs_inode	*ip = XFS_I(inode);

Whitespace damage on the above lines (space indentation vs tabs).

> +	enum lru_status		ret;
> +	xfs_lsn_t		lsn = 0;
> +
> +	/* Careful: inversion of iflags_lock and everything else here */
> +	if (!spin_trylock(&ip->i_flags_lock))
> +		return LRU_SKIP;
> +
> +	/* if we are in shutdown, we'll reclaim it even if dirty */
> +	ret = LRU_ROTATE;
> +	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE) &&
> +	    !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;
> +		ra->dirty_skipped++;
> +		goto out_unlock_flags;
> +	}
>  
> -	/* quick check for stale RCU freed inode */
> -	if (!ip->i_ino)
> -		return false;
> +	ret = LRU_SKIP;
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> +		goto out_unlock_flags;
>  
> -	/*
> -	 * Do unlocked checks to see if the inode already is being flushed or in
> -	 * reclaim to avoid lock traffic. If the inode is not clean, return the
> -	 * position in the AIL for the caller to push to.
> -	 */
> -	if (!xfs_inode_clean(ip)) {
> -		*lsn = ip->i_itemp->ili_item.li_lsn;
> -		return false;
> +	if (!__xfs_iflock_nowait(ip)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;

This looks like a potential crash vector if we ever got here with a
clean inode.

> +		ra->dirty_skipped++;
> +		goto out_unlock_inode;
>  	}
>  
> -	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> -		return false;
> +	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> +		goto reclaim;
>  
>  	/*
> -	 * The radix tree lock here protects a thread in xfs_iget from racing
> -	 * with us starting reclaim on the inode.  Once we have the
> -	 * XFS_IRECLAIM flag set it will not touch us.
> -	 *
> -	 * Due to RCU lookup, we may find inodes that have been freed and only
> -	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
> -	 * aren't candidates for reclaim at all, so we must check the
> -	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> +	 * Now the inode is locked, we can actually determine if it is dirty
> +	 * without racing with anything.
>  	 */
> -	spin_lock(&ip->i_flags_lock);
> -	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
> -	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
> -		/* not a reclaim candidate. */
> -		spin_unlock(&ip->i_flags_lock);
> -		return false;
> +	ret = LRU_ROTATE;
> +	if (xfs_ipincount(ip)) {
> +		ra->dirty_skipped++;

Hmm.. didn't we have an LSN check here?

Altogether, I think the logic in this function would be a lot more
simple if we had something like the following:

	...
	/* ret == LRU_SKIP */
        if (!xfs_inode_clean(ip)) {
		ret = LRU_ROTATE;
                lsn = ip->i_itemp->ili_item.li_lsn;
                ra->dirty_skipped++;
        }
        if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
                ra->lowest_lsn = lsn;
        return ret;

... as the non-reclaim exit path. Then the earlier logic simply dictates
how we process the inode instead of conflating lru processing with
lsn/dirty checks. Otherwise for example (based on the current logic),
it's not really clear to me whether ->dirty_skipped cares about dirty
inodes or just the fact that we skipped an inode.

> +		goto out_ifunlock;
> +	}
> +	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;
> +		ra->dirty_skipped++;
> +		goto out_ifunlock;
>  	}
> +
...
> @@ -1165,167 +1108,52 @@ xfs_reclaim_inode(
...
>  void
>  xfs_reclaim_all_inodes(
>  	struct xfs_mount	*mp)
>  {
...
> +	while (list_lru_count(&mp->m_inode_lru)) {

It seems unnecessary to call this twice per-iter:

	while ((to_free = list_lru_count(&mp->m_inode_lru))) {
		...
	}

Hm?

Brian

> +		struct xfs_ireclaim_args ra;
> +		long freed, to_free;
> +
> +		xfs_ireclaim_args_init(&ra);
> +
> +		to_free = list_lru_count(&mp->m_inode_lru);
> +		freed = list_lru_walk(&mp->m_inode_lru,
> +				      xfs_inode_reclaim_isolate, &ra, to_free);
> +		xfs_dispose_inodes(&ra.freeable);
> +
> +		if (freed == 0) {
> +			xfs_log_force(mp, XFS_LOG_SYNC);
> +			xfs_ail_push_all(mp->m_ail);
> +		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
> +			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
> +		}
> +		cond_resched();
> +	}
>  }
>  
>  STATIC int
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index afd692b06c13..86e858e4a281 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -49,8 +49,24 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
>  struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
>  void xfs_inode_free(struct xfs_inode *ip);
>  
> +struct xfs_ireclaim_args {
> +	struct list_head	freeable;
> +	xfs_lsn_t		lowest_lsn;
> +	unsigned long		dirty_skipped;
> +};
> +
> +static inline void
> +xfs_ireclaim_args_init(struct xfs_ireclaim_args *ra)
> +{
> +	INIT_LIST_HEAD(&ra->freeable);
> +	ra->lowest_lsn = NULLCOMMITLSN;
> +	ra->dirty_skipped = 0;
> +}
> +
> +enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
> +		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg);
> +void xfs_dispose_inodes(struct list_head *freeable);
>  void xfs_reclaim_all_inodes(struct xfs_mount *mp);
> -long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
>  
>  void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index bcfb35a9c5ca..00145debf820 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -270,6 +270,15 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
>  
>  extern void __xfs_iflock(struct xfs_inode *ip);
>  
> +static inline int __xfs_iflock_nowait(struct xfs_inode *ip)
> +{
> +	lockdep_assert_held(&ip->i_flags_lock);
> +	if (ip->i_flags & XFS_IFLOCK)
> +		return false;
> +	ip->i_flags |= XFS_IFLOCK;
> +	return true;
> +}
> +
>  static inline int xfs_iflock_nowait(struct xfs_inode *ip)
>  {
>  	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
> @@ -281,6 +290,15 @@ static inline void xfs_iflock(struct xfs_inode *ip)
>  		__xfs_iflock(ip);
>  }
>  
> +static inline void __xfs_ifunlock(struct xfs_inode *ip)
> +{
> +	lockdep_assert_held(&ip->i_flags_lock);
> +	ASSERT(ip->i_flags & XFS_IFLOCK);
> +	ip->i_flags &= ~XFS_IFLOCK;
> +	smp_mb();
> +	wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
> +}
> +
>  static inline void xfs_ifunlock(struct xfs_inode *ip)
>  {
>  	ASSERT(xfs_isiflocked(ip));
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 98ffbe42f8ae..096ae31b5436 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -17,6 +17,7 @@
>  #include "xfs_alloc.h"
>  #include "xfs_fsops.h"
>  #include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_log.h"
>  #include "xfs_log_priv.h"
> @@ -1772,23 +1773,54 @@ xfs_fs_mount(
>  }
>  
>  static long
> -xfs_fs_nr_cached_objects(
> +xfs_fs_free_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> -	/* Paranoia: catch incorrect calls during mount setup or teardown */
> -	if (WARN_ON_ONCE(!sb->s_fs_info))
> -		return 0;
> +	struct xfs_mount	*mp = XFS_M(sb);
> +	struct xfs_ireclaim_args ra;
> +	long			freed;
>  
> -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> +	xfs_ireclaim_args_init(&ra);
> +
> +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> +					xfs_inode_reclaim_isolate, &ra);
> +	xfs_dispose_inodes(&ra.freeable);
> +
> +	/*
> +	 * Deal with dirty inodes. We will have the LSN of
> +	 * the oldest dirty inode in our reclaim args if we skipped any.
> +	 *
> +	 * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than
> +	 * we freed) then we need kswapd to back off once it's scan has been
> +	 * completed. That way it will have some clean inodes once it comes back
> +	 * and can make progress, but make sure we have inode cleaning in
> +	 * progress.
> +	 *
> +	 * Direct reclaim will be throttled by the caller as it winds the
> +	 * priority up. All we need to do is keep pushing on dirty inodes
> +	 * in the background so when we come back progress will be made.
> +	 */
> +	if (current_is_kswapd() && ra.dirty_skipped >= freed) {
> +		if (current->reclaim_state)
> +			current->reclaim_state->need_backoff = true;
> +	}
> +	if (ra.lowest_lsn != NULLCOMMITLSN)
> +		xfs_ail_push(mp->m_ail, ra.lowest_lsn);
> +
> +	return freed;
>  }
>  
>  static long
> -xfs_fs_free_cached_objects(
> +xfs_fs_nr_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> -	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
> +	/* Paranoia: catch incorrect calls during mount setup or teardown */
> +	if (WARN_ON_ONCE(!sb->s_fs_info))
> +		return 0;
> +
> +	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
>  }
>  
>  static const struct super_operations xfs_super_operations = {
> -- 
> 2.24.0.rc0
>