Linux-Fsdevel Archive on lore.kernel.org
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 20/26] xfs: use AIL pushing for inode reclaim IO
Date: Fri, 11 Oct 2019 13:38:35 -0400
Message-ID: <20191011173835.GA64237@bfoster> (raw)
In-Reply-To: <20191009032124.10541-21-david@fromorbit.com>

On Wed, Oct 09, 2019 at 02:21:18PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode reclaim currently issues its own inode IO when it comes
> across dirty inodes. This is used to throttle direct reclaim down to
> the rate at which we can reclaim dirty inodes. Failure to throttle
> in this manner results in the OOM killer being trivial to trigger
> even when there is lots of free memory available.
> 
> However, having direct reclaimers issue IO causes an amount of
> IO thrashing to occur. We can have up to the number of AGs in the
> filesystem concurrently issuing IO, plus the AIL pushing thread as
> well. This means we can have many competing sources of IO and they all
> end up thrashing and competing for the request slots in the block
> device.
> 
> Similar to dirty page throttling and the BDI flusher thread, we can
> use the AIL pushing thread as the sole place we issue inode writeback
> from and everything else waits for it to make progress. To do this,
> reclaim will skip over dirty inodes, but in doing so will record the
> lowest LSN of all the dirty inodes it skips. It will then push the
> AIL to this LSN and wait for it to complete that work.
> 
> In doing so, we block direct reclaim on the IO of at least one IO,
> thereby providing some level of throttling for when we encounter
> dirty inodes. However we gain the ability to scan and reclaim
> clean inodes in a non-blocking fashion. This allows us to
> remove all the per-ag reclaim locking that avoids excessive direct
> reclaim, as repeated concurrent direct reclaim will hit the same
> dirty inodes and block waiting on the same IO to complete.
> 
> Hence direct reclaim will be throttled directly by the rate at which
> dirty inodes are cleaned by AIL pushing, rather than by delays
> caused by competing IO submissions. This allows us to remove all the
> locking that limits direct reclaim concurrency and greatly
> simplifies the inode reclaim code now that it just skips dirty
> inodes.
> 

The last couple of paragraphs above should probably be updated to explain
the modified locking, since the locking is no longer removed completely.

Otherwise, just a few small things...

> Note: this patch by itself isn't completely able to throttle direct
> reclaim sufficiently to prevent OOM killer madness. We can't do that
> until we change the way we index reclaimable inodes in the next
> patch and can feed back state to the mm core sanely.  However, we
> can't change the way we index reclaimable inodes until we have
> IO-less non-blocking reclaim for both direct reclaim and kswapd
> reclaim.  Catch-22...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 215 +++++++++++++++++++-------------------------
>  1 file changed, 90 insertions(+), 125 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 7e175304e146..ed996b37bda0 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
...
> @@ -967,28 +968,42 @@ xfs_inode_ag_iterator_tag(
...
> -STATIC int
> +STATIC bool
>  xfs_reclaim_inode_grab(
>  	struct xfs_inode	*ip,
> -	int			flags)
> +	int			flags,
> +	xfs_lsn_t		*lsn)
>  {
>  	ASSERT(rcu_read_lock_held());
> +	*lsn = 0;
>  
>  	/* quick check for stale RCU freed inode */
>  	if (!ip->i_ino)
> -		return 1;
> +		return false;
>  
>  	/*
> -	 * If we are asked for non-blocking operation, do unlocked checks to
> -	 * see if the inode already is being flushed or in reclaim to avoid
> -	 * lock traffic.
> +	 * Do unlocked checks to see if the inode already is being flushed or in
> +	 * reclaim to avoid lock traffic. If the inode is not clean, return the
> +	 * it's position in the AIL for the caller to push to.

"return the it's position ..." ?

>  	 */
> -	if ((flags & SYNC_TRYLOCK) &&
> -	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> -		return 1;
> +	if (!xfs_inode_clean(ip)) {
> +		*lsn = ip->i_itemp->ili_item.li_lsn;
> +		return false;
> +	}
> +
> +	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> +		return false;
>  
>  	/*
>  	 * The radix tree lock here protects a thread in xfs_iget from racing
...
> @@ -1050,92 +1065,64 @@ xfs_reclaim_inode_grab(
...
> -STATIC int
> +STATIC bool
>  xfs_reclaim_inode(
>  	struct xfs_inode	*ip,
>  	struct xfs_perag	*pag,
> -	int			sync_mode)
> +	xfs_lsn_t		*lsn)
>  {
...
>  
>  	/*
> -	 * Never flush out dirty data during non-blocking reclaim, as it would
> -	 * just contend with AIL pushing trying to do the same job.
> +	 * If it is pinned, we don't have an LSN we can push the AIL to - just
> +	 * an LSN that we can push the CIL with. We don't want to block doing
> +	 * that, so we'll just skip over this one without triggering writeback
> +	 * for now.
>  	 */
> -	if (!(sync_mode & SYNC_WAIT))
> +	if (xfs_ipincount(ip))
>  		goto out_ifunlock;

Hmm, so this seems slightly inconsistent with the grab codepath in terms
of how we handle the lsn of a pinned inode. Here, we ignore the li_lsn
of a pinned inode even though it may very well be in the AIL. However in
the _grab() case, we consider li_lsn whenever the inode is !clean (even
though it may be pinned).

TBH I'm not really convinced it matters which approach we use in the
bigger picture, but it would be good to have consistent logic either
way. Just dropping the pin check from here is probably the simplest
thing since we no longer do I/O from reclaim.
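
To illustrate what "consistent" could look like, here is a rough userspace
model, not kernel code. The struct model_inode type and the
model_can_reclaim() helper are made up for this sketch; only the idea that
a non-clean inode reports its li_lsn, pinned or not, comes from the
discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t xfs_lsn_t;	/* same width as the kernel type */

/* Hypothetical, simplified inode state; this is not struct xfs_inode. */
struct model_inode {
	bool		clean;		/* nothing left to write back */
	int		pincount;	/* > 0 while pinned by the log */
	xfs_lsn_t	li_lsn;		/* AIL position, 0 if none recorded */
};

/*
 * In this model both the grab step and the reclaim step would call the
 * same helper, so a pinned inode is treated like any other non-clean
 * inode: it is skipped and its AIL position is handed back to the caller.
 * Note that pincount is deliberately not consulted here.
 *
 * Returns true if the inode can be reclaimed right away.
 */
static bool model_can_reclaim(const struct model_inode *ip, xfs_lsn_t *lsn)
{
	*lsn = 0;
	if (!ip->clean) {
		*lsn = ip->li_lsn;
		return false;
	}
	return true;
}
```

With a single predicate like this, the two paths cannot disagree about
whether a pinned inode contributes an LSN to push.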

>  
...
> @@ -1205,44 +1186,34 @@ xfs_reclaim_inode(
>   * corrupted, we still want to try to reclaim all the inodes. If we don't,
>   * then a shut down during filesystem unmount reclaim walk leak all the
>   * unreclaimed inodes.
> + *
> + * Return the number of inodes freed.
>   */
>  STATIC int
>  xfs_reclaim_inodes_ag(
>  	struct xfs_mount	*mp,
>  	int			flags,
> -	int			*nr_to_scan)
> +	int			nr_to_scan)
>  {
>  	struct xfs_perag	*pag;
> -	int			error = 0;
> -	int			last_error = 0;
>  	xfs_agnumber_t		ag;
> -	int			trylock = flags & SYNC_TRYLOCK;
> -	int			skipped;
> +	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
> +	long			freed = 0;
>  
...
>  		do {
>  			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
>  			int	i;
>  
> +			mutex_lock(&pag->pag_ici_reclaim_lock);
> +			first_index = pag->pag_ici_reclaim_cursor;
> +
>  			rcu_read_lock();
>  			nr_found = radix_tree_gang_lookup_tag(
>  					&pag->pag_ici_root,

We need to unlock ->pag_ici_reclaim_lock in the !nr_found case before we
break out of the loop.
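
A minimal userspace sketch of the ordering I mean, assuming a toy lock and
lookup. Every model_* name below is hypothetical and stands in for the
per-AG state; resetting the cursor on the empty lookup is also just a
choice made for this model. The only point is that the early exit drops
the lock before the break.

```c
#include <stdbool.h>

#define MODEL_BATCH	32

/* Hypothetical stand-ins for the per-AG state; not the kernel structures. */
struct model_mutex {
	bool	held;
};

struct model_pag {
	struct model_mutex	reclaim_lock;
	int			cursor;		/* next index to look up */
	int			nr_tagged;	/* inodes tagged for reclaim */
};

static void model_lock(struct model_mutex *m)   { m->held = true; }
static void model_unlock(struct model_mutex *m) { m->held = false; }

/* Toy replacement for the tagged gang lookup: returns up to one batch. */
static int model_lookup(const struct model_pag *pag, int first_index)
{
	int left = pag->nr_tagged - first_index;

	if (left <= 0)
		return 0;
	return left < MODEL_BATCH ? left : MODEL_BATCH;
}

/* Walk one AG in batches and return how many inodes were looked at. */
static int model_scan_ag(struct model_pag *pag)
{
	int scanned = 0;
	int nr_found;

	do {
		int first_index;

		model_lock(&pag->reclaim_lock);
		first_index = pag->cursor;

		nr_found = model_lookup(pag, first_index);
		if (!nr_found) {
			/* Nothing found: drop the lock before leaving the loop. */
			pag->cursor = 0;
			model_unlock(&pag->reclaim_lock);
			break;
		}

		pag->cursor = first_index + nr_found;
		model_unlock(&pag->reclaim_lock);

		scanned += nr_found;
	} while (nr_found);

	return scanned;
}
```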

> @@ -1262,9 +1233,13 @@ xfs_reclaim_inodes_ag(
>  			for (i = 0; i < nr_found; i++) {
>  				struct xfs_inode *ip = batch[i];
>  
> -				if (done || xfs_reclaim_inode_grab(ip, flags))
> +				if (done ||
> +				    !xfs_reclaim_inode_grab(ip, flags, &lsn))
>  					batch[i] = NULL;
>  
> +				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
> +					lowest_lsn = lsn;

FWIW, this is a little tricky in that xfs_lsn_t is signed and
NULLCOMMITLSN is -1. It works because [CYCLE|BLOCK]_LSN() cast to uint,
but it might be worth checking for lowest_lsn == NULLCOMMITLSN
explicitly as done in other places.
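
For reference, a small userspace model of why this works and what the
explicit check would look like. The macros mirror the shape of the XFS
helpers but are re-declared here with uint32_t, lsn_cmp() is a simplified
stand-in for XFS_LSN_CMP(), and lowest_lsn_update() is a hypothetical
helper name used only for this sketch.

```c
#include <stdint.h>

/* Userspace model of the LSN helpers; these are not the kernel headers. */
typedef int64_t xfs_lsn_t;

#define NULLCOMMITLSN	((xfs_lsn_t)-1)
#define CYCLE_LSN(lsn)	((uint32_t)((lsn) >> 32))
#define BLOCK_LSN(lsn)	((uint32_t)(lsn))

/* Compare the cycle first, then the block, both as unsigned 32-bit values. */
static int lsn_cmp(xfs_lsn_t a, xfs_lsn_t b)
{
	if (CYCLE_LSN(a) != CYCLE_LSN(b))
		return CYCLE_LSN(a) < CYCLE_LSN(b) ? -1 : 1;
	if (BLOCK_LSN(a) != BLOCK_LSN(b))
		return BLOCK_LSN(a) < BLOCK_LSN(b) ? -1 : 1;
	return 0;
}

/*
 * Hypothetical helper: fold one candidate LSN into the running minimum.
 * The explicit NULLCOMMITLSN test means the result no longer depends on
 * how -1 happens to order under the unsigned cycle/block comparison.
 */
static xfs_lsn_t lowest_lsn_update(xfs_lsn_t lowest, xfs_lsn_t lsn)
{
	if (!lsn)
		return lowest;		/* nothing recorded for this inode */
	if (lowest == NULLCOMMITLSN || lsn_cmp(lsn, lowest) < 0)
		return lsn;
	return lowest;
}
```

A plain signed comparison would order any valid LSN above NULLCOMMITLSN,
whereas the cycle/block comparison sees -1 as the largest possible value,
which is the only reason the unguarded check in the patch picks up the
first real LSN.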

Brian

> +
>  				/*
>  				 * Update the index for the next lookup. Catch
>  				 * overflows into the next AG range which can
> @@ -1289,41 +1264,33 @@ xfs_reclaim_inodes_ag(
>  
>  			/* unlock now we've grabbed the inodes. */
>  			rcu_read_unlock();
> +			if (!done)
> +				pag->pag_ici_reclaim_cursor = first_index;
> +			else
> +				pag->pag_ici_reclaim_cursor = 0;
> +			mutex_unlock(&pag->pag_ici_reclaim_lock);
>  
>  			for (i = 0; i < nr_found; i++) {
>  				if (!batch[i])
>  					continue;
> -				error = xfs_reclaim_inode(batch[i], pag, flags);
> -				if (error && last_error != -EFSCORRUPTED)
> -					last_error = error;
> +				if (xfs_reclaim_inode(batch[i], pag, &lsn))
> +					freed++;
> +				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
> +					lowest_lsn = lsn;
>  			}
>  
> -			*nr_to_scan -= XFS_LOOKUP_BATCH;
> -
> +			nr_to_scan -= XFS_LOOKUP_BATCH;
>  			cond_resched();
>  
> -		} while (nr_found && !done && *nr_to_scan > 0);
> +		} while (nr_found && !done && nr_to_scan > 0);
>  
> -		if (trylock && !done)
> -			pag->pag_ici_reclaim_cursor = first_index;
> -		else
> -			pag->pag_ici_reclaim_cursor = 0;
> -		mutex_unlock(&pag->pag_ici_reclaim_lock);
>  		xfs_perag_put(pag);
>  	}
>  
> -	/*
> -	 * if we skipped any AG, and we still have scan count remaining, do
> -	 * another pass this time using blocking reclaim semantics (i.e
> -	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
> -	 * ensure that when we get more reclaimers than AGs we block rather
> -	 * than spin trying to execute reclaim.
> -	 */
> -	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
> -		trylock = 0;
> -		goto restart;
> -	}
> -	return last_error;
> +	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
> +		xfs_ail_push_sync(mp->m_ail, lowest_lsn);
> +
> +	return freed;
>  }
>  
>  int
> @@ -1331,9 +1298,7 @@ xfs_reclaim_inodes(
>  	xfs_mount_t	*mp,
>  	int		mode)
>  {
> -	int		nr_to_scan = INT_MAX;
> -
> -	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	return xfs_reclaim_inodes_ag(mp, mode, INT_MAX);
>  }
>  
>  /*
> @@ -1350,7 +1315,7 @@ xfs_reclaim_inodes_nr(
>  	struct xfs_mount	*mp,
>  	int			nr_to_scan)
>  {
> -	int			sync_mode = SYNC_TRYLOCK;
> +	int			sync_mode = 0;
>  
>  	/*
>  	 * For kswapd, we kick background inode writeback. For direct
> @@ -1362,7 +1327,7 @@ xfs_reclaim_inodes_nr(
>  	else
>  		sync_mode |= SYNC_WAIT;
>  
> -	return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
> +	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
>  }
>  
>  /*
> -- 
> 2.23.0.rc1
> 
