Linux-mm Archive on lore.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH 11/26] shrinker: defer work only to kswapd
Date: Wed,  9 Oct 2019 14:21:09 +1100
Message-ID: <20191009032124.10541-12-david@fromorbit.com>
In-Reply-To: <20191009032124.10541-1-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

Right now deferred work is picked up by whichever GFP_KERNEL context
reclaimer wins the race to empty the node's deferred work
counter. However, if there are lots of direct reclaimers, that
work might be continually picked up by contexts that can't do any
work, and so the contexts that could do the work miss the
opportunity to do it.

A further problem with the current code is that the deferred work
can be picked up by a random direct reclaimer, forcing that
specific process to do all the deferred reclaim work and hence
suffer extremely long latencies if the reclaim work blocks
regularly. This is not good for direct reclaim fairness or for
minimising long tail latency events.

To avoid these problems, simply limit deferred work to kswapd
contexts. We know kswapd is a context that can always do reclaim
work, and hence deferring work to kswapd allows the deferred work to
be done in the background and not adversely affect any specific
process context doing direct reclaim.
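
As an illustration of the claim rule described above, here is a minimal
userspace sketch using C11 atomics in place of the kernel's atomic64_t.
The names claim_deferred() and deferred_counter_t are hypothetical and
do not appear in the patch; the is_kswapd flag stands in for
current_is_kswapd().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for one node's shrinker->nr_deferred counter. */
typedef _Atomic(int64_t) deferred_counter_t;

/*
 * Hypothetical helper: only a kswapd-like caller drains the counter.
 * Direct reclaim leaves it untouched and claims no deferred work.
 */
static int64_t claim_deferred(deferred_counter_t *nr_deferred, bool is_kswapd)
{
	assert(nr_deferred != NULL);
	if (!is_kswapd)
		return 0;
	/* Same idea as atomic64_xchg(..., 0): take everything, leave zero. */
	return atomic_exchange(nr_deferred, 0);
}
```

With this shape, any number of concurrent direct reclaimers can run
without one of them inheriting the whole backlog.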

The advantage of this is that the amount of work to be done in direct
reclaim is now bounded and predictable - it is entirely based on
the cache's freeable objects and the reclaim priority. Hence all
direct reclaimers running at the same time should be doing
relatively equal amounts of work, thereby reducing the incidence of
long tail latencies due to uneven reclaim workloads.
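
The bound can be sketched as plain arithmetic. This is a simplified
userspace model, not the kernel code: bounded_scan_count() and
min_i64() are hypothetical names, base_scan stands for the value that
shrink_scan_count() would return, and deferred is assumed to be zero
for direct reclaim because only kswapd claims deferred work.

```c
#include <assert.h>
#include <stdint.h>

static int64_t min_i64(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the patched calculation.
 * base_scan: scan count derived from freeable objects and priority
 * deferred:  deferred work claimed by the caller (0 for direct reclaim)
 * priority:  reclaim priority; 0 is the most desperate
 */
static int64_t bounded_scan_count(int64_t base_scan, int64_t deferred,
				  int64_t freeable, int priority)
{
	int64_t scan = base_scan;

	assert(base_scan >= 0 && deferred >= 0 && freeable >= 0);
	assert(priority >= 0);

	/* Lower urgency (larger priority) takes a smaller share. */
	if (priority)
		deferred /= priority;
	scan += deferred;

	/* Never scan more than twice the freeable estimate. */
	return min_i64(scan, freeable * 2);
}
```

For example, with base_scan = 50, freeable = 1000 and priority = 12, a
direct reclaimer scans 50 objects regardless of backlog, while a
kswapd pass holding 1200 deferred objects scans 50 + 1200/12 = 150.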

Note that we use signed integers for everything except the freed
count as the returns from the shrinker callouts cannot be guaranteed
untainted. Indeed, the shrinkers can return scan counts larger than
were fed in, so we need scan counts to underflow in a detectable
manner to terminate loops. This is necessary to prevent a misbehaving
shrinker from triggering endless scanning loops.
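
The effect of the signed counter can be seen in a small userspace model
of the scan loop. run_scan_loop() and its overreport parameter are
hypothetical, used here only to imitate a shrinker that reports more
scanned objects than it was asked to scan.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the scan loop. "overreport" is how many extra
 * objects a misbehaving shrinker claims to have scanned per call.
 * Returns the number of loop iterations; *remaining gets the final count.
 */
static int run_scan_loop(int64_t scan_count, int64_t batch_size,
			 int64_t freeable, int64_t overreport,
			 int64_t *remaining)
{
	int iterations = 0;

	assert(batch_size > 0 && freeable > 0 && overreport >= 0);

	while (scan_count >= batch_size || scan_count >= freeable) {
		int64_t nr_to_scan = scan_count < batch_size ? scan_count
							     : batch_size;
		int64_t nr_scanned = nr_to_scan + overreport;

		/* Signed subtraction: an over-report drives the count negative. */
		scan_count -= nr_scanned;
		iterations++;
	}

	*remaining = scan_count;
	return iterations;
}
```

Because the count is signed, an over-reporting shrinker leaves it
negative and both loop conditions fail. With an unsigned counter the
same subtraction would wrap to a very large value and the loop would
keep going.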

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h |  2 +-
 mm/vmscan.c              | 98 +++++++++++++++++++++-------------------
 2 files changed, 52 insertions(+), 48 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 3405c39ab92c..30c10f42109f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -81,7 +81,7 @@ struct shrinker {
 	int id;
 #endif
 	/* objs pending delete, per node */
-	atomic_long_t *nr_deferred;
+	atomic64_t *nr_deferred;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index de6b09ad97ed..d05f64bd26ff 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -516,16 +516,16 @@ static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 				    struct shrinker *shrinker, int priority)
 {
-	unsigned long freed = 0;
-	long total_scan;
+	uint64_t freed = 0;
 	int64_t freeable_objects = 0;
 	int64_t scan_count;
-	long nr;
-	long new_nr;
+	int64_t scanned_objects = 0;
+	int64_t next_deferred = 0;
+	int64_t deferred_count = 0;
+	int64_t new_nr;
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
-	long scanned = 0, next_deferred;
 
 	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 		nid = 0;
@@ -536,47 +536,51 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		return scan_count;
 
 	/*
-	 * copy the current shrinker scan count into a local variable
-	 * and zero it so that other concurrent shrinker invocations
-	 * don't also do this scanning work.
+	 * If kswapd, we take all the deferred work and do it here. We don't let
+	 * direct reclaim do this, because then it means some poor sod is going
+	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
+	 * amount of reclaim work from concurrent kswapd operations. Hence we do
+	 * the work in the wrong place, at the wrong time, and it's largely
+	 * unpredictable.
+	 *
+	 * By doing the deferred work only in kswapd, we can schedule the work
+	 * according to the reclaim priority - low priority reclaim will do
+	 * less deferred work, hence we'll do more of the deferred work the more
+	 * desperate we become for free memory. This avoids the need to
+	 * specifically avoid deferred work windup, as low amounts of memory
+	 * pressure won't excessively trim caches anymore.
 	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+	if (current_is_kswapd()) {
+		int64_t	deferred_scan;
 
-	total_scan = nr + scan_count;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = scan_count;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
+		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
 
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
-	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
-	 */
-	if (scan_count < freeable_objects / 4)
-		total_scan = min_t(long, total_scan, freeable_objects / 2);
+		/* we want to scan 5-10% of the deferred work here at minimum */
+		deferred_scan = deferred_count;
+		if (priority)
+			do_div(deferred_scan, priority);
+		scan_count += deferred_scan;
+
+		/*
+		 * If there is more deferred work than the number of freeable
+		 * items in the cache, limit the amount of work we will carry
+		 * over to the next kswapd run on this cache. This prevents
+		 * deferred work windup.
+		 */
+		deferred_count = min(deferred_count, freeable_objects * 2);
+
+	}
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
 	 * never try to free more than twice the estimate number of
 	 * freeable entries.
 	 */
-	if (total_scan > freeable_objects * 2)
-		total_scan = freeable_objects * 2;
+	scan_count = min(scan_count, freeable_objects * 2);
 
-	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
+	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
 				   freeable_objects, scan_count,
-				   total_scan, priority);
+				   scan_count, priority);
 
 	/*
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
@@ -600,10 +604,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * scanning at high prio and therefore should try to reclaim as much as
 	 * possible.
 	 */
-	while (total_scan >= batch_size ||
-	       total_scan >= freeable_objects) {
+	while (scan_count >= batch_size ||
+	       scan_count >= freeable_objects) {
 		unsigned long ret;
-		unsigned long nr_to_scan = min(batch_size, total_scan);
+		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
 
 		shrinkctl->nr_to_scan = nr_to_scan;
 		shrinkctl->nr_scanned = nr_to_scan;
@@ -613,29 +617,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		freed += ret;
 
 		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
-		total_scan -= shrinkctl->nr_scanned;
-		scanned += shrinkctl->nr_scanned;
+		scan_count -= shrinkctl->nr_scanned;
+		scanned_objects += shrinkctl->nr_scanned;
 
 		cond_resched();
 	}
-
 done:
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
+	if (deferred_count)
+		next_deferred = deferred_count - scanned_objects;
 	else
-		next_deferred = 0;
+		next_deferred = scan_count;
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates. If we exhausted the
 	 * scan, there is no need to do an update.
 	 */
 	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
+		new_nr = atomic64_add_return(next_deferred,
 						&shrinker->nr_deferred[nid]);
 	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+		new_nr = atomic64_read(&shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
+					scan_count);
 	return freed;
 }
 
-- 
2.23.0.rc1


