* [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim
@ 2019-08-01  2:17 Dave Chinner
  2019-08-01  2:17 ` [PATCH 01/24] mm: directed shrinker work deferral Dave Chinner
                   ` (24 more replies)
  0 siblings, 25 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

Hi Folks,

We've had a problem with inode reclaim for a long time - XFS is
capable of caching tens of millions of inodes with ease and dirtying
hundreds of thousands of those cached inodes every second. It is
also capable of reclaiming more than half a million clean inodes per
second per reclaim thread.

The result of this is that when there is a significant change in
sustained memory pressure on a system with a large inode cache,
memory reclaim rapidly frees all the clean XFS inodes, but cannot
make progress on reclaiming dirty inodes because they are rate
limited by IO.

However, the shrinker infrastructure in the kernel has no way to
feed back rate limiting to the core memory reclaim algorithms.  In
fact there are no feedback mechanisms at all, and so when reclaim
has freed all the clean inodes and starts hitting dirty inodes, the
filesystem shrinker has no way of telling reclaim that the inode
reclaim rate has dropped from 500k/s to 500/s.

The result is that reclaim continues to try to free memory but
because it makes no progress freeing inodes it puts much more
pressure on the page LRUs and frees more pages. When it runs out of
pages, it starts swapping, and when it runs out of swap or can't get
a page for swap-in it starts going on an OOM kill rampage.  This
does nothing to "fix" the shortage of memory caused by the slowness
of dirty inode reclaim - if memory demand continues we just keep
hitting the OOM killer until either something critical is killed or
memory demand eases.

For a long time, XFS has avoided this insane spiral of shouty
OOM-killer rage death by cleaning inodes directly in the shrinker.
This has the effect of throttling memory reclaim to the rate at
which dirty inodes can be cleaned. Hence when we get into the state
where memory reclaim is dependent on inode reclaim making progress,
we don't ever allow LRU reclaim to run so far ahead of inode reclaim
that it winds up the reclaim priority and runs out of LRU pages to
reclaim and/or swap.

This has a downside, though. When there is a large amount of clean
page cache and a small amount of inode cache that is dirty (e.g.
lots of file data pressure and/or application memory demand) the
inode reclaim shrinkers can run out of clean inodes to reclaim and
start blocking on inode writeback. This can result in long reclaim
latencies even though there is lots of memory that can be
immediately reclaimed from the page cache. This is ... suboptimal.

There are other issues. We have to block kswapd, too, because it
will continue running until watermarks are satisfied, and that is
frequently the vector for shouty swappy death because it doesn't
back off before priority windup from lack of progress occurs.
Blocking kswapd then affects direct reclaim behaviour, because direct
reclaim often backs off expecting kswapd to make progress in the
meantime. But if
kswapd is not making progress, direct reclaim ends up in priority
windup from lack of progress, too. This is especially prevalent in
workloads that have a high percentage of GFP_NOFS allocations (e.g.
filesystem modification workloads).

The shrinkers have another problem w/ GFP_NOFS reclaim: the work
that is deferred because the shrinker cannot make progress gets
lumped on the first reclaim context that can do that work. That
means a direct reclaimer might get lumped with scanning millions of
objects during low priority scanning when it should only be scanning
a couple of thousand objects. This can result in highly
unpredictable and extremely long direct reclaim delays, especially
if it bumps into dirty inodes.

This is most definitely sub-optimal, but it's better than random
and/or premature OOM killer invocation under trivial workloads and
lots of reclaimable memory still being available. Hence we've kinda
been stuck with the sucky behaviour we have now.

This patch set aims to fix all these problems (at last!). The memory
reclaim and shrinker changes involve:

- a substantial rework of how the shrinker defers work, moving all
  the deferred work to kswapd to remove all the unpredictability
  from direct reclaim.  Direct reclaim will only do the work the
  direct reclaim context determines is necessary.

- deferred work is capped to prevent excessive scanning, and the
  amount of deferred work kswapd will do in each scan is increased
  linearly w.r.t. increasing reclaim priority. Hence when we are
  desperate for memory, kswapd will be running all the deferred work
  as quickly as possible. (A rough model of this scaling is sketched
  after this list.)

- The amount of deferred work and the amount of scanning that is
  done by the shrinkers is now tracked in the struct reclaim_state.
  This allows shrink_node() to see how much work is being done in
  comparison to both the LRU scanning and how much shrinker work is
  being deferred to kswapd. This allows direct reclaim to back off
  when too much work is being deferred and hence allow kswapd to
  make progress on the deferred work while it waits.

- A "need backoff" flag has been added to the struct reclaim_state.
  This allows individual shrinkers to indicate to kswapd that they
  need some time to finish work before being scanned again. This is
  basically for the same situation where kswapd backs off from LRU
  scanning.

  i.e. the LRU scanning has run into the tail of the LRU and is only
  finding dirty objects that require IO to complete before reclaim
  can make further progress. This is exactly the same problem we
  have with inode reclaim in XFS, and it is this mechanism that
  enables us to move to an IO-less inode reclaim architecture.
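
To make that scaling concrete, here's a rough userspace model of the
arithmetic kswapd ends up doing. It's a sketch only - the helper name
and sample numbers are made up for illustration, and the real code is
in the "shrinker: defer work only to kswapd" patch later in the
series:

#include <stdio.h>
#include <stdint.h>

/*
 * Toy model of the kswapd-only deferred work logic: kswapd picks up
 * deferred/priority objects each pass (~8% of the backlog at the
 * lowest priority of 12, all of it at priority 1), and the backlog
 * carried forward is capped at twice the freeable object count so
 * it can't wind up indefinitely.
 */
static void kswapd_deferred_pass(int64_t deferred, int64_t freeable,
				 int priority)
{
	int64_t scan = deferred;
	int64_t carry;

	if (priority)
		scan /= priority;		/* do_div() in the kernel */
	if (deferred > freeable * 2)
		deferred = freeable * 2;	/* prevent windup */
	carry = deferred - scan;
	if (carry < 0)
		carry = 0;

	printf("priority %2d: scan %8lld deferred objects, carry %8lld\n",
	       priority, (long long)scan, (long long)carry);
}

int main(void)
{
	int prio;

	/* 1M objects deferred against a cache with 100k freeable objects */
	for (prio = 12; prio > 0; prio--)
		kswapd_deferred_pass(1000000, 100000, prio);
	return 0;
}

Running that from priority 12 down to 1 shows the scan rate ramping
from ~8% of the backlog up to the whole backlog as reclaim priority
increases.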

The XFS changes are all over the place, and address both the reclaim
blocking problems and all the other related issues I found while
working on this patchset. These involve:

- fixing IO priority inversion problems between metadata
  writeback (inodes!) and log IO caused by the block layer write
  throttling (more on this later).

- some slab caches weren't marked as reclaimable, so were
  incorrectly accounted. Also account for the pages xfs_buf reclaim
  releases.

- reduced the delayed logging dirty item aggregation size (the CIL).
  This defines the minimum amount of memory XFS can operate in when
  there are heavy modifications in progress.

- reduced the memory footprint of the CIL when repeated
  modifications to objects occur.

- Added a mechanism to push the AIL to a specific LSN (metadata
  modification epoch) and wait for it. This forms the basis for
  direct inode reclaim deferring IO and waiting for some progress
  without issuing IO itself.

- reworked inode reclaim to use a list_lru to track inodes in
  reclaim rather than a radix tree tag in the inode cache. We
  iterated the radix tree for reclaim because it resulted in optimal
  IO patterns from multiple concurrent reclaimers, but we don't have
  to care about that any more because all IO comes from the AIL now.

  This gives us true LRU reclaim, and it allows us to effectively
  determine when we've run out of clean inodes to easily reclaim and
  provide that feedback to the higher levels via the "need backoff"
  flag. (A sketch of what this shrinker-side feedback looks like
  follows this list.)

- direct reclaim is non-blocking while scanning, but at the end of a
  scan it will still block waiting for IO, though only for /some/
  progress to be made, not for specific individual IOs.

- kswapd based reclaim is fully non-blocking.
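
For reference, this is roughly the shape of the shrinker-side
feedback described above, using only the hooks added earlier in the
series. It is a sketch, not the actual XFS code in the later patches;
reclaim_lru, example_reclaim_isolate() and only_dirty_inodes_left()
are placeholder names:

/* a minimal count/scan pair using the new will_defer + need_backoff hooks */
static unsigned long
example_reclaim_count(
	struct shrinker		*shrink,
	struct shrink_control	*sc)
{
	/* can't run reclaim in this context - defer all the work to kswapd */
	if (!(sc->gfp_mask & __GFP_FS))
		sc->will_defer = true;

	return list_lru_shrink_count(&reclaim_lru, sc);
}

static unsigned long
example_reclaim_scan(
	struct shrinker		*shrink,
	struct shrink_control	*sc)
{
	unsigned long		freed;

	/*
	 * Non-blocking walk: the (elided) isolate callback skips dirty
	 * inodes and leaves them for the AIL to clean.
	 */
	freed = list_lru_shrink_walk(&reclaim_lru, sc,
				     example_reclaim_isolate, NULL);

	/*
	 * If all that's left on the LRU is dirty inodes, tell kswapd to
	 * back off and wait for IO to make progress rather than burning
	 * CPU rescanning the same unreclaimable objects.
	 */
	if (!freed && only_dirty_inodes_left() && current->reclaim_state)
		current->reclaim_state->need_backoff = true;

	return freed;
}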

The result is that there is now enough feedback from the shrinkers
into the main memory reclaim loop for it to back off in the
situations where back-off is required to avoid OOM killer
invocation, despite XFS now largely doing non-blocking reclaim.

Testing involves a 16p/16GB machine running a fsmark workload that
creates sustained heavy dirty inode cache pressure, then
progressively locking 2GB of memory at a time to squeeze the workload
into less and less memory. A vanilla kernel runs well up to 12GB
squeezed, but at 14GB squeezed performance goes to hell. With just
the hacky "don't block kswapd by removing SYNC_WAIT" patch that
people seem to like, OOM kills start when squeezed to 12GB. With
that extended to direct reclaim, OOM kills start when squeezed to
just 8GB. With the full patchset, it runs similar to a vanilla
kernel up to 12GB squeezed, and vastly out-performs the vanilla
kernel with 14GB squeezed. Performance only drops ~20% with a 14GB
squeeze, whereas the vanilla kernel sees up to a 90% drop in
performance.
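
The squeeze itself is nothing fancy - just pinning progressively more
anonymous memory while fsmark runs. A hypothetical pinning helper
(not the actual tooling used for these numbers; it needs CAP_IPC_LOCK
or a raised RLIMIT_MEMLOCK) looks something like:

/* pin <GiB> of anonymous memory until killed: cc -o memlock memlock.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	size_t gib = argc > 1 ? strtoul(argv[1], NULL, 0) : 2;
	size_t len = gib << 30;
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED || mlock(p, len) < 0) {
		perror("pin");
		return 1;
	}
	printf("pinned %zu GiB, sleeping\n", gib);
	pause();		/* hold the memory until the test is done */
	return 0;
}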

I also ran testing with simoop, a simulated workload that Chris
Mason put together to demonstrate the long tail latency and
allocation stall problems the blocking in inode reclaim was causing
for workloads at FB.  The vanilla kernel averaged ~5 stalls/sec over
a test period of 10 hours; this patch series resulted in:

alloc stall rate = 0.00/sec (avg: 0.04) (p50: 0.04) (p95: 0.16) (p99: 0.32)

stalls almost going away entirely over a 10 hour period.

IOWs, the signs are there that this is a workable solution to the
problems caused by blocking inode reclaim without re-introducing the
Death-by-OOM-killer issues the blocking avoids.

Please note that I haven't fully gone non-blocking on direct reclaim
for a couple of reasons:

1. congestion_wait() and wait_iff_congested() are completely broken.
The blkmq change-over ripped out all the block layer congestion
reporting in 5.0 and didn't replace it with anything, so unless you
are operating on an NFS client, Ceph, FUSE or a DVD, congestion
checks and backoff aren't actually doing what they are supposed to.
i.e. wait_iff_congested() never blocks, and congestion_wait() always
sleeps for its full timeout.

IOWs, the whole bdi-based IO congestion feedback mechanism no longer
functions as intended, and so I'm betting a lot of the memory
reclaim heuristics no longer function as they were intended to...

2. The block layer write throttle is full of priority inversions.
Apart from the log IO one I fixed in this series, I noticed that
swap in/out has a major problem. I lost count of the number of OOM
kills that occurred from the swap in path when there were several
processes blocked in wbt_wait() in the block layer in the swap out
path. i.e. if swap out had been making progress, swap in would not
have oom killed. Hence I found it still necessary to throttle direct
reclaim back in the shrinker, as there wasn't a reliable way to get
the core reclaim code to throttle effectively.

FWIW, from the swap in/out perspective, this whole inversion problem
is made worse by #1: the congestion_wait/wait_iff_congested
interfaces being broken. Direct reclaim uses wait_iff_congested() to
back off if kswapd has indicated that the node is congested
(PGDAT_CONGESTED) and reclaim is struggling to make progress.
However, this backoff never actually happens now and hence direct
reclaim barrels into the swap code as hard as it can and blocks in
wbt_wait() waiting behind other swap IO instead of backing off and
waiting for some IO to complete and then retrying its allocation....

3. the memory reclaim code is so full of special case heuristics I'd
be surprised if anyone knows how it actually functions. That's why
I'm not surprised nobody noticed that the congestion backoff code
doesn't actually work properly anymore. And it's impossible to
tell where it is best to place back-off functionality because there
are so many different places that do special case back-offs and
retry loops and they all do it differently.

Hence my attempts to get shrinker driven back-offs to work
effectively have largely been hitting different parts of the code
with a heavy sledgehammer and seeing if there was any observable
effect on the inode cache reclaim patterns and OOM kill resistance.
There's a limit to how effective such brute force discovery can be,
and mixed with the lack of functioning congestion back-off I never
really found the best place to throttle direct reclaim effectively.

So maybe if we fix the bdi congestion interfaces so they work again
we can get rid of the waiting in direct reclaim, but right now I
don't see any other choice. One battle at a time....

Comments, thoughts welcome.

-Dave.


Diffstat:

 drivers/staging/android/ashmem.c |   8 +-
 fs/gfs2/glock.c                  |   5 +-
 fs/gfs2/quota.c                  |   6 +-
 fs/inode.c                       |   2 +-
 fs/nfs/dir.c                     |   6 +-
 fs/super.c                       |   6 +-
 fs/xfs/xfs_buf.c                 |  15 +-
 fs/xfs/xfs_icache.c              | 563 ++++++++-------------------------------
 fs/xfs/xfs_icache.h              |  20 +-
 fs/xfs/xfs_inode.h               |   8 +
 fs/xfs/xfs_inode_item.c          |  28 +-
 fs/xfs/xfs_log.c                 |  19 +-
 fs/xfs/xfs_log_cil.c             |   7 +-
 fs/xfs/xfs_log_priv.h            |   3 +-
 fs/xfs/xfs_mount.c               |  10 +-
 fs/xfs/xfs_mount.h               |   6 +-
 fs/xfs/xfs_qm.c                  |  11 +-
 fs/xfs/xfs_super.c               |  96 +++++--
 fs/xfs/xfs_trans_ail.c           | 104 ++++++--
 fs/xfs/xfs_trans_priv.h          |   8 +-
 include/linux/shrinker.h         |   7 +
 include/linux/swap.h             |   8 +-
 include/trace/events/vmscan.h    |  69 +++--
 mm/slab.c                        |   2 +-
 mm/slob.c                        |   2 +-
 mm/slub.c                        |   2 +-
 mm/vmscan.c                      | 203 +++++++++-----
 net/sunrpc/auth.c                |   5 +-
 28 files changed, 549 insertions(+), 680 deletions(-)



^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 01/24] mm: directed shrinker work deferral
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-02 15:27   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers Dave Chinner
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Introduce a mechanism for ->count_objects() to indicate to the
shrinker infrastructure that the reclaim context will not allow
scanning work to be done and so the work it decides is necessary
needs to be deferred.

This simplifies the code by separating out the accounting of
deferred work from the actual doing of the work, and allows better
decisions to be made by the shrinker control logic on what action it
can take.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h | 7 +++++++
 mm/vmscan.c              | 8 ++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..af78c475fc32 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -31,6 +31,13 @@ struct shrink_control {
 
 	/* current memcg being shrunk (for memcg aware shrinkers) */
 	struct mem_cgroup *memcg;
+
+	/*
+	 * set by ->count_objects if reclaim context prevents reclaim from
+	 * occurring. This allows the shrinker to immediately defer all the
+	 * work and not even attempt to scan the cache.
+	 */
+	bool will_defer;
 };
 
 #define SHRINK_STOP (~0UL)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44df66a98f2a..ae3035fe94bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 				   freeable, delta, total_scan, priority);
 
+	/*
+	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
+	 * defer the work to a context that can scan the cache.
+	 */
+	if (shrinkctl->will_defer)
+		goto done;
+
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
 	 * pass to avoid too frequent shrinker calls, but if the slab has less
@@ -575,6 +582,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 
+done:
 	if (next_deferred >= scanned)
 		next_deferred -= scanned;
 	else
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
  2019-08-01  2:17 ` [PATCH 01/24] mm: directed shrinker work deferral Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-02 15:27   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 03/24] mm: factor shrinker work calculations Dave Chinner
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

For shrinkers that currently avoid scanning when called under
GFP_NOFS contexts, convert them to use the new ->will_defer flag
rather than checking and returning errors during scans.

This makes it very clear that these shrinkers are not doing any work
because of the context limitations, not because there is no work
that can be done.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 drivers/staging/android/ashmem.c |  8 ++++----
 fs/gfs2/glock.c                  |  5 +++--
 fs/gfs2/quota.c                  |  6 +++---
 fs/nfs/dir.c                     |  6 +++---
 fs/super.c                       |  6 +++---
 fs/xfs/xfs_buf.c                 |  4 ++++
 fs/xfs/xfs_qm.c                  | 11 ++++++++---
 net/sunrpc/auth.c                |  5 ++---
 8 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 74d497d39c5a..fd9027dbd28c 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -438,10 +438,6 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	unsigned long freed = 0;
 
-	/* We might recurse into filesystem code, so bail out if necessary */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	if (!mutex_trylock(&ashmem_mutex))
 		return -1;
 
@@ -478,6 +474,10 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 static unsigned long
 ashmem_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
+	/* We might recurse into filesystem code, so bail out if necessary */
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->will_defer = true;
+
 	/*
 	 * note that lru_count is count of pages on the lru, not a count of
 	 * objects on the list. This means the scan function needs to return the
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e23fb8b7b020..08c95172d0e5 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1517,14 +1517,15 @@ static long gfs2_scan_glock_lru(int nr)
 static unsigned long gfs2_glock_shrink_scan(struct shrinker *shrink,
 					    struct shrink_control *sc)
 {
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
 	return gfs2_scan_glock_lru(sc->nr_to_scan);
 }
 
 static unsigned long gfs2_glock_shrink_count(struct shrinker *shrink,
 					     struct shrink_control *sc)
 {
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->will_defer = true;
+
 	return vfs_pressure_ratio(atomic_read(&lru_count));
 }
 
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 69c4b77f127b..d35beda906e8 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -166,9 +166,6 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 	LIST_HEAD(dispose);
 	unsigned long freed;
 
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	freed = list_lru_shrink_walk(&gfs2_qd_lru, sc,
 				     gfs2_qd_isolate, &dispose);
 
@@ -180,6 +177,9 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink,
 					  struct shrink_control *sc)
 {
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->will_defer = true;
+
 	return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc));
 }
 
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8d501093660f..73735ab1d623 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2202,10 +2202,7 @@ unsigned long
 nfs_access_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	int nr_to_scan = sc->nr_to_scan;
-	gfp_t gfp_mask = sc->gfp_mask;
 
-	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return SHRINK_STOP;
 	return nfs_do_access_cache_scan(nr_to_scan);
 }
 
@@ -2213,6 +2210,9 @@ nfs_access_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 unsigned long
 nfs_access_cache_count(struct shrinker *shrink, struct shrink_control *sc)
 {
+	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		sc->will_defer = true;
+
 	return vfs_pressure_ratio(atomic_long_read(&nfs_access_nr_entries));
 }
 
diff --git a/fs/super.c b/fs/super.c
index 113c58f19425..66dd2af6cfde 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -73,9 +73,6 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
 	 * to recurse into the FS that called us in clear_inode() and friends..
 	 */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	if (!trylock_super(sb))
 		return SHRINK_STOP;
 
@@ -140,6 +137,9 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		return 0;
 	smp_rmb();
 
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->will_defer = true;
+
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index ca0849043f54..6e0f76532535 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1680,6 +1680,10 @@ xfs_buftarg_shrink_count(
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
+
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->will_defer = true;
+
 	return list_lru_shrink_count(&btp->bt_lru, sc);
 }
 
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 5e7a37f0cf84..13c842e8f13b 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -502,9 +502,6 @@ xfs_qm_shrink_scan(
 	unsigned long		freed;
 	int			error;
 
-	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
-		return 0;
-
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
 
@@ -534,6 +531,14 @@ xfs_qm_shrink_count(
 	struct xfs_quotainfo	*qi = container_of(shrink,
 					struct xfs_quotainfo, qi_shrinker);
 
+	/*
+	 * __GFP_DIRECT_RECLAIM is used here to avoid blocking kswapd
+	 */
+	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) !=
+					(__GFP_FS|__GFP_DIRECT_RECLAIM)) {
+		sc->will_defer = true;
+	}
+
 	return list_lru_shrink_count(&qi->qi_lru, sc);
 }
 
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index cdb05b48de44..6babcbac4a00 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -527,9 +527,6 @@ static unsigned long
 rpcauth_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 
 {
-	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return SHRINK_STOP;
-
 	/* nothing left, don't come back */
 	if (list_empty(&cred_unused))
 		return SHRINK_STOP;
@@ -541,6 +538,8 @@ static unsigned long
 rpcauth_cache_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 
 {
+	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		sc->will_defer = true;
 	return number_cred_unused * sysctl_vfs_cache_pressure / 100;
 }
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 03/24] mm: factor shrinker work calculations
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
  2019-08-01  2:17 ` [PATCH 01/24] mm: directed shrinker work deferral Dave Chinner
  2019-08-01  2:17 ` [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-02 15:08   ` Nikolay Borisov
  2019-08-02 15:31   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
                   ` (21 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Start to clean up the shrinker code by factoring out the calculation
that determines how much work to do. This separates the calculation
from clamping and other adjustments that are done before the
shrinker work is run.

Also convert the calculation for the amount of work to be done to
use 64 bit logic so we don't have to keep jumping through hoops to
keep calculations within 32 bits on 32 bit systems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c | 74 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 47 insertions(+), 27 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ae3035fe94bc..b7472953b0e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -464,13 +464,45 @@ EXPORT_SYMBOL(unregister_shrinker);
 
 #define SHRINK_BATCH 128
 
+/*
+ * Calculate the number of new objects to scan this time around. Return
+ * the work to be done. If there are freeable objects, return that number in
+ * @freeable_objects.
+ */
+static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
+			    struct shrinker *shrinker, int priority,
+			    int64_t *freeable_objects)
+{
+	uint64_t delta;
+	uint64_t freeable;
+
+	freeable = shrinker->count_objects(shrinker, shrinkctl);
+	if (freeable == 0 || freeable == SHRINK_EMPTY)
+		return freeable;
+
+	if (shrinker->seeks) {
+		delta = freeable >> (priority - 2);
+		do_div(delta, shrinker->seeks);
+	} else {
+		/*
+		 * These objects don't require any IO to create. Trim
+		 * them aggressively under memory pressure to keep
+		 * them from causing refetches in the IO caches.
+		 */
+		delta = freeable / 2;
+	}
+
+	*freeable_objects = freeable;
+	return delta > 0 ? delta : 0;
+}
+
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 				    struct shrinker *shrinker, int priority)
 {
 	unsigned long freed = 0;
-	unsigned long long delta;
 	long total_scan;
-	long freeable;
+	int64_t freeable_objects = 0;
+	int64_t scan_count;
 	long nr;
 	long new_nr;
 	int nid = shrinkctl->nid;
@@ -481,9 +513,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 		nid = 0;
 
-	freeable = shrinker->count_objects(shrinker, shrinkctl);
-	if (freeable == 0 || freeable == SHRINK_EMPTY)
-		return freeable;
+	scan_count = shrink_scan_count(shrinkctl, shrinker, priority,
+					&freeable_objects);
+	if (scan_count == 0 || scan_count == SHRINK_EMPTY)
+		return scan_count;
 
 	/*
 	 * copy the current shrinker scan count into a local variable
@@ -492,25 +525,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 */
 	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
 
-	total_scan = nr;
-	if (shrinker->seeks) {
-		delta = freeable >> priority;
-		delta *= 4;
-		do_div(delta, shrinker->seeks);
-	} else {
-		/*
-		 * These objects don't require any IO to create. Trim
-		 * them aggressively under memory pressure to keep
-		 * them from causing refetches in the IO caches.
-		 */
-		delta = freeable / 2;
-	}
-
-	total_scan += delta;
+	total_scan = nr + scan_count;
 	if (total_scan < 0) {
 		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
 		       shrinker->scan_objects, total_scan);
-		total_scan = freeable;
+		total_scan = scan_count;
 		next_deferred = nr;
 	} else
 		next_deferred = total_scan;
@@ -527,19 +546,20 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * Hence only allow the shrinker to scan the entire cache when
 	 * a large delta change is calculated directly.
 	 */
-	if (delta < freeable / 4)
-		total_scan = min(total_scan, freeable / 2);
+	if (scan_count < freeable_objects / 4)
+		total_scan = min_t(long, total_scan, freeable_objects / 2);
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
 	 * never try to free more than twice the estimate number of
 	 * freeable entries.
 	 */
-	if (total_scan > freeable * 2)
-		total_scan = freeable * 2;
+	if (total_scan > freeable_objects * 2)
+		total_scan = freeable_objects * 2;
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
-				   freeable, delta, total_scan, priority);
+				   freeable_objects, scan_count,
+				   total_scan, priority);
 
 	/*
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
@@ -564,7 +584,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * possible.
 	 */
 	while (total_scan >= batch_size ||
-	       total_scan >= freeable) {
+	       total_scan >= freeable_objects) {
 		unsigned long ret;
 		unsigned long nr_to_scan = min(batch_size, total_scan);
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 04/24] shrinker: defer work only to kswapd
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (2 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 03/24] mm: factor shrinker work calculations Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-02 15:34   ` Brian Foster
                     ` (3 more replies)
  2019-08-01  2:17 ` [PATCH 05/24] shrinker: clean up variable types and tracepoints Dave Chinner
                   ` (20 subsequent siblings)
  24 siblings, 4 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Right now deferred work is picked up by whatever GFP_KERNEL context
reclaimer wins the race to empty the node's deferred work
counter. However, if there are lots of direct reclaimers, that
work might be continually picked up by contexts that can't do any
work, and so the opportunities to do the work are missed by contexts
that could do it.

A further problem with the current code is that the deferred work
can be picked up by a random direct reclaimer, resulting in that
specific process having to do all the deferred reclaim work and
hence can take extremely long latencies if the reclaim work blocks
regularly. This is not good for direct reclaim fairness or for
minimising long tail latency events.

To avoid these problems, simply limit deferred work to kswapd
contexts. We know kswapd is a context that can always do reclaim
work, and hence deferring work to kswapd allows the deferred work to
be done in the background and not adversely affect any specific
process context doing direct reclaim.

The advantage of this is that the amount of work to be done in direct
reclaim is now bound and predictable - it is entirely based on
the cache's freeable objects and the reclaim priority. Hence all
direct reclaimers running at the same time should be doing
relatively equal amounts of work, thereby reducing the incidence of
long tail latencies due to uneven reclaim workloads.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c | 93 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 50 insertions(+), 43 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7472953b0e6..c583b4efb9bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -500,15 +500,15 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 				    struct shrinker *shrinker, int priority)
 {
 	unsigned long freed = 0;
-	long total_scan;
 	int64_t freeable_objects = 0;
 	int64_t scan_count;
-	long nr;
+	int64_t scanned_objects = 0;
+	int64_t next_deferred = 0;
+	int64_t deferred_count = 0;
 	long new_nr;
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
-	long scanned = 0, next_deferred;
 
 	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 		nid = 0;
@@ -519,47 +519,53 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		return scan_count;
 
 	/*
-	 * copy the current shrinker scan count into a local variable
-	 * and zero it so that other concurrent shrinker invocations
-	 * don't also do this scanning work.
+	 * If kswapd, we take all the deferred work and do it here. We don't let
+	 * direct reclaim do this, because then it means some poor sod is going
+	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
+	 * amount of reclaim work from concurrent kswapd operations. Hence we do
+	 * the work in the wrong place, at the wrong time, and it's largely
+	 * unpredictable.
+	 *
+	 * By doing the deferred work only in kswapd, we can schedule the work
+	 * according to the reclaim priority - low priority reclaim will do
+	 * less deferred work, hence we'll do more of the deferred work the more
+	 * desperate we become for free memory. This avoids the need to
+	 * specifically avoid deferred work windup, as low amounts of memory
+	 * pressure won't excessively trim caches anymore.
 	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+	if (current_is_kswapd()) {
+		int64_t	deferred_scan;
 
-	total_scan = nr + scan_count;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = scan_count;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
+		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
 
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
-	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
-	 */
-	if (scan_count < freeable_objects / 4)
-		total_scan = min_t(long, total_scan, freeable_objects / 2);
+		/* we want to scan 5-10% of the deferred work here at minimum */
+		deferred_scan = deferred_count;
+		if (priority)
+			do_div(deferred_scan, priority);
+		scan_count += deferred_scan;
+
+		/*
+		 * If there is more deferred work than the number of freeable
+		 * items in the cache, limit the amount of work we will carry
+		 * over to the next kswapd run on this cache. This prevents
+		 * deferred work windup.
+		 */
+		if (deferred_count > freeable_objects * 2)
+			deferred_count = freeable_objects * 2;
+
+	}
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
 	 * never try to free more than twice the estimate number of
 	 * freeable entries.
 	 */
-	if (total_scan > freeable_objects * 2)
-		total_scan = freeable_objects * 2;
+	if (scan_count > freeable_objects * 2)
+		scan_count = freeable_objects * 2;
 
-	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
+	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
 				   freeable_objects, scan_count,
-				   total_scan, priority);
+				   scan_count, priority);
 
 	/*
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
@@ -583,10 +589,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * scanning at high prio and therefore should try to reclaim as much as
 	 * possible.
 	 */
-	while (total_scan >= batch_size ||
-	       total_scan >= freeable_objects) {
+	while (scan_count >= batch_size ||
+	       scan_count >= freeable_objects) {
 		unsigned long ret;
-		unsigned long nr_to_scan = min(batch_size, total_scan);
+		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
 
 		shrinkctl->nr_to_scan = nr_to_scan;
 		shrinkctl->nr_scanned = nr_to_scan;
@@ -596,17 +602,17 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		freed += ret;
 
 		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
-		total_scan -= shrinkctl->nr_scanned;
-		scanned += shrinkctl->nr_scanned;
+		scan_count -= shrinkctl->nr_scanned;
+		scanned_objects += shrinkctl->nr_scanned;
 
 		cond_resched();
 	}
 
 done:
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
-	else
-		next_deferred = 0;
+	if (deferred_count)
+		next_deferred = deferred_count - scanned_objects;
+	else if (scan_count > 0)
+		next_deferred = scan_count;
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates. If we exhausted the
@@ -618,7 +624,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	else
 		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
+					scan_count);
 	return freed;
 }
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 05/24] shrinker: clean up variable types and tracepoints
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (3 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01  2:17 ` [PATCH 06/24] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

The tracepoint information in the shrinker code doesn't make a lot of
sense anymore and contains redundant information as a result of the
changes in the patchset. Refine the information passed to the
tracepoints so they expose the operation of the shrinkers more
precisely, and clean up the remaining code and variables in the
shrinker code so it all makes sense.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/trace/events/vmscan.h | 69 ++++++++++++++++-------------------
 mm/vmscan.c                   | 24 +++++-------
 2 files changed, 41 insertions(+), 52 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a5ab2973e8dc..110637d9efa5 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -184,84 +184,77 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 
 TRACE_EVENT(mm_shrink_slab_start,
 	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
-		long nr_objects_to_shrink, unsigned long cache_items,
-		unsigned long long delta, unsigned long total_scan,
-		int priority),
+		int64_t deferred_count, int64_t freeable_objects,
+		int64_t scan_count, int priority),
 
-	TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan,
+	TP_ARGS(shr, sc, deferred_count, freeable_objects, scan_count,
 		priority),
 
 	TP_STRUCT__entry(
 		__field(struct shrinker *, shr)
 		__field(void *, shrink)
 		__field(int, nid)
-		__field(long, nr_objects_to_shrink)
-		__field(gfp_t, gfp_flags)
-		__field(unsigned long, cache_items)
-		__field(unsigned long long, delta)
-		__field(unsigned long, total_scan)
+		__field(int64_t, deferred_count)
+		__field(int64_t, freeable_objects)
+		__field(int64_t, scan_count)
 		__field(int, priority)
+		__field(gfp_t, gfp_flags)
 	),
 
 	TP_fast_assign(
 		__entry->shr = shr;
 		__entry->shrink = shr->scan_objects;
 		__entry->nid = sc->nid;
-		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
-		__entry->gfp_flags = sc->gfp_mask;
-		__entry->cache_items = cache_items;
-		__entry->delta = delta;
-		__entry->total_scan = total_scan;
+		__entry->deferred_count = deferred_count;
+		__entry->freeable_objects = freeable_objects;
+		__entry->scan_count = scan_count;
 		__entry->priority = priority;
+		__entry->gfp_flags = sc->gfp_mask;
 	),
 
-	TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+	TP_printk("%pS %p: nid: %d scan count %lld freeable items %lld deferred count %lld priority %d gfp_flags %s",
 		__entry->shrink,
 		__entry->shr,
 		__entry->nid,
-		__entry->nr_objects_to_shrink,
-		show_gfp_flags(__entry->gfp_flags),
-		__entry->cache_items,
-		__entry->delta,
-		__entry->total_scan,
-		__entry->priority)
+		__entry->scan_count,
+		__entry->freeable_objects,
+		__entry->deferred_count,
+		__entry->priority,
+		show_gfp_flags(__entry->gfp_flags))
 );
 
 TRACE_EVENT(mm_shrink_slab_end,
-	TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval,
-		long unused_scan_cnt, long new_scan_cnt, long total_scan),
+	TP_PROTO(struct shrinker *shr, int nid, int64_t freed_objects,
+		int64_t scanned_objects, int64_t deferred_scan),
 
-	TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt,
-		total_scan),
+	TP_ARGS(shr, nid, freed_objects, scanned_objects,
+		deferred_scan),
 
 	TP_STRUCT__entry(
 		__field(struct shrinker *, shr)
 		__field(int, nid)
 		__field(void *, shrink)
-		__field(long, unused_scan)
-		__field(long, new_scan)
-		__field(int, retval)
-		__field(long, total_scan)
+		__field(long long, freed_objects)
+		__field(long long, scanned_objects)
+		__field(long long, deferred_scan)
 	),
 
 	TP_fast_assign(
 		__entry->shr = shr;
 		__entry->nid = nid;
 		__entry->shrink = shr->scan_objects;
-		__entry->unused_scan = unused_scan_cnt;
-		__entry->new_scan = new_scan_cnt;
-		__entry->retval = shrinker_retval;
-		__entry->total_scan = total_scan;
+		__entry->freed_objects = freed_objects;
+		__entry->scanned_objects = scanned_objects;
+		__entry->deferred_scan = deferred_scan;
 	),
 
-	TP_printk("%pS %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+	TP_printk("%pS %p: nid: %d freed objects %lld scanned objects %lld, deferred scan %lld",
 		__entry->shrink,
 		__entry->shr,
 		__entry->nid,
-		__entry->unused_scan,
-		__entry->new_scan,
-		__entry->total_scan,
-		__entry->retval)
+		__entry->freed_objects,
+		__entry->scanned_objects,
+		__entry->deferred_scan)
 );
 
 TRACE_EVENT(mm_vmscan_lru_isolate,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c583b4efb9bf..d5ce26b4d49d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -505,7 +505,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	int64_t scanned_objects = 0;
 	int64_t next_deferred = 0;
 	int64_t deferred_count = 0;
-	long new_nr;
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
@@ -564,8 +563,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		scan_count = freeable_objects * 2;
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
-				   freeable_objects, scan_count,
-				   scan_count, priority);
+				   freeable_objects, scan_count, priority);
 
 	/*
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
@@ -609,23 +607,21 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 done:
+	/*
+	 * Calculate the remaining work that we need to defer to kswapd, and
+	 * store it in a manner that handles concurrent updates. If we exhausted
+	 * the scan, there is no need to do an update.
+	 */
 	if (deferred_count)
 		next_deferred = deferred_count - scanned_objects;
 	else if (scan_count > 0)
 		next_deferred = scan_count;
-	/*
-	 * move the unused scan count back into the shrinker in a
-	 * manner that handles concurrent updates. If we exhausted the
-	 * scan, there is no need to do an update.
-	 */
+
 	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
-						&shrinker->nr_deferred[nid]);
-	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+		atomic_long_add(next_deferred, &shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
-					scan_count);
+	trace_mm_shrink_slab_end(shrinker, nid, freed, scanned_objects,
+				 next_deferred);
 	return freed;
 }
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 06/24] mm: reclaim_state records pages reclaimed, not slabs
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (4 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 05/24] shrinker: clean up variable types and tracepoints Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01  2:17 ` [PATCH 07/24] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Name change only, no logic changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c           | 2 +-
 include/linux/swap.h | 5 +++--
 mm/slab.c            | 2 +-
 mm/slob.c            | 2 +-
 mm/slub.c            | 2 +-
 mm/vmscan.c          | 4 ++--
 6 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 0f1e3b563c47..8c70f0643218 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -762,7 +762,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 			else
 				__count_vm_events(PGINODESTEAL, reap);
 			if (current->reclaim_state)
-				current->reclaim_state->reclaimed_slab += reap;
+				current->reclaim_state->reclaimed_pages += reap;
 		}
 		iput(inode);
 		spin_lock(lru_lock);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..978e6cd5c05a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -126,10 +126,11 @@ union swap_header {
 
 /*
  * current->reclaim_state points to one of these when a task is running
- * memory reclaim
+ * memory reclaim. It is typically used by shrinkers to return reclaim
+ * information back to the main vmscan loop.
  */
 struct reclaim_state {
-	unsigned long reclaimed_slab;
+	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
 };
 
 #ifdef __KERNEL__
diff --git a/mm/slab.c b/mm/slab.c
index 9df370558e5d..abc97e340f6d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1396,7 +1396,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 	page->mapping = NULL;
 
 	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += 1 << order;
+		current->reclaim_state->reclaimed_pages += 1 << order;
 	uncharge_slab_page(page, order, cachep);
 	__free_pages(page, order);
 }
diff --git a/mm/slob.c b/mm/slob.c
index 7f421d0ca9ab..c46ce297805e 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -208,7 +208,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 static void slob_free_pages(void *b, int order)
 {
 	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += 1 << order;
+		current->reclaim_state->reclaimed_pages += 1 << order;
 	free_pages((unsigned long)b, order);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index e6c030e47364..a3e4bc62383b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1743,7 +1743,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 
 	page->mapping = NULL;
 	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += pages;
+		current->reclaim_state->reclaimed_pages += pages;
 	uncharge_slab_page(page, order, s);
 	__free_pages(page, order);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d5ce26b4d49d..231ddcfcd046 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2765,8 +2765,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
 		if (reclaim_state) {
-			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
+			sc->nr_reclaimed += reclaim_state->reclaimed_pages;
+			reclaim_state->reclaimed_pages = 0;
 		}
 
 		/* Record the subtree's reclaim efficiency */
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 07/24] mm: back off direct reclaim on excessive shrinker deferral
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (5 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 06/24] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01  2:17 ` [PATCH 08/24] mm: kswapd backoff for shrinkers Dave Chinner
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When the majority of possible shrinker reclaim work is deferred by
the shrinkers (e.g. due to GFP_NOFS context), and there is more work
deferred than LRU pages were scanned, back off reclaim if there are
large amounts of IO in progress.

This tends to occur when there are inode cache heavy workloads that
have little page cache or application memory pressure on filesystems
like XFS. Inode cache heavy workloads involve lots of IO, so if we
are getting device congestion it is indicative of memory reclaim
running up against an IO throughput limitation. In this situation
we need to throttle direct reclaim, as we need to wait for kswapd to
get some of the deferred work done.

However, if there is no device congestion, then the system is
keeping up with both the workload and memory reclaim and so there's
no need to throttle.

Hence we should only back off scanning for a bit if we see this
condition and there is block device congestion present.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/swap.h |  2 ++
 mm/vmscan.c          | 30 +++++++++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 978e6cd5c05a..1a3502a9bc1f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -131,6 +131,8 @@ union swap_header {
  */
 struct reclaim_state {
 	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
+	unsigned long	scanned_objects;	/* quantity of work done */ 
+	unsigned long	deferred_objects;	/* work that wasn't done */
 };
 
 #ifdef __KERNEL__
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 231ddcfcd046..4dc8e333f2c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -569,8 +569,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
 	 * defer the work to a context that can scan the cache.
 	 */
-	if (shrinkctl->will_defer)
+	if (shrinkctl->will_defer) {
+		if (current->reclaim_state)
+			current->reclaim_state->deferred_objects += scan_count;
 		goto done;
+	}
 
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
@@ -605,6 +608,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 
 		cond_resched();
 	}
+	if (current->reclaim_state)
+		current->reclaim_state->scanned_objects += scanned_objects;
 
 done:
 	/*
@@ -2766,7 +2771,30 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_pages;
+
+			/*
+			 * If we are deferring more work than we are actually
+			 * doing in the shrinkers, and we are scanning more
+			 * objects than we are pages, then we have a large amount
+			 * of slab caches we are deferring work to kswapd for.
+			 * We better back off here for a while, otherwise
+			 * we risk priority windup, swap storms and OOM kills
+			 * once we empty the page lists but still can't make
+			 * progress on the shrinker memory.
+			 *
+			 * kswapd won't ever defer work as it's run under a
+			 * GFP_KERNEL context and can always do work.
+			 */
+			if ((reclaim_state->deferred_objects >
+					sc->nr_scanned - nr_scanned) &&
+			    (reclaim_state->deferred_objects >
+					reclaim_state->scanned_objects)) {
+				wait_iff_congested(BLK_RW_ASYNC, HZ/50);
+			}
+
 			reclaim_state->reclaimed_pages = 0;
+			reclaim_state->deferred_objects = 0;
+			reclaim_state->scanned_objects = 0;
 		}
 
 		/* Record the subtree's reclaim efficiency */
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 08/24] mm: kswapd backoff for shrinkers
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (6 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 07/24] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01  2:17 ` [PATCH 09/24] xfs: don't allow log IO to be throttled Dave Chinner
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When kswapd reaches the end of the page LRU and starts hitting dirty
pages, the logic in shrink_node() allows it to back off and wait for
IO to complete, thereby preventing kswapd from scanning excessively
and driving the system into swap thrashing and OOM conditions.

When we have inode cache heavy workloads on XFS, we have exactly the
same problem with reclaiming inodes. The non-blocking kswapd reclaim
will keep putting pressure onto the inode cache which is unable to
make progress. When the system gets to the point where there are no
pages in the LRU to free, there is no swap left and there are no
clean inodes that can be freed, it will OOM. This has a specific
signature in OOM:

[  110.841987] Mem-Info:
[  110.842816] active_anon:241 inactive_anon:82 isolated_anon:1
                active_file:168 inactive_file:143 isolated_file:0
                unevictable:2621523 dirty:1 writeback:8 unstable:0
                slab_reclaimable:564445 slab_unreclaimable:420046
                mapped:1042 shmem:11 pagetables:6509 bounce:0
                free:77626 free_pcp:2 free_cma:0

In this case, we have about 500-600 pages left in the LRUs, but we
have ~565000 reclaimable slab pages still available for reclaim.
Unfortunately, they are mostly dirty inodes, and so we really need
to be able to throttle kswapd when shrinker progress is limited due
to reaching the dirty end of the LRU...

So, add a flag into the reclaim_state so if the shrinker decides it
needs kswapd to back off and wait for a while (for whatever reason)
it can do so.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/swap.h |  1 +
 mm/vmscan.c          | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1a3502a9bc1f..416680b1bf0c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -133,6 +133,7 @@ struct reclaim_state {
 	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
 	unsigned long	scanned_objects;	/* quantity of work done */ 
 	unsigned long	deferred_objects;	/* work that wasn't done */
+	bool		need_backoff;		/* tell kswapd to slow down */
 };
 
 #ifdef __KERNEL__
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4dc8e333f2c6..029dba76ee5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2844,8 +2844,16 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			 * implies that pages are cycling through the LRU
 			 * faster than they are written so also forcibly stall.
 			 */
-			if (sc->nr.immediate)
+			if (sc->nr.immediate) {
 				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			} else if (reclaim_state && reclaim_state->need_backoff) {
+				/*
+				 * Ditto, but it's a slab cache that is cycling
+				 * through the LRU faster than they are written
+				 */
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				reclaim_state->need_backoff = false;
+			}
 		}
 
 		/*
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (7 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 08/24] mm: kswapd backoff for shrinkers Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01 13:39   ` Chris Mason
  2019-08-01  2:17 ` [PATCH 10/24] xfs: fix missed wakeup on l_flush_wait Dave Chinner
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Running metadata intensive workloads, I've been seeing the AIL
push getting stuck on pinned buffers and triggering log forces.
The log force is taking a long time to run because the log IO is
getting throttled by wbt_wait() - the block layer writeback
throttle. It's being throttled because there is a huge amount of
metadata writeback going on which is filling the request queue.

IOWs, we have a priority inversion problem here.

Mark the log IO bios with REQ_IDLE so they don't get throttled
by the block layer writeback throttle. When we are forcing the CIL,
we are likely to need to do tens of log IOs, and they are issued as
fast as they can be built and the IO completed. Hence REQ_IDLE is
appropriate - it's an indication that more IO will follow shortly.

And because we also set REQ_SYNC, the writeback throttle will now
treat log IO the same way it treats direct IO writes - it will not
throttle them at all. Hence we solve the priority inversion problem
caused by the writeback throttle being unable to distinguish between
high priority log IO and background metadata writeback.
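
For reference, the writeback throttle's exemption looks roughly like
the check below (paraphrased from wbt_should_throttle() in
block/blk-wbt.c around this time; treat it as a sketch rather than a
verbatim quote):

static bool
log_bio_bypasses_wbt(
	struct bio	*bio)
{
	/*
	 * Writes carrying both REQ_SYNC and REQ_IDLE (e.g. O_DIRECT
	 * writes) are not throttled by the block layer writeback
	 * throttle, which is what the log bios now rely on.
	 */
	return (bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
	       (REQ_SYNC | REQ_IDLE);
}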

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 00e9f5c388d3..7bdea629e749 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1723,7 +1723,15 @@ xlog_write_iclog(
 	iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart + bno;
 	iclog->ic_bio.bi_end_io = xlog_bio_end_io;
 	iclog->ic_bio.bi_private = iclog;
-	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_FUA;
+
+	/*
+	 * We use REQ_SYNC | REQ_IDLE here to tell the block layer there are more
+	 * IOs coming immediately after this one. This prevents the block layer
+	 * writeback throttle from throttling log writes behind background
+	 * metadata writeback and causing priority inversions.
+	 */
+	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
+				REQ_IDLE | REQ_FUA;
 	if (need_flush)
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 10/24] xfs: fix missed wakeup on l_flush_wait
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (8 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 09/24] xfs: don't allow log IO to be throttled Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01  2:17 ` [PATCH 11/24] xfs:: account for memory freed from metadata buffers Dave Chinner
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Rik van Riel <riel@surriel.com>

The code in xlog_wait uses the spinlock to make adding the task to
the wait queue and setting the task state to UNINTERRUPTIBLE atomic
with respect to the waker.

Doing the wakeup after releasing the spinlock opens up the following
race condition:

Task 1					task 2
add task to wait queue
					wake up task
set task state to UNINTERRUPTIBLE

This issue was found through code inspection as a result of kworkers
being observed stuck in UNINTERRUPTIBLE state with an empty
wait queue. It is rare and largely unreproducible.

Simply moving the spin_unlock to after the wake_up_all ensures that
the waker cannot see a task on the waitqueue until that task has
set its state to UNINTERRUPTIBLE.

This bug dates back to the conversion of this code to generic
waitqueue infrastructure from a counting semaphore back in 2008
which didn't place the wakeups consistently w.r.t. the relevant
spin locks.
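
For context, the waiter side is roughly the xlog_wait() helper below
(paraphrased from fs/xfs/xfs_log_priv.h). The task adds itself to the
wait queue and sets its state while holding the lock, so a wakeup
issued under that same lock cannot slip in between those two steps:

static inline void
xlog_wait(
	wait_queue_head_t	*wq,
	spinlock_t		*lock) __releases(lock)
{
	DECLARE_WAITQUEUE(wait, current);

	/* both of these happen while the caller holds *lock */
	add_wait_queue_exclusive(wq, &wait);
	__set_current_state(TASK_UNINTERRUPTIBLE);

	spin_unlock(lock);
	schedule();
	remove_wait_queue(wq, &wait);
}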

[dchinner: Also fix a similar issue in the shutdown path on
xc_commit_wait. Update commit log with more details of the issue.]

Fixes: d748c62367eb ("[XFS] Convert l_flushsema to a sv_t")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 7bdea629e749..b78c5e95bbba 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2630,7 +2630,6 @@ xlog_state_do_callback(
 	int		   funcdidcallbacks; /* flag: function did callbacks */
 	int		   repeats;	/* for issuing console warnings if
 					 * looping too many times */
-	int		   wake = 0;
 
 	spin_lock(&log->l_icloglock);
 	first_iclog = iclog = log->l_iclog;
@@ -2826,11 +2825,9 @@ xlog_state_do_callback(
 #endif
 
 	if (log->l_iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_IOERROR))
-		wake = 1;
-	spin_unlock(&log->l_icloglock);
-
-	if (wake)
 		wake_up_all(&log->l_flush_wait);
+
+	spin_unlock(&log->l_icloglock);
 }
 
 
@@ -3930,7 +3927,9 @@ xfs_log_force_umount(
 	 * item committed callback functions will do this again under lock to
 	 * avoid races.
 	 */
+	spin_lock(&log->l_cilp->xc_push_lock);
 	wake_up_all(&log->l_cilp->xc_commit_wait);
+	spin_unlock(&log->l_cilp->xc_push_lock);
 	xlog_state_do_callback(log, true, NULL);
 
 #ifdef XFSERRORDEBUG
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 11/24] xfs:: account for memory freed from metadata buffers
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (9 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 10/24] xfs: fix missed wakeup on l_flush_wait Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01  8:16   ` Christoph Hellwig
  2019-08-01  2:17 ` [PATCH 12/24] xfs: correctly acount for reclaimable slabs Dave Chinner
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

The buffer cache shrinker frees more than just the xfs_buf slab
objects - it also frees the pages attached to the buffers. Make sure
the memory reclaim code accounts for this memory being freed
correctly, similar to how the inode shrinker accounts for pages
freed from the page cache due to mapping invalidation.

We also need to make sure that the mm subsystem knows these are
reclaimable objects. We provide the memory reclaim subsystem with a
shrinker to reclaim xfs_bufs, so we should really mark the slab
that way.

We also have a lot of xfs_bufs in a busy system, so spread them
around like we do with inodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 6e0f76532535..beb816cd54d6 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1667,6 +1667,14 @@ xfs_buftarg_shrink_scan(
 		struct xfs_buf *bp;
 		bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
 		list_del_init(&bp->b_lru);
+
+		/*
+		 * Account for the buffer memory freed here so memory reclaim
+		 * sees this and not just the xfs_buf slab entry being freed.
+		 */
+		if (current->reclaim_state)
+			current->reclaim_state->reclaimed_pages += bp->b_page_count;
+
 		xfs_buf_rele(bp);
 	}
 
@@ -2057,7 +2065,8 @@ int __init
 xfs_buf_init(void)
 {
 	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
-						KM_ZONE_HWALIGN, NULL);
+			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
+			NULL);
 	if (!xfs_buf_zone)
 		goto out;
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 12/24] xfs: correctly acount for reclaimable slabs
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (10 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 11/24] xfs:: account for memory freed from metadata buffers Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-06  5:52   ` Christoph Hellwig
  2019-08-01  2:17 ` [PATCH 13/24] xfs: synchronous AIL pushing Dave Chinner
                   ` (12 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

The XFS inode item slab is actually reclaimed by inode shrinker
callbacks from the memory reclaim subsystem. These should be marked
as reclaimable so the mm subsystem has the full picture of how much
memory it can actually reclaim from the XFS slab caches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f9450235533c..67b59815d0df 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1916,7 +1916,7 @@ xfs_init_zones(void)
 
 	xfs_ili_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_log_item_t), "xfs_ili",
-					KM_ZONE_SPREAD, NULL);
+					KM_ZONE_SPREAD | KM_ZONE_RECLAIM, NULL);
 	if (!xfs_ili_zone)
 		goto out_destroy_inode_zone;
 	xfs_icreate_zone = kmem_zone_init(sizeof(struct xfs_icreate_item),
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 13/24] xfs: synchronous AIL pushing
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (11 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 12/24] xfs: correctly acount for reclaimable slabs Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-05 17:51   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 14/24] xfs: tail updates only need to occur when LSN changes Dave Chinner
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Provide an interface to push the AIL to a target LSN and wait for
the tail of the log to move past that LSN. This is used to wait for
all items older than a specific LSN to either be cleaned (written
back) or relogged to a higher LSN in the AIL. The primary use for
this is to allow IO free inode reclaim throttling.

Factor the common AIL deletion code that does all the wakeups into a
helper so we only have one copy of this somewhat tricky code that
issues all the wakeups necessary when the LSN of the log tail
changes.
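
As a usage sketch (a hypothetical caller, not code from this patch),
a reclaim path that finds a dirty, unpinned inode could throttle on
its writeback like this:

static void
example_wait_for_inode_writeback(
	struct xfs_inode	*ip)
{
	xfs_lsn_t		lsn = 0;

	if (ip->i_itemp)
		lsn = ip->i_itemp->ili_item.li_lsn;

	/* push the AIL to the inode's LSN and wait for the tail to pass it */
	if (lsn)
		xfs_ail_push_sync(ip->i_mount->m_ail, lsn);
}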

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode_item.c | 12 +------
 fs/xfs/xfs_trans_ail.c  | 69 +++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_trans_priv.h |  6 +++-
 3 files changed, 62 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index c9a502eed204..7b942a63e992 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -743,17 +743,7 @@ xfs_iflush_done(
 				xfs_clear_li_failed(blip);
 			}
 		}
-
-		if (mlip_changed) {
-			if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
-				xlog_assign_tail_lsn_locked(ailp->ail_mount);
-			if (list_empty(&ailp->ail_head))
-				wake_up_all(&ailp->ail_empty);
-		}
-		spin_unlock(&ailp->ail_lock);
-
-		if (mlip_changed)
-			xfs_log_space_wake(ailp->ail_mount);
+		xfs_ail_delete_finish(ailp, mlip_changed);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 6ccfd75d3c24..9e3102179221 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -654,6 +654,37 @@ xfs_ail_push_all(
 		xfs_ail_push(ailp, threshold_lsn);
 }
 
+/*
+ * Push the AIL to a specific lsn and wait for it to complete.
+ */
+void
+xfs_ail_push_sync(
+	struct xfs_ail		*ailp,
+	xfs_lsn_t		threshold_lsn)
+{
+	struct xfs_log_item	*lip;
+	DEFINE_WAIT(wait);
+
+	spin_lock(&ailp->ail_lock);
+	while ((lip = xfs_ail_min(ailp)) != NULL) {
+		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
+		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
+		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
+			break;
+		/* XXX: cmpxchg? */
+		while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
+			xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn);
+		wake_up_process(ailp->ail_task);
+		spin_unlock(&ailp->ail_lock);
+		schedule();
+		spin_lock(&ailp->ail_lock);
+	}
+	spin_unlock(&ailp->ail_lock);
+
+	finish_wait(&ailp->ail_push, &wait);
+}
+
+
 /*
  * Push out all items in the AIL immediately and wait until the AIL is empty.
  */
@@ -764,6 +795,28 @@ xfs_ail_delete_one(
 	return mlip == lip;
 }
 
+void
+xfs_ail_delete_finish(
+	struct xfs_ail		*ailp,
+	bool			do_tail_update) __releases(ailp->ail_lock)
+{
+	struct xfs_mount	*mp = ailp->ail_mount;
+
+	if (!do_tail_update) {
+		spin_unlock(&ailp->ail_lock);
+		return;
+	}
+
+	if (!XFS_FORCED_SHUTDOWN(mp))
+		xlog_assign_tail_lsn_locked(mp);
+
+	wake_up_all(&ailp->ail_push);
+	if (list_empty(&ailp->ail_head))
+		wake_up_all(&ailp->ail_empty);
+	spin_unlock(&ailp->ail_lock);
+	xfs_log_space_wake(mp);
+}
+
 /**
  * Remove a log items from the AIL
  *
@@ -789,10 +842,9 @@ void
 xfs_trans_ail_delete(
 	struct xfs_ail		*ailp,
 	struct xfs_log_item	*lip,
-	int			shutdown_type) __releases(ailp->ail_lock)
+	int			shutdown_type)
 {
 	struct xfs_mount	*mp = ailp->ail_mount;
-	bool			mlip_changed;
 
 	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
 		spin_unlock(&ailp->ail_lock);
@@ -805,17 +857,7 @@ xfs_trans_ail_delete(
 		return;
 	}
 
-	mlip_changed = xfs_ail_delete_one(ailp, lip);
-	if (mlip_changed) {
-		if (!XFS_FORCED_SHUTDOWN(mp))
-			xlog_assign_tail_lsn_locked(mp);
-		if (list_empty(&ailp->ail_head))
-			wake_up_all(&ailp->ail_empty);
-	}
-
-	spin_unlock(&ailp->ail_lock);
-	if (mlip_changed)
-		xfs_log_space_wake(ailp->ail_mount);
+	xfs_ail_delete_finish(ailp, xfs_ail_delete_one(ailp, lip));
 }
 
 int
@@ -834,6 +876,7 @@ xfs_trans_ail_init(
 	spin_lock_init(&ailp->ail_lock);
 	INIT_LIST_HEAD(&ailp->ail_buf_list);
 	init_waitqueue_head(&ailp->ail_empty);
+	init_waitqueue_head(&ailp->ail_push);
 
 	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->ail_mount->m_fsname);
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 2e073c1c4614..5ab70b9b896f 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -61,6 +61,7 @@ struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
+	wait_queue_head_t	ail_push;
 };
 
 /*
@@ -92,8 +93,10 @@ xfs_trans_ail_update(
 }
 
 bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
+void xfs_ail_delete_finish(struct xfs_ail *ailp, bool do_tail_update)
+			__releases(ailp->ail_lock);
 void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
-		int shutdown_type) __releases(ailp->ail_lock);
+		int shutdown_type);
 
 static inline void
 xfs_trans_ail_remove(
@@ -111,6 +114,7 @@ xfs_trans_ail_remove(
 }
 
 void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
+void			xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t);
 void			xfs_ail_push_all(struct xfs_ail *);
 void			xfs_ail_push_all_sync(struct xfs_ail *);
 struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 14/24] xfs: tail updates only need to occur when LSN changes
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (12 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 13/24] xfs: synchronous AIL pushing Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-05 17:53   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint Dave Chinner
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

We currently wake anything waiting on the log tail to move whenever
the log item at the tail of the log is removed. Historically this
was fine behaviour because there were very few items at any given
LSN. But with delayed logging, there may be thousands of items at
any given LSN, and we can't move the tail until they are all gone.

Hence if we are removing them in near tail-first order, we might be
waking up processes waiting on the tail LSN to change (e.g. log
space waiters) repeatedly without them being able to make progress.
This also occurs with the new sync push waiters, and can result in
thousands of spurious wakeups every second when under heavy direct
reclaim pressure.

To fix this, check that the tail LSN has actually changed on the
AIL before triggering wakeups. This will reduce the number of
spurious wakeups when doing bulk AIL removal and make this code much
more efficient.

XXX: occasionally get a temporary hang in xfs_ail_push_sync() with
this change - log force from log worker gets things moving again.
Only happens under extreme memory pressure - possibly push racing
with a tail update on an empty log. Needs further investigation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode_item.c | 18 +++++++++++++-----
 fs/xfs/xfs_trans_ail.c  | 37 ++++++++++++++++++++++++++++---------
 fs/xfs/xfs_trans_priv.h |  4 ++--
 3 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 7b942a63e992..16a7d6f752c9 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -731,19 +731,27 @@ xfs_iflush_done(
 	 * holding the lock before removing the inode from the AIL.
 	 */
 	if (need_ail) {
-		bool			mlip_changed = false;
+		xfs_lsn_t	tail_lsn = 0;
 
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(blip, &tmp, li_bio_list) {
 			if (INODE_ITEM(blip)->ili_logged &&
-			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
-				mlip_changed |= xfs_ail_delete_one(ailp, blip);
-			else {
+			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
+				/*
+				 * xfs_ail_delete_finish() only cares about the
+				 * lsn of the first tail item removed, any others
+				 * will be at the same or higher lsn so we just
+				 * ignore them.
+				 */
+				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
+				if (!tail_lsn && lsn)
+					tail_lsn = lsn;
+			} else {
 				xfs_clear_li_failed(blip);
 			}
 		}
-		xfs_ail_delete_finish(ailp, mlip_changed);
+		xfs_ail_delete_finish(ailp, tail_lsn);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 9e3102179221..00d66175f41a 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -108,17 +108,25 @@ xfs_ail_next(
  * We need the AIL lock in order to get a coherent read of the lsn of the last
  * item in the AIL.
  */
+static xfs_lsn_t
+__xfs_ail_min_lsn(
+	struct xfs_ail		*ailp)
+{
+	struct xfs_log_item	*lip = xfs_ail_min(ailp);
+
+	if (lip)
+		return lip->li_lsn;
+	return 0;
+}
+
 xfs_lsn_t
 xfs_ail_min_lsn(
 	struct xfs_ail		*ailp)
 {
-	xfs_lsn_t		lsn = 0;
-	struct xfs_log_item	*lip;
+	xfs_lsn_t		lsn;
 
 	spin_lock(&ailp->ail_lock);
-	lip = xfs_ail_min(ailp);
-	if (lip)
-		lsn = lip->li_lsn;
+	lsn = __xfs_ail_min_lsn(ailp);
 	spin_unlock(&ailp->ail_lock);
 
 	return lsn;
@@ -779,12 +787,20 @@ xfs_trans_ail_update_bulk(
 	}
 }
 
-bool
+/*
+ * Delete one log item from the AIL.
+ *
+ * If this item was at the tail of the AIL, return the LSN of the log item so
+ * that we can use it to check if the LSN of the tail of the log has moved
+ * when finishing up the AIL delete process in xfs_ail_delete_finish().
+ */
+xfs_lsn_t
 xfs_ail_delete_one(
 	struct xfs_ail		*ailp,
 	struct xfs_log_item	*lip)
 {
 	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
+	xfs_lsn_t		lsn = lip->li_lsn;
 
 	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
 	xfs_ail_delete(ailp, lip);
@@ -792,17 +808,20 @@ xfs_ail_delete_one(
 	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
 	lip->li_lsn = 0;
 
-	return mlip == lip;
+	if (mlip == lip)
+		return lsn;
+	return 0;
 }
 
 void
 xfs_ail_delete_finish(
 	struct xfs_ail		*ailp,
-	bool			do_tail_update) __releases(ailp->ail_lock)
+	xfs_lsn_t		old_lsn) __releases(ailp->ail_lock)
 {
 	struct xfs_mount	*mp = ailp->ail_mount;
 
-	if (!do_tail_update) {
+	/* if the tail lsn hasn't changed, don't do updates or wakeups. */
+	if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) {
 		spin_unlock(&ailp->ail_lock);
 		return;
 	}
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 5ab70b9b896f..db589bb7468d 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -92,8 +92,8 @@ xfs_trans_ail_update(
 	xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn);
 }
 
-bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
-void xfs_ail_delete_finish(struct xfs_ail *ailp, bool do_tail_update)
+xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
+void xfs_ail_delete_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
 			__releases(ailp->ail_lock);
 void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
 		int shutdown_type);
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (13 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 14/24] xfs: tail updates only need to occur when LSN changes Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-05 18:03   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 16/24] xfs: Lower CIL flush limit for large logs Dave Chinner
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

The CIL can pin a lot of memory and effectively defines the lower
free memory boundary of operation for XFS. The way we hang onto
log item shadow buffers "just in case" effectively doubles the
memory footprint of the CIL for dubious reasons.

That is, we hang onto the old shadow buffer in case the next time
we log the item it will fit into the shadow buffer and we won't have
to allocate a new one. However, we only ever tend to grow dirty
objects in the CIL through relogging, so once we've allocated a
larger buffer the old buffer we set as a shadow buffer will never
get reused as the amount we log never decreases until the item is
clean. And then for buffer items we free the log item and the shadow
buffers, anyway. Inode items will hold onto their shadow buffer
until they are reclaimed - this could double the inode's memory
footprint for its lifetime...

Hence we should just free the old log item buffer when we replace it
with a new shadow buffer rather than storing it for later use. It's
not useful, so get rid of it as early as possible.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index fa5602d0fd7f..1863a9bdf4a9 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -238,9 +238,7 @@ xfs_cil_prepare_item(
 	/*
 	 * If there is no old LV, this is the first time we've seen the item in
 	 * this CIL context and so we need to pin it. If we are replacing the
-	 * old_lv, then remove the space it accounts for and make it the shadow
-	 * buffer for later freeing. In both cases we are now switching to the
-	 * shadow buffer, so update the the pointer to it appropriately.
+	 * old_lv, then remove the space it accounts for and free it.
 	 */
 	if (!old_lv) {
 		if (lv->lv_item->li_ops->iop_pin)
@@ -251,7 +249,8 @@ xfs_cil_prepare_item(
 
 		*diff_len -= old_lv->lv_bytes;
 		*diff_iovecs -= old_lv->lv_niovecs;
-		lv->lv_item->li_lv_shadow = old_lv;
+		kmem_free(old_lv);
+		lv->lv_item->li_lv_shadow = NULL;
 	}
 
 	/* attach new log vector to log item */
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 16/24] xfs: Lower CIL flush limit for large logs
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (14 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-04 17:12   ` Nikolay Borisov
  2019-08-01  2:17 ` [PATCH 17/24] xfs: don't block kswapd in inode reclaim Dave Chinner
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

The current CIL size aggregation limit is 1/8th the log size. This
means for large logs we might be aggregating at least 250MB of dirty objects
in memory before the CIL is flushed to the journal. With CIL shadow
buffers sitting around, this means the CIL is often consuming >500MB
of temporary memory that is all allocated under GFP_NOFS conditions.

Flushing the CIL can take some time to do if there is other IO
ongoing, and can introduce substantial log force latency by itself.
It also pins the memory until the objects are in the AIL and can be
written back and reclaimed by shrinkers. Hence this threshold also
tends to determine the minimum amount of memory XFS can operate in
under heavy modification without triggering the OOM killer.

Modify the CIL space limit to prevent such huge amounts of pinned
metadata from aggregating. We can have 2MB of log IO in flight at
once, so limit aggregation to 8x this size (arbitrary). This has some
impact on performance (5-10% decrease on 16-way fsmark) and
increases the amount of log traffic (~50% on same workload) but it
is necessary to prevent rampant OOM killing under workloads that
modify large amounts of metadata under heavy memory pressure.
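
As a rough worked example from the numbers above: a 2GB log currently
allows the CIL to aggregate 2GB / 8 = 256MB of dirty objects (the
">250MB" figure above, and roughly double that once shadow buffers
are counted), while capping the limit at 8 x 2MB of in-flight log IO
brings that down to about 16MB, so checkpoints are pushed out long
before the CIL can pin hundreds of megabytes of memory.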

This was found via trace analysis of AIL behaviour, e.g. insertion
from a single CIL flush:

xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL

$ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
1721823
$

So there were 1.7 million objects inserted into the AIL from this
CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which
was the end of the trace (i.e. it hadn't finished). Clearly a major
problem.

XXX: Need to try bigger sizes to see where the performance/stability
boundary lies, and whether some of the losses can be regained and the
increase in log bandwidth minimised.

XXX: Ideally this threshold should slide with memory pressure. We
can allow large amounts of metadata to build up when there is no
memory pressure, but then close the window as memory pressure builds
up to reduce the footprint of the CIL until memory pressure passes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_priv.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b880c23cb6e4..87c6191daef7 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -329,7 +329,8 @@ struct xfs_cil {
  * enforced to ensure we stay within our maximum checkpoint size bounds.
  * threshold, yet give us plenty of space for aggregation on large logs.
  */
-#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
+#define XLOG_CIL_SPACE_LIMIT(log)	\
+	min_t(int, (log)->l_logsize >> 3, XLOG_TOTAL_REC_SHIFT(log) << 3)
 
 /*
  * ticket grant locks, queues and accounting have their own cachlines
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 17/24] xfs: don't block kswapd in inode reclaim
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (15 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 16/24] xfs: Lower CIL flush limit for large logs Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-06 18:21   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 18/24] xfs: reduce kswapd blocking on inode locking Dave Chinner
                   ` (7 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

We have a number of reasons for blocking kswapd in XFS inode
reclaim, mainly all to do with the fact that memory reclaim has no
feedback mechanisms to throttle on dirty slab objects that need IO
to reclaim.

As a result, we currently throttle inode reclaim by issuing IO in
the reclaim context. The unfortunate side effect of this is that it
can cause long tail latencies in reclaim and for some workloads this
can be a problem.

Now that the shrinkers finally have a method of telling kswapd to
back off, we can start the process of making inode reclaim in XFS
non-blocking. The first thing we need to do is stop blocking kswapd,
but so that this doesn't cause immediate serious problems, make sure
inode writeback is always underway when kswapd is running.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0b0fd10a36d4..2fa2f8dcf86b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1378,11 +1378,22 @@ xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
-	/* kick background reclaimer and push the AIL */
+	int			sync_mode = SYNC_TRYLOCK;
+
+	/* kick background reclaimer */
 	xfs_reclaim_work_queue(mp);
-	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	/*
+	 * For kswapd, we kick background inode writeback. For direct
+	 * reclaim, we issue and wait on inode writeback to throttle
+	 * reclaim rates and avoid shouty OOM-death.
+	 */
+	if (current_is_kswapd())
+		xfs_ail_push_all(mp->m_ail);
+	else
+		sync_mode |= SYNC_WAIT;
+
+	return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
 }
 
 /*
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 18/24] xfs: reduce kswapd blocking on inode locking.
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (16 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 17/24] xfs: don't block kswapd in inode reclaim Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-06 18:22   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 19/24] xfs: kill background reclaim work Dave Chinner
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When doing async inode reclaim, we grab a batch of inodes that we
are likely able to reclaim and ignore those that are already
flushing. However, when we actually go to reclaim them, the first
thing we do is lock the inode. If we are racing with something
else reclaiming the inode or flushing it because it is dirty,
we block on the inode lock. Hence we can still block kswapd here.

Further, if we flush an inode, we also cluster all the other dirty
inodes in that cluster into the same IO, flush locking them all.
However, if the workload is operating on sequential inodes (e.g.
created by a tarball extraction) most of these inodes will be
sequential in the cache and so in the same batch
we've already grabbed for reclaim scanning.

As a result, it is common for all the inodes in the batch to be
dirty and it is common for the first inode flushed to also flush all
the inodes in the reclaim batch. In which case, they are now all
going to be flush locked and we do not want to block on them.

Hence, for async reclaim (SYNC_TRYLOCK) make sure we always use
trylock semantics and abort reclaim of an inode as quickly as we can
without blocking kswapd.

Found via tracing and finding big batches of repeated lock/unlock
runs on inodes that we just flushed by write clustering during
reclaim.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2fa2f8dcf86b..e6b9030875b9 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1104,11 +1104,23 @@ xfs_reclaim_inode(
 
 restart:
 	error = 0;
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (!xfs_iflock_nowait(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
+	/*
+	 * Don't try to flush the inode if another inode in this cluster has
+	 * already flushed it after we did the initial checks in
+	 * xfs_reclaim_inode_grab().
+	 */
+	if (sync_mode & SYNC_TRYLOCK) {
+		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 			goto out;
-		xfs_iflock(ip);
+		if (!xfs_iflock_nowait(ip))
+			goto out_unlock;
+	} else {
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		if (!xfs_iflock_nowait(ip)) {
+			if (!(sync_mode & SYNC_WAIT))
+				goto out_unlock;
+			xfs_iflock(ip);
+		}
 	}
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
@@ -1215,9 +1227,10 @@ xfs_reclaim_inode(
 
 out_ifunlock:
 	xfs_ifunlock(ip);
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	/*
 	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
 	 * a short while. However, this just burns CPU time scanning the tree
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 19/24] xfs: kill background reclaim work
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (17 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 18/24] xfs: reduce kswapd blocking on inode locking Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01  2:17 ` [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO Dave Chinner
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

This function is now entirely done by kswapd, so we don't need the
worker thread to do async reclaim anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 44 --------------------------------------------
 fs/xfs/xfs_icache.h |  2 --
 fs/xfs/xfs_mount.c  |  2 --
 fs/xfs/xfs_mount.h  |  2 --
 fs/xfs/xfs_super.c  | 11 +----------
 5 files changed, 1 insertion(+), 60 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e6b9030875b9..0bd4420a7e16 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -138,44 +138,6 @@ xfs_inode_free(
 	__xfs_inode_free(ip);
 }
 
-/*
- * Queue a new inode reclaim pass if there are reclaimable inodes and there
- * isn't a reclaim pass already in progress. By default it runs every 5s based
- * on the xfs periodic sync default of 30s. Perhaps this should have it's own
- * tunable, but that can be done if this method proves to be ineffective or too
- * aggressive.
- */
-static void
-xfs_reclaim_work_queue(
-	struct xfs_mount        *mp)
-{
-
-	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
-		queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work,
-			msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
-	}
-	rcu_read_unlock();
-}
-
-/*
- * This is a fast pass over the inode cache to try to get reclaim moving on as
- * many inodes as possible in a short period of time. It kicks itself every few
- * seconds, as well as being kicked by the inode cache shrinker when memory
- * goes low. It scans as quickly as possible avoiding locked inodes or those
- * already being flushed, and once done schedules a future pass.
- */
-void
-xfs_reclaim_worker(
-	struct work_struct *work)
-{
-	struct xfs_mount *mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_reclaim_work);
-
-	xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
-	xfs_reclaim_work_queue(mp);
-}
-
 static void
 xfs_perag_set_reclaim_tag(
 	struct xfs_perag	*pag)
@@ -192,9 +154,6 @@ xfs_perag_set_reclaim_tag(
 			   XFS_ICI_RECLAIM_TAG);
 	spin_unlock(&mp->m_perag_lock);
 
-	/* schedule periodic background inode reclaim */
-	xfs_reclaim_work_queue(mp);
-
 	trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
 }
 
@@ -1393,9 +1352,6 @@ xfs_reclaim_inodes_nr(
 {
 	int			sync_mode = SYNC_TRYLOCK;
 
-	/* kick background reclaimer */
-	xfs_reclaim_work_queue(mp);
-
 	/*
 	 * For kswapd, we kick background inode writeback. For direct
 	 * reclaim, we issue and wait on inode writeback to throttle
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 48f1fd2bb6ad..4c0d8920cc54 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -49,8 +49,6 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
-void xfs_reclaim_worker(struct work_struct *work);
-
 int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 322da6909290..a1805021c92f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -988,7 +988,6 @@ xfs_mountfs(
 	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
 	 * quota inodes.
 	 */
-	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp, SYNC_WAIT);
 	xfs_health_unmount(mp);
  out_log_dealloc:
@@ -1071,7 +1070,6 @@ xfs_unmountfs(
 	 * reclaim just to be sure. We can stop background inode reclaim
 	 * here as well if it is still running.
 	 */
-	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp, SYNC_WAIT);
 	xfs_health_unmount(mp);
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index fdb60e09a9c5..f0cc952ad527 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -165,7 +165,6 @@ typedef struct xfs_mount {
 	uint			m_chsize;	/* size of next field */
 	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
-	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
 						     trimming */
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
@@ -182,7 +181,6 @@ typedef struct xfs_mount {
 	struct workqueue_struct *m_buf_workqueue;
 	struct workqueue_struct	*m_unwritten_workqueue;
 	struct workqueue_struct	*m_cil_workqueue;
-	struct workqueue_struct	*m_reclaim_workqueue;
 	struct workqueue_struct *m_eofblocks_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 67b59815d0df..09e41c6c1794 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -822,15 +822,10 @@ xfs_init_mount_workqueues(
 	if (!mp->m_cil_workqueue)
 		goto out_destroy_unwritten;
 
-	mp->m_reclaim_workqueue = alloc_workqueue("xfs-reclaim/%s",
-			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
-	if (!mp->m_reclaim_workqueue)
-		goto out_destroy_cil;
-
 	mp->m_eofblocks_workqueue = alloc_workqueue("xfs-eofblocks/%s",
 			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
 	if (!mp->m_eofblocks_workqueue)
-		goto out_destroy_reclaim;
+		goto out_destroy_cil;
 
 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", WQ_FREEZABLE, 0,
 					       mp->m_fsname);
@@ -841,8 +836,6 @@ xfs_init_mount_workqueues(
 
 out_destroy_eofb:
 	destroy_workqueue(mp->m_eofblocks_workqueue);
-out_destroy_reclaim:
-	destroy_workqueue(mp->m_reclaim_workqueue);
 out_destroy_cil:
 	destroy_workqueue(mp->m_cil_workqueue);
 out_destroy_unwritten:
@@ -859,7 +852,6 @@ xfs_destroy_mount_workqueues(
 {
 	destroy_workqueue(mp->m_sync_workqueue);
 	destroy_workqueue(mp->m_eofblocks_workqueue);
-	destroy_workqueue(mp->m_reclaim_workqueue);
 	destroy_workqueue(mp->m_cil_workqueue);
 	destroy_workqueue(mp->m_unwritten_workqueue);
 	destroy_workqueue(mp->m_buf_workqueue);
@@ -1557,7 +1549,6 @@ xfs_mount_alloc(
 	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
 	atomic_set(&mp->m_active_trans, 0);
-	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
 	INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (18 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 19/24] xfs: kill background reclaim work Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-07 18:09   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 21/24] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim currently issues its own inode IO when it comes
across dirty inodes. This is used to throttle direct reclaim down to
the rate at which we can reclaim dirty inodes. Failure to throttle
in this manner results in the OOM killer being trivial to trigger
even when there is lots of free memory available.

However, having direct reclaimers issue IO causes an amount of
IO thrashing to occur. We can have up to the number of AGs in the
filesystem concurrently issuing IO, plus the AIL pushing thread as
well. This means we can have many competing sources of IO and they all
end up thrashing and competing for the request slots in the block
device.

Similar to dirty page throttling and the BDI flusher thread, we can
use the AIL pushing thread as the sole place we issue inode writeback
from, and everything else waits for it to make progress. To do this,
reclaim will skip over dirty inodes, but in doing so will record the
lowest LSN of all the dirty inodes it skips. It will then push the
AIL to this LSN and wait for it to complete that work.

In doing so, we block direct reclaim on the completion of at least one IO,
thereby providing some level of throttling for when we encounter
dirty inodes. However we gain the ability to scan and reclaim
clean inodes in a non-blocking fashion. This allows us to
remove all the per-ag reclaim locking that avoids excessive direct
reclaim, as repeated concurrent direct reclaim will hit the same
dirty inodes and block waiting on the same IO to complete.

Hence direct reclaim will be throttled directly by the rate at which
dirty inodes are cleaned by AIL pushing, rather than by delays
caused by competing IO submissions. This allows us to remove all the
locking that limits direct reclaim concurrency and greatly
simplifies the inode reclaim code now that it just skips dirty
inodes.
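
Conceptually, the per-AG reclaim loop ends up looking like the sketch
below (condensed from the changes in this patch;
for_each_reclaim_candidate() is a made-up placeholder for the
existing radix tree batch walk):

	xfs_lsn_t	lsn, lowest_lsn = NULLCOMMITLSN;
	long		freed = 0;

	for_each_reclaim_candidate(ip) {
		if (xfs_reclaim_inode(ip, pag, &lsn))
			freed++;
		/* remember the oldest dirty inode we had to skip */
		if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
			lowest_lsn = lsn;
	}

	/* throttle on writeback of the oldest dirty inode we skipped */
	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
		xfs_ail_push_sync(mp->m_ail, lowest_lsn);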

Note: this patch by itself isn't completely able to throttle direct
reclaim sufficiently to prevent OOM killer madness. We can't do that
until we change the way we index reclaimable inodes in the next
patch and can feed back state to the mm core sanely.  However, we
can't change the way we index reclaimable inodes until we have
IO-less non-blocking reclaim for both direct reclaim and kswapd
reclaim.  Catch-22...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c    | 208 +++++++++++++++++------------------------
 fs/xfs/xfs_mount.c     |   4 -
 fs/xfs/xfs_mount.h     |   1 -
 fs/xfs/xfs_trans_ail.c |   4 +-
 4 files changed, 90 insertions(+), 127 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0bd4420a7e16..4c4c5bc12147 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -22,6 +22,7 @@
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "xfs_log.h"
 
 #include <linux/iversion.h>
 
@@ -967,28 +968,42 @@ xfs_inode_ag_iterator_tag(
 }
 
 /*
- * Grab the inode for reclaim exclusively.
- * Return 0 if we grabbed it, non-zero otherwise.
+ * Grab the inode for reclaim.
+ *
+ * Return false if we aren't going to reclaim it, true if it is a reclaim
+ * candidate.
+ *
+ * If the inode is clean or unreclaimable, return NULLCOMMITLSN to tell the
+ * caller it does not require flushing. Otherwise return the log item lsn of the
+ * inode so the caller can determine its inode flush target.  If we get the
+ * clean/dirty state wrong then it will be sorted in xfs_reclaim_inode() once we
+ * have locks held.
  */
-STATIC int
+STATIC bool
 xfs_reclaim_inode_grab(
 	struct xfs_inode	*ip,
-	int			flags)
+	int			flags,
+	xfs_lsn_t		*lsn)
 {
 	ASSERT(rcu_read_lock_held());
+	*lsn = 0;
 
 	/* quick check for stale RCU freed inode */
 	if (!ip->i_ino)
-		return 1;
+		return false;
 
 	/*
-	 * If we are asked for non-blocking operation, do unlocked checks to
-	 * see if the inode already is being flushed or in reclaim to avoid
-	 * lock traffic.
+	 * Do unlocked checks to see if the inode already is being flushed or in
+	 * reclaim to avoid lock traffic. If the inode is not clean, return
+	 * its position in the AIL for the caller to push to.
 	 */
-	if ((flags & SYNC_TRYLOCK) &&
-	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
-		return 1;
+	if (!xfs_inode_clean(ip)) {
+		*lsn = ip->i_itemp->ili_item.li_lsn;
+		return false;
+	}
+
+	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
+		return false;
 
 	/*
 	 * The radix tree lock here protects a thread in xfs_iget from racing
@@ -1005,11 +1020,11 @@ xfs_reclaim_inode_grab(
 	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
 		/* not a reclaim candidate. */
 		spin_unlock(&ip->i_flags_lock);
-		return 1;
+		return false;
 	}
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
 	spin_unlock(&ip->i_flags_lock);
-	return 0;
+	return true;
 }
 
 /*
@@ -1050,92 +1065,67 @@ xfs_reclaim_inode_grab(
  *	clean		=> reclaim
  *	dirty, async	=> requeue
  *	dirty, sync	=> flush, wait and reclaim
+ *
+ * Returns true if the inode was reclaimed, false otherwise.
  */
-STATIC int
+STATIC bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
-	int			sync_mode)
+	xfs_lsn_t		*lsn)
 {
-	struct xfs_buf		*bp = NULL;
-	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
-	int			error;
+	xfs_ino_t		ino;
+
+	*lsn = 0;
 
-restart:
-	error = 0;
 	/*
 	 * Don't try to flush the inode if another inode in this cluster has
 	 * already flushed it after we did the initial checks in
 	 * xfs_reclaim_inode_grab().
 	 */
-	if (sync_mode & SYNC_TRYLOCK) {
-		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
-			goto out;
-		if (!xfs_iflock_nowait(ip))
-			goto out_unlock;
-	} else {
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		if (!xfs_iflock_nowait(ip)) {
-			if (!(sync_mode & SYNC_WAIT))
-				goto out_unlock;
-			xfs_iflock(ip);
-		}
-	}
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+		goto out;
+	if (!xfs_iflock_nowait(ip))
+		goto out_unlock;
 
+	/* If we are in shutdown, we don't care about blocking. */
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
 		/* xfs_iflush_abort() drops the flush lock */
 		xfs_iflush_abort(ip, false);
 		goto reclaim;
 	}
-	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out_ifunlock;
-		xfs_iunpin_wait(ip);
-	}
-	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		goto reclaim;
-	}
 
 	/*
-	 * Never flush out dirty data during non-blocking reclaim, as it would
-	 * just contend with AIL pushing trying to do the same job.
+	 * If it is pinned, we only want to flush this if there's nothing else
+	 * to be flushed as it requires a log force. Hence we essentially set
+	 * the LSN to flush the entire AIL which will end up triggering a log
+	 * force to unpin this inode, but that will only happen if there are not
+	 * other inodes in the scan that only need writeback.
 	 */
-	if (!(sync_mode & SYNC_WAIT))
+	if (xfs_ipincount(ip)) {
+		*lsn = ip->i_itemp->ili_last_lsn;
 		goto out_ifunlock;
+	}
 
 	/*
-	 * Now we have an inode that needs flushing.
-	 *
-	 * Note that xfs_iflush will never block on the inode buffer lock, as
-	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
-	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
-	 * doing a blocking xfs_imap_to_bp() to get the cluster buffer would
-	 * result in an ABBA deadlock with xfs_ifree_cluster().
-	 *
-	 * As xfs_ifree_cluser() must gather all inodes that are active in the
-	 * cache to mark them stale, if we hit this case we don't actually want
-	 * to do IO here - we want the inode marked stale so we can simply
-	 * reclaim it.  Hence if we get an EAGAIN error here,  just unlock the
-	 * inode, back off and try again.  Hopefully the next pass through will
-	 * see the stale flag set on the inode.
+	 * Dirty inode we didn't catch, skip it.
 	 */
-	error = xfs_iflush(ip, &bp);
-	if (error == -EAGAIN) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		/* backoff longer than in xfs_ifree_cluster */
-		delay(2);
-		goto restart;
+	if (!xfs_inode_clean(ip) && !xfs_iflags_test(ip, XFS_ISTALE)) {
+		*lsn = ip->i_itemp->ili_item.li_lsn;
+		goto out_ifunlock;
 	}
 
-	if (!error) {
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-	}
+	/*
+	 * It's clean, we have it locked, we can now drop the flush lock
+	 * and reclaim it.
+	 */
+	xfs_ifunlock(ip);
 
 reclaim:
 	ASSERT(!xfs_isiflocked(ip));
+	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE));
+	ASSERT(ip->i_ino != 0);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1148,6 +1138,7 @@ xfs_reclaim_inode(
 	 * will see an invalid inode that it can skip.
 	 */
 	spin_lock(&ip->i_flags_lock);
+	ino = ip->i_ino; /* for radix_tree_delete */
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
@@ -1182,7 +1173,7 @@ xfs_reclaim_inode(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	__xfs_inode_free(ip);
-	return error;
+	return true;
 
 out_ifunlock:
 	xfs_ifunlock(ip);
@@ -1190,14 +1181,7 @@ xfs_reclaim_inode(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	/*
-	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
-	 * a short while. However, this just burns CPU time scanning the tree
-	 * waiting for IO to complete and the reclaim work never goes back to
-	 * the idle state. Instead, return 0 to let the next scheduled
-	 * background reclaim attempt to reclaim the inode again.
-	 */
-	return 0;
+	return false;
 }
 
 /*
@@ -1205,39 +1189,28 @@ xfs_reclaim_inode(
  * corrupted, we still want to try to reclaim all the inodes. If we don't,
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
+ *
+ * Return the number of inodes freed.
  */
 STATIC int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
-	int			*nr_to_scan)
+	int			nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	int			error = 0;
-	int			last_error = 0;
 	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
-	int			skipped;
+	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
+	long			freed = 0;
 
-restart:
 	ag = 0;
-	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
 		int		nr_found = 0;
 
 		ag = pag->pag_agno + 1;
-
-		if (trylock) {
-			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
-				xfs_perag_put(pag);
-				continue;
-			}
-			first_index = pag->pag_ici_reclaim_cursor;
-		} else
-			mutex_lock(&pag->pag_ici_reclaim_lock);
+		first_index = pag->pag_ici_reclaim_cursor;
 
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
@@ -1262,9 +1235,13 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done ||
+				    !xfs_reclaim_inode_grab(ip, flags, &lsn))
 					batch[i] = NULL;
 
+				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
+					lowest_lsn = lsn;
+
 				/*
 				 * Update the index for the next lookup. Catch
 				 * overflows into the next AG range which can
@@ -1293,37 +1270,28 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				error = xfs_reclaim_inode(batch[i], pag, flags);
-				if (error && last_error != -EFSCORRUPTED)
-					last_error = error;
+				if (xfs_reclaim_inode(batch[i], pag, &lsn))
+					freed++;
+				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
+					lowest_lsn = lsn;
 			}
 
-			*nr_to_scan -= XFS_LOOKUP_BATCH;
-
+			nr_to_scan -= XFS_LOOKUP_BATCH;
 			cond_resched();
 
-		} while (nr_found && !done && *nr_to_scan > 0);
+		} while (nr_found && !done && nr_to_scan > 0);
 
-		if (trylock && !done)
+		if (!done)
 			pag->pag_ici_reclaim_cursor = first_index;
 		else
 			pag->pag_ici_reclaim_cursor = 0;
-		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
 
-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
-	return last_error;
+	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
+		xfs_ail_push_sync(mp->m_ail, lowest_lsn);
+
+	return freed;
 }
 
 int
@@ -1331,9 +1299,7 @@ xfs_reclaim_inodes(
 	xfs_mount_t	*mp,
 	int		mode)
 {
-	int		nr_to_scan = INT_MAX;
-
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, mode, INT_MAX);
 }
 
 /*
@@ -1350,7 +1316,7 @@ xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
-	int			sync_mode = SYNC_TRYLOCK;
+	int			sync_mode = 0;
 
 	/*
 	 * For kswapd, we kick background inode writeback. For direct
@@ -1362,7 +1328,7 @@ xfs_reclaim_inodes_nr(
 	else
 		sync_mode |= SYNC_WAIT;
 
-	return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index a1805021c92f..bcf8f64d1b1f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -148,7 +148,6 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -200,7 +199,6 @@ xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
-		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
@@ -242,7 +240,6 @@ xfs_initialize_perag(
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
 out_free_pag:
-	mutex_destroy(&pag->pag_ici_reclaim_lock);
 	kmem_free(pag);
 out_unwind_new_pags:
 	/* unwind any prior newly initialized pags */
@@ -252,7 +249,6 @@ xfs_initialize_perag(
 			break;
 		xfs_buf_hash_destroy(pag);
 		xfs_iunlink_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f0cc952ad527..2049e764faed 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -383,7 +383,6 @@ typedef struct xfs_perag {
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
-	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 00d66175f41a..5802139f786b 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -676,8 +676,10 @@ xfs_ail_push_sync(
 	spin_lock(&ailp->ail_lock);
 	while ((lip = xfs_ail_min(ailp)) != NULL) {
 		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
+	trace_printk("lip lsn 0x%llx thres 0x%llx targ 0x%llx",
+			lip->li_lsn, threshold_lsn, ailp->ail_target);
 		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
-		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
+		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
 			break;
 		/* XXX: cmpxchg? */
 		while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 21/24] xfs: remove mode from xfs_reclaim_inodes()
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (19 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-01  2:17 ` [PATCH 22/24] xfs: track reclaimable inodes using a LRU list Dave Chinner
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Because it's always SYNC_WAIT now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 7 +++----
 fs/xfs/xfs_icache.h | 2 +-
 fs/xfs/xfs_mount.c  | 4 ++--
 fs/xfs/xfs_super.c  | 3 +--
 4 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4c4c5bc12147..aaa1f840a86c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1294,12 +1294,11 @@ xfs_reclaim_inodes_ag(
 	return freed;
 }
 
-int
+void
 xfs_reclaim_inodes(
-	xfs_mount_t	*mp,
-	int		mode)
+	struct xfs_mount	*mp)
 {
-	return xfs_reclaim_inodes_ag(mp, mode, INT_MAX);
+	xfs_reclaim_inodes_ag(mp, SYNC_WAIT, INT_MAX);
 }
 
 /*
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 4c0d8920cc54..1c9b9edb2986 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -49,7 +49,7 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+void xfs_reclaim_inodes(struct xfs_mount *mp);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bcf8f64d1b1f..e851b9cfbabd 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -984,7 +984,7 @@ xfs_mountfs(
 	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
 	 * quota inodes.
 	 */
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
  out_log_dealloc:
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
@@ -1066,7 +1066,7 @@ xfs_unmountfs(
 	 * reclaim just to be sure. We can stop background inode reclaim
 	 * here as well if it is still running.
 	 */
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
 
 	xfs_qm_unmount(mp);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 09e41c6c1794..a59d3a21be5c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1179,8 +1179,7 @@ xfs_quiesce_attr(
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
 	/* reclaim inodes to do any IO before the freeze completes */
-	xfs_reclaim_inodes(mp, 0);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 
 	/* Push the superblock and write an unmount record */
 	error = xfs_log_sbcount(mp);
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 22/24] xfs: track reclaimable inodes using a LRU list
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (20 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 21/24] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-08 16:36   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 23/24] xfs: reclaim inodes from the LRU Dave Chinner
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Now that we don't do IO from the inode reclaim code, there is no
need to optimise inode scanning order for optimal IO
characteristics. The AIL takes care of that for us, so now reclaim
can focus on selecting the best inodes to reclaim.

Hence we can change the inode reclaim algorithm to a real LRU and
remove the need to use the radix tree to track and walk inodes under
reclaim. This frees up a radix tree bit and simplifies the code that
marks inodes as reclaim candidates. It also simplifies the reclaim
code - we don't need batching anymore and all the reclaim logic
can be added to the LRU isolation callback.

Further, we get node aware reclaim at the xfs_inode level, which
should help the per-node reclaim code free relevant inodes faster.

We can re-use the VFS inode lru pointers - once the inode has been
reclaimed from the VFS, we can use these pointers ourselves. Hence
we don't need to grow the inode to change the way we index
reclaimable inodes.

Start by adding the list_lru tracking in parallel with the existing
reclaim code. This makes it easier to see the LRU infrastructure
separate from the reclaim algorithm changes, especially the locking
order, which is ip->i_flags_lock -> list_lru lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 31 +++++++------------------------
 fs/xfs/xfs_icache.h |  1 -
 fs/xfs/xfs_mount.h  |  1 +
 fs/xfs/xfs_super.c  | 31 ++++++++++++++++++++++++-------
 4 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index aaa1f840a86c..610f643df9f6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -370,12 +370,11 @@ xfs_iget_cache_hit(
 
 		/*
 		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
-		 * from stomping over us while we recycle the inode.  We can't
-		 * clear the radix tree reclaimable tag yet as it requires
-		 * pag_ici_lock to be held exclusive.
+		 * from stomping over us while we recycle the inode. Remove it
+		 * from the LRU straight away so we can re-init the VFS inode.
 		 */
 		ip->i_flags |= XFS_IRECLAIM;
-
+		list_lru_del(&mp->m_inode_lru, &inode->i_lru);
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 
@@ -390,6 +389,7 @@ xfs_iget_cache_hit(
 			spin_lock(&ip->i_flags_lock);
 			wake = !!__xfs_iflags_test(ip, XFS_INEW);
 			ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
+			list_lru_add(&mp->m_inode_lru, &inode->i_lru);
 			if (wake)
 				wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
 			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
@@ -1141,6 +1141,9 @@ xfs_reclaim_inode(
 	ino = ip->i_ino; /* for radix_tree_delete */
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
+
+	/* XXX: temporary until lru based reclaim */
+	list_lru_del(&pag->pag_mount->m_inode_lru, &VFS_I(ip)->i_lru);
 	spin_unlock(&ip->i_flags_lock);
 
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1330,26 +1333,6 @@ xfs_reclaim_inodes_nr(
 	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
 }
 
-/*
- * Return the number of reclaimable inodes in the filesystem for
- * the shrinker to determine how much to reclaim.
- */
-int
-xfs_reclaim_inodes_count(
-	struct xfs_mount	*mp)
-{
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		ag = 0;
-	int			reclaimable = 0;
-
-	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
-		ag = pag->pag_agno + 1;
-		reclaimable += pag->pag_ici_reclaimable;
-		xfs_perag_put(pag);
-	}
-	return reclaimable;
-}
-
 STATIC int
 xfs_inode_match_id(
 	struct xfs_inode	*ip,
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 1c9b9edb2986..0ab08b58cd45 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -50,7 +50,6 @@ struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
 void xfs_reclaim_inodes(struct xfs_mount *mp);
-int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 2049e764faed..4a4ecbc22246 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -75,6 +75,7 @@ typedef struct xfs_mount {
 	uint8_t			m_rt_sick;
 
 	struct xfs_ail		*m_ail;		/* fs active log item list */
+	struct list_lru		m_inode_lru;
 
 	struct xfs_sb		m_sb;		/* copy of fs superblock */
 	spinlock_t		m_sb_lock;	/* sb counter lock */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a59d3a21be5c..b5c4c1b6fd19 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -919,28 +919,30 @@ xfs_fs_destroy_inode(
 	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
 
 	trace_xfs_destroy_inode(ip);
 
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-	XFS_STATS_INC(ip->i_mount, vn_rele);
-	XFS_STATS_INC(ip->i_mount, vn_remove);
+	XFS_STATS_INC(mp, vn_rele);
+	XFS_STATS_INC(mp, vn_remove);
 
 	xfs_inactive(ip);
 
-	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
+	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
 		xfs_check_delalloc(ip, XFS_COW_FORK);
 		ASSERT(0);
 	}
 
-	XFS_STATS_INC(ip->i_mount, vn_reclaim);
+	XFS_STATS_INC(mp, vn_reclaim);
 
 	/*
 	 * We should never get here with one of the reclaim flags already set.
 	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+	spin_lock(&ip->i_flags_lock);
+	ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+	ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIM));
 
 	/*
 	 * We always use background reclaim here because even if the
@@ -949,6 +951,9 @@ xfs_fs_destroy_inode(
 	 * this more efficiently than we can here, so simply let background
 	 * reclaim tear down all inodes.
 	 */
+	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+	list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
+	spin_unlock(&ip->i_flags_lock);
 	xfs_inode_set_reclaim_tag(ip);
 }
 
@@ -1541,6 +1546,15 @@ xfs_mount_alloc(
 	if (!mp)
 		return NULL;
 
+	/*
+	 * The inode lru needs to be associated with the superblock shrinker,
+	 * and like the rest of the superblock shrinker, it's memcg aware.
+	 */
+	if (list_lru_init_memcg(&mp->m_inode_lru, &sb->s_shrink)) {
+		kfree(mp);
+		return NULL;
+	}
+
 	mp->m_super = sb;
 	spin_lock_init(&mp->m_sb_lock);
 	spin_lock_init(&mp->m_agirotor_lock);
@@ -1748,6 +1762,7 @@ xfs_fs_fill_super(
  out_free_fsname:
 	sb->s_fs_info = NULL;
 	xfs_free_fsname(mp);
+	list_lru_destroy(&mp->m_inode_lru);
 	kfree(mp);
  out:
 	return error;
@@ -1780,6 +1795,7 @@ xfs_fs_put_super(
 
 	sb->s_fs_info = NULL;
 	xfs_free_fsname(mp);
+	list_lru_destroy(&mp->m_inode_lru);
 	kfree(mp);
 }
 
@@ -1801,7 +1817,8 @@ xfs_fs_nr_cached_objects(
 	/* Paranoia: catch incorrect calls during mount setup or teardown */
 	if (WARN_ON_ONCE(!sb->s_fs_info))
 		return 0;
-	return xfs_reclaim_inodes_count(XFS_M(sb));
+
+	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
 }
 
 static long
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 23/24] xfs: reclaim inodes from the LRU
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (21 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 22/24] xfs: track reclaimable inodes using a LRU list Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-08 16:39   ` Brian Foster
  2019-08-01  2:17 ` [PATCH 24/24] xfs: remove unused old inode reclaim code Dave Chinner
  2019-08-06  5:57 ` [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Christoph Hellwig
  24 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Replace the AG radix tree walking reclaim code with a list_lru
walker, giving us both node-aware and memcg-aware inode reclaim
at the XFS level. This requires adding an inode isolation function to
determine if the inode can be reclaimed, and a list walker to
dispose of the inodes that were isolated.

We want the isolation function to be non-blocking. If we can't
grab an inode then we either skip it or rotate it. If it's clean
then we skip it; if it's dirty then we rotate it to give it time to be
cleaned before it is scanned again.

This congregates the dirty inodes at the tail of the LRU, which
means that if we start hitting a majority of dirty inodes, either
there are lots of unlinked inodes in the reclaim list or we've
reclaimed all the clean inodes and have looped back onto the dirty
inodes. Either way, this is an indication we should tell kswapd to
back off.

The non-blocking isolation function introduces a complexity for the
filesystem shutdown case. When the filesystem is shut down, we want
to free the inode even if it is dirty, and this may require
blocking. We already hold the locks needed to do this blocking, so
what we do is leave inodes locked - both the ILOCK and the
flush lock - while they are sitting on the dispose list to be freed
after the LRU walk completes.  This allows us to process the
shutdown state outside the LRU walk where we can block safely.

Keep in mind we don't have to care about inode lock order or
blocking with inode locks held here because a) we are using
trylocks, and b) once marked with XFS_IRECLAIM they can't be found
via the LRU and inode cache lookups will abort and retry. Hence
nobody will try to lock them in any other context that might also be
holding other inode locks.

Also convert xfs_reclaim_inodes() to use a LRU walk to free all
the reclaimable inodes in the filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 199 ++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_icache.h |  10 ++-
 fs/xfs/xfs_inode.h  |   8 ++
 fs/xfs/xfs_super.c  |  50 +++++++++--
 4 files changed, 232 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 610f643df9f6..891fe3795c8f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1195,7 +1195,7 @@ xfs_reclaim_inode(
  *
  * Return the number of inodes freed.
  */
-STATIC int
+int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
@@ -1297,40 +1297,185 @@ xfs_reclaim_inodes_ag(
 	return freed;
 }
 
-void
-xfs_reclaim_inodes(
-	struct xfs_mount	*mp)
+enum lru_status
+xfs_inode_reclaim_isolate(
+	struct list_head	*item,
+	struct list_lru_one	*lru,
+	spinlock_t		*lru_lock,
+	void			*arg)
 {
-	xfs_reclaim_inodes_ag(mp, SYNC_WAIT, INT_MAX);
+	struct xfs_ireclaim_args *ra = arg;
+	struct inode		*inode = container_of(item, struct inode, i_lru);
+	struct xfs_inode	*ip = XFS_I(inode);
+	enum lru_status		ret;
+	xfs_lsn_t		lsn = 0;
+
+	/* Careful: inversion of iflags_lock and everything else here */
+	if (!spin_trylock(&ip->i_flags_lock))
+		return LRU_SKIP;
+
+	ret = LRU_ROTATE;
+	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_unlock_flags;
+	}
+
+	ret = LRU_SKIP;
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+		goto out_unlock_flags;
+
+	if (!__xfs_iflock_nowait(ip)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_unlock_inode;
+	}
+
+	/* if we are in shutdown, we'll reclaim it even if dirty */
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
+		goto reclaim;
+
+	/*
+	 * Now the inode is locked, we can actually determine if it is dirty
+	 * without racing with anything.
+	 */
+	ret = LRU_ROTATE;
+	if (xfs_ipincount(ip)) {
+		ra->dirty_skipped++;
+		goto out_ifunlock;
+	}
+	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_ifunlock;
+	}
+
+reclaim:
+	/*
+	 * Once we mark the inode with XFS_IRECLAIM, no-one will grab it again.
+	 * RCU lookups will still find the inode, but they'll stop when they set
+	 * the IRECLAIM flag. Hence we can leave the inode locked as we move it
+	 * to the dispose list so we can deal with shutdown cleanup there
+	 * outside the LRU lock context.
+	 */
+	__xfs_iflags_set(ip, XFS_IRECLAIM);
+	list_lru_isolate_move(lru, &inode->i_lru, &ra->freeable);
+	spin_unlock(&ip->i_flags_lock);
+	return LRU_REMOVED;
+
+out_ifunlock:
+	xfs_ifunlock(ip);
+out_unlock_inode:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_unlock_flags:
+	spin_unlock(&ip->i_flags_lock);
+
+	if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
+		ra->lowest_lsn = lsn;
+	return ret;
 }
 
-/*
- * Scan a certain number of inodes for reclaim.
- *
- * When called we make sure that there is a background (fast) inode reclaim in
- * progress, while we will throttle the speed of reclaim via doing synchronous
- * reclaim of inodes. That means if we come across dirty inodes, we wait for
- * them to be cleaned, which we hope will not be very long due to the
- * background walker having already kicked the IO off on those dirty inodes.
- */
-long
-xfs_reclaim_inodes_nr(
-	struct xfs_mount	*mp,
-	int			nr_to_scan)
+static void
+xfs_dispose_inode(
+	struct xfs_inode	*ip)
 {
-	int			sync_mode = 0;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	xfs_ino_t		ino;
+
+	ASSERT(xfs_isiflocked(ip));
+	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE));
+	ASSERT(ip->i_ino != 0);
 
 	/*
-	 * For kswapd, we kick background inode writeback. For direct
-	 * reclaim, we issue and wait on inode writeback to throttle
-	 * reclaim rates and avoid shouty OOM-death.
+	 * Process the shutdown reclaim work we deferred from the LRU isolation
+	 * callback before we go any further.
 	 */
-	if (current_is_kswapd())
-		xfs_ail_push_all(mp->m_ail);
-	else
-		sync_mode |= SYNC_WAIT;
+	if (XFS_FORCED_SHUTDOWN(mp)) {
+		xfs_iunpin_wait(ip);
+		xfs_iflush_abort(ip, false);
+	} else {
+		xfs_ifunlock(ip);
+	}
 
-	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
+	/*
+	 * Because we use RCU freeing we need to ensure the inode always appears
+	 * to be reclaimed with an invalid inode number when in the free state.
+	 * We do this as early as possible under the ILOCK so that
+	 * xfs_iflush_cluster() and xfs_ifree_cluster() can be guaranteed to
+	 * detect races with us here. By doing this, we guarantee that once
+	 * xfs_iflush_cluster() or xfs_ifree_cluster() has locked XFS_ILOCK that
+	 * it will see either a valid inode that will serialise correctly, or it
+	 * will see an invalid inode that it can skip.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	ino = ip->i_ino; /* for radix_tree_delete */
+	ip->i_flags = XFS_IRECLAIM;
+	ip->i_ino = 0;
+	spin_unlock(&ip->i_flags_lock);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	XFS_STATS_INC(mp, xs_ig_reclaims);
+	/*
+	 * Remove the inode from the per-AG radix tree.
+	 *
+	 * Because radix_tree_delete won't complain even if the item was never
+	 * added to the tree assert that it's been there before to catch
+	 * problems with the inode life time early on.
+	 */
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino));
+	spin_lock(&pag->pag_ici_lock);
+	if (!radix_tree_delete(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ino)))
+		ASSERT(0);
+	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
+
+	/*
+	 * Here we do an (almost) spurious inode lock in order to coordinate
+	 * with inode cache radix tree lookups.  This is because the lookup
+	 * can reference the inodes in the cache without taking references.
+	 *
+	 * We make that OK here by ensuring that we wait until the inode is
+	 * unlocked after the lookup before we go ahead and free it.
+	 *
+	 * XXX: need to check this is still true. Not sure it is.
+	 */
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_qm_dqdetach(ip);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	__xfs_inode_free(ip);
+}
+
+void
+xfs_dispose_inodes(
+	struct list_head	*freeable)
+{
+	while (!list_empty(freeable)) {
+		struct inode *inode;
+
+		inode = list_first_entry(freeable, struct inode, i_lru);
+		list_del_init(&inode->i_lru);
+
+		xfs_dispose_inode(XFS_I(inode));
+		cond_resched();
+	}
+}
+void
+xfs_reclaim_inodes(
+	struct xfs_mount	*mp)
+{
+	while (list_lru_count(&mp->m_inode_lru)) {
+		struct xfs_ireclaim_args ra;
+
+		INIT_LIST_HEAD(&ra.freeable);
+		ra.lowest_lsn = NULLCOMMITLSN;
+		list_lru_walk(&mp->m_inode_lru, xfs_inode_reclaim_isolate,
+				&ra, LONG_MAX);
+		xfs_dispose_inodes(&ra.freeable);
+		if (ra.lowest_lsn != NULLCOMMITLSN)
+			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
+	}
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 0ab08b58cd45..dadc69a30f33 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -49,8 +49,16 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
+struct xfs_ireclaim_args {
+	struct list_head	freeable;
+	xfs_lsn_t		lowest_lsn;
+	unsigned long		dirty_skipped;
+};
+
+enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
+		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg);
+void xfs_dispose_inodes(struct list_head *freeable);
 void xfs_reclaim_inodes(struct xfs_mount *mp);
-long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 558173f95a03..463170dc4c02 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -263,6 +263,14 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
 
 extern void __xfs_iflock(struct xfs_inode *ip);
 
+static inline int __xfs_iflock_nowait(struct xfs_inode *ip)
+{
+	if (ip->i_flags & XFS_IFLOCK)
+		return false;
+	ip->i_flags |= XFS_IFLOCK;
+	return true;
+}
+
 static inline int xfs_iflock_nowait(struct xfs_inode *ip)
 {
 	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b5c4c1b6fd19..e3e898a2896c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -17,6 +17,7 @@
 #include "xfs_alloc.h"
 #include "xfs_fsops.h"
 #include "xfs_trans.h"
+#include "xfs_trans_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 #include "xfs_log_priv.h"
@@ -1810,23 +1811,58 @@ xfs_fs_mount(
 }
 
 static long
-xfs_fs_nr_cached_objects(
+xfs_fs_free_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
-	/* Paranoia: catch incorrect calls during mount setup or teardown */
-	if (WARN_ON_ONCE(!sb->s_fs_info))
-		return 0;
+	struct xfs_mount	*mp = XFS_M(sb);
+	struct xfs_ireclaim_args ra;
+	long freed;
 
-	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
+	INIT_LIST_HEAD(&ra.freeable);
+	ra.lowest_lsn = NULLCOMMITLSN;
+	ra.dirty_skipped = 0;
+
+	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
+					xfs_inode_reclaim_isolate, &ra);
+	xfs_dispose_inodes(&ra.freeable);
+
+	/*
+	 * Deal with dirty inodes if we skipped any. We will have the LSN of
+	 * the oldest dirty inode in our reclaim args if we skipped any.
+	 *
+	 * For direct reclaim, wait on an AIL push to clean some inodes.
+	 *
+	 * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than
+	 * we freed) then we need kswapd to back off once it's scan has been
+	 * completed. That way it will have some clean inodes once it comes back
+	 * and can make progress, but make sure we have inode cleaning in
+	 * progress....
+	 */
+	if (current_is_kswapd()) {
+		if (ra.dirty_skipped >= freed) {
+			if (current->reclaim_state)
+				current->reclaim_state->need_backoff = true;
+			if (ra.lowest_lsn != NULLCOMMITLSN)
+				xfs_ail_push(mp->m_ail, ra.lowest_lsn);
+		}
+	} else if (ra.lowest_lsn != NULLCOMMITLSN) {
+		xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
+	}
+
+	return freed;
 }
 
 static long
-xfs_fs_free_cached_objects(
+xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
-	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
+	/* Paranoia: catch incorrect calls during mount setup or teardown */
+	if (WARN_ON_ONCE(!sb->s_fs_info))
+		return 0;
+
+	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
 }
 
 static const struct super_operations xfs_super_operations = {
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 24/24] xfs: remove unused old inode reclaim code
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (22 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 23/24] xfs: reclaim inodes from the LRU Dave Chinner
@ 2019-08-01  2:17 ` Dave Chinner
  2019-08-06  5:57 ` [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Christoph Hellwig
  24 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  2:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

We don't use the custom AG radix tree walker, the reclaim radix tree
tag, the reclaimable inode counters, etc., so remove them all now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 410 +-------------------------------------------
 fs/xfs/xfs_icache.h |   7 +-
 fs/xfs/xfs_mount.h  |   2 -
 fs/xfs/xfs_super.c  |   1 -
 4 files changed, 3 insertions(+), 417 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 891fe3795c8f..ad04de119ac1 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -139,81 +139,6 @@ xfs_inode_free(
 	__xfs_inode_free(ip);
 }
 
-static void
-xfs_perag_set_reclaim_tag(
-	struct xfs_perag	*pag)
-{
-	struct xfs_mount	*mp = pag->pag_mount;
-
-	lockdep_assert_held(&pag->pag_ici_lock);
-	if (pag->pag_ici_reclaimable++)
-		return;
-
-	/* propagate the reclaim tag up into the perag radix tree */
-	spin_lock(&mp->m_perag_lock);
-	radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno,
-			   XFS_ICI_RECLAIM_TAG);
-	spin_unlock(&mp->m_perag_lock);
-
-	trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
-}
-
-static void
-xfs_perag_clear_reclaim_tag(
-	struct xfs_perag	*pag)
-{
-	struct xfs_mount	*mp = pag->pag_mount;
-
-	lockdep_assert_held(&pag->pag_ici_lock);
-	if (--pag->pag_ici_reclaimable)
-		return;
-
-	/* clear the reclaim tag from the perag radix tree */
-	spin_lock(&mp->m_perag_lock);
-	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno,
-			     XFS_ICI_RECLAIM_TAG);
-	spin_unlock(&mp->m_perag_lock);
-	trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
-}
-
-
-/*
- * We set the inode flag atomically with the radix tree tag.
- * Once we get tag lookups on the radix tree, this inode flag
- * can go away.
- */
-void
-xfs_inode_set_reclaim_tag(
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	spin_lock(&pag->pag_ici_lock);
-	spin_lock(&ip->i_flags_lock);
-
-	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			   XFS_ICI_RECLAIM_TAG);
-	xfs_perag_set_reclaim_tag(pag);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
-
-	spin_unlock(&ip->i_flags_lock);
-	spin_unlock(&pag->pag_ici_lock);
-	xfs_perag_put(pag);
-}
-
-STATIC void
-xfs_inode_clear_reclaim_tag(
-	struct xfs_perag	*pag,
-	xfs_ino_t		ino)
-{
-	radix_tree_tag_clear(&pag->pag_ici_root,
-			     XFS_INO_TO_AGINO(pag->pag_mount, ino),
-			     XFS_ICI_RECLAIM_TAG);
-	xfs_perag_clear_reclaim_tag(pag);
-}
-
 static void
 xfs_inew_wait(
 	struct xfs_inode	*ip)
@@ -397,17 +322,15 @@ xfs_iget_cache_hit(
 			goto out_error;
 		}
 
-		spin_lock(&pag->pag_ici_lock);
-		spin_lock(&ip->i_flags_lock);
 
 		/*
 		 * Clear the per-lifetime state in the inode as we are now
 		 * effectively a new inode and need to return to the initial
 		 * state before reuse occurs.
 		 */
+		spin_lock(&ip->i_flags_lock);
 		ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 		ip->i_flags |= XFS_INEW;
-		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
 		inode->i_state = I_NEW;
 		ip->i_sick = 0;
 		ip->i_checked = 0;
@@ -416,7 +339,6 @@ xfs_iget_cache_hit(
 		init_rwsem(&inode->i_rwsem);
 
 		spin_unlock(&ip->i_flags_lock);
-		spin_unlock(&pag->pag_ici_lock);
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -967,336 +889,6 @@ xfs_inode_ag_iterator_tag(
 	return last_error;
 }
 
-/*
- * Grab the inode for reclaim.
- *
- * Return false if we aren't going to reclaim it, true if it is a reclaim
- * candidate.
- *
- * If the inode is clean or unreclaimable, return NULLCOMMITLSN to tell the
- * caller it does not require flushing. Otherwise return the log item lsn of the
- * inode so the caller can determine it's inode flush target.  If we get the
- * clean/dirty state wrong then it will be sorted in xfs_reclaim_inode() once we
- * have locks held.
- */
-STATIC bool
-xfs_reclaim_inode_grab(
-	struct xfs_inode	*ip,
-	int			flags,
-	xfs_lsn_t		*lsn)
-{
-	ASSERT(rcu_read_lock_held());
-	*lsn = 0;
-
-	/* quick check for stale RCU freed inode */
-	if (!ip->i_ino)
-		return false;
-
-	/*
-	 * Do unlocked checks to see if the inode already is being flushed or in
-	 * reclaim to avoid lock traffic. If the inode is not clean, return the
-	 * it's position in the AIL for the caller to push to.
-	 */
-	if (!xfs_inode_clean(ip)) {
-		*lsn = ip->i_itemp->ili_item.li_lsn;
-		return false;
-	}
-
-	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
-		return false;
-
-	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode.  Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 *
-	 * Due to RCU lookup, we may find inodes that have been freed and only
-	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
-	 * aren't candidates for reclaim at all, so we must check the
-	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
-	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
-		/* not a reclaim candidate. */
-		spin_unlock(&ip->i_flags_lock);
-		return false;
-	}
-	__xfs_iflags_set(ip, XFS_IRECLAIM);
-	spin_unlock(&ip->i_flags_lock);
-	return true;
-}
-
-/*
- * Inodes in different states need to be treated differently. The following
- * table lists the inode states and the reclaim actions necessary:
- *
- *	inode state	     iflush ret		required action
- *      ---------------      ----------         ---------------
- *	bad			-		reclaim
- *	shutdown		EIO		unpin and reclaim
- *	clean, unpinned		0		reclaim
- *	stale, unpinned		0		reclaim
- *	clean, pinned(*)	0		requeue
- *	stale, pinned		EAGAIN		requeue
- *	dirty, async		-		requeue
- *	dirty, sync		0		reclaim
- *
- * (*) dgc: I don't think the clean, pinned state is possible but it gets
- * handled anyway given the order of checks implemented.
- *
- * Also, because we get the flush lock first, we know that any inode that has
- * been flushed delwri has had the flush completed by the time we check that
- * the inode is clean.
- *
- * Note that because the inode is flushed delayed write by AIL pushing, the
- * flush lock may already be held here and waiting on it can result in very
- * long latencies.  Hence for sync reclaims, where we wait on the flush lock,
- * the caller should push the AIL first before trying to reclaim inodes to
- * minimise the amount of time spent waiting.  For background relaim, we only
- * bother to reclaim clean inodes anyway.
- *
- * Hence the order of actions after gaining the locks should be:
- *	bad		=> reclaim
- *	shutdown	=> unpin and reclaim
- *	pinned, async	=> requeue
- *	pinned, sync	=> unpin
- *	stale		=> reclaim
- *	clean		=> reclaim
- *	dirty, async	=> requeue
- *	dirty, sync	=> flush, wait and reclaim
- *
- * Returns true if the inode was reclaimed, false otherwise.
- */
-STATIC bool
-xfs_reclaim_inode(
-	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	xfs_lsn_t		*lsn)
-{
-	xfs_ino_t		ino;
-
-	*lsn = 0;
-
-	/*
-	 * Don't try to flush the inode if another inode in this cluster has
-	 * already flushed it after we did the initial checks in
-	 * xfs_reclaim_inode_grab().
-	 */
-	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
-		goto out;
-	if (!xfs_iflock_nowait(ip))
-		goto out_unlock;
-
-	/* If we are in shutdown, we don't care about blocking. */
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-		xfs_iunpin_wait(ip);
-		/* xfs_iflush_abort() drops the flush lock */
-		xfs_iflush_abort(ip, false);
-		goto reclaim;
-	}
-
-	/*
-	 * If it is pinned, we only want to flush this if there's nothing else
-	 * to be flushed as it requires a log force. Hence we essentially set
-	 * the LSN to flush the entire AIL which will end up triggering a log
-	 * force to unpin this inode, but that will only happen if there are not
-	 * other inodes in the scan that only need writeback.
-	 */
-	if (xfs_ipincount(ip)) {
-		*lsn = ip->i_itemp->ili_last_lsn;
-		goto out_ifunlock;
-	}
-
-	/*
-	 * Dirty inode we didn't catch, skip it.
-	 */
-	if (!xfs_inode_clean(ip) && !xfs_iflags_test(ip, XFS_ISTALE)) {
-		*lsn = ip->i_itemp->ili_item.li_lsn;
-		goto out_ifunlock;
-	}
-
-	/*
-	 * It's clean, we have it locked, we can now drop the flush lock
-	 * and reclaim it.
-	 */
-	xfs_ifunlock(ip);
-
-reclaim:
-	ASSERT(!xfs_isiflocked(ip));
-	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE));
-	ASSERT(ip->i_ino != 0);
-
-	/*
-	 * Because we use RCU freeing we need to ensure the inode always appears
-	 * to be reclaimed with an invalid inode number when in the free state.
-	 * We do this as early as possible under the ILOCK so that
-	 * xfs_iflush_cluster() and xfs_ifree_cluster() can be guaranteed to
-	 * detect races with us here. By doing this, we guarantee that once
-	 * xfs_iflush_cluster() or xfs_ifree_cluster() has locked XFS_ILOCK that
-	 * it will see either a valid inode that will serialise correctly, or it
-	 * will see an invalid inode that it can skip.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	ino = ip->i_ino; /* for radix_tree_delete */
-	ip->i_flags = XFS_IRECLAIM;
-	ip->i_ino = 0;
-
-	/* XXX: temporary until lru based reclaim */
-	list_lru_del(&pag->pag_mount->m_inode_lru, &VFS_I(ip)->i_lru);
-	spin_unlock(&ip->i_flags_lock);
-
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
-	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
-	/*
-	 * Remove the inode from the per-AG radix tree.
-	 *
-	 * Because radix_tree_delete won't complain even if the item was never
-	 * added to the tree assert that it's been there before to catch
-	 * problems with the inode life time early on.
-	 */
-	spin_lock(&pag->pag_ici_lock);
-	if (!radix_tree_delete(&pag->pag_ici_root,
-				XFS_INO_TO_AGINO(ip->i_mount, ino)))
-		ASSERT(0);
-	xfs_perag_clear_reclaim_tag(pag);
-	spin_unlock(&pag->pag_ici_lock);
-
-	/*
-	 * Here we do an (almost) spurious inode lock in order to coordinate
-	 * with inode cache radix tree lookups.  This is because the lookup
-	 * can reference the inodes in the cache without taking references.
-	 *
-	 * We make that OK here by ensuring that we wait until the inode is
-	 * unlocked after the lookup before we go ahead and free it.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_qm_dqdetach(ip);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
-	__xfs_inode_free(ip);
-	return true;
-
-out_ifunlock:
-	xfs_ifunlock(ip);
-out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-out:
-	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	return false;
-}
-
-/*
- * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
- * corrupted, we still want to try to reclaim all the inodes. If we don't,
- * then a shut down during filesystem unmount reclaim walk leak all the
- * unreclaimed inodes.
- *
- * Return the number of inodes freed.
- */
-int
-xfs_reclaim_inodes_ag(
-	struct xfs_mount	*mp,
-	int			flags,
-	int			nr_to_scan)
-{
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		ag;
-	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
-	long			freed = 0;
-
-	ag = 0;
-	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
-		unsigned long	first_index = 0;
-		int		done = 0;
-		int		nr_found = 0;
-
-		ag = pag->pag_agno + 1;
-		first_index = pag->pag_ici_reclaim_cursor;
-
-		do {
-			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
-			int	i;
-
-			rcu_read_lock();
-			nr_found = radix_tree_gang_lookup_tag(
-					&pag->pag_ici_root,
-					(void **)batch, first_index,
-					XFS_LOOKUP_BATCH,
-					XFS_ICI_RECLAIM_TAG);
-			if (!nr_found) {
-				done = 1;
-				rcu_read_unlock();
-				break;
-			}
-
-			/*
-			 * Grab the inodes before we drop the lock. if we found
-			 * nothing, nr == 0 and the loop will be skipped.
-			 */
-			for (i = 0; i < nr_found; i++) {
-				struct xfs_inode *ip = batch[i];
-
-				if (done ||
-				    !xfs_reclaim_inode_grab(ip, flags, &lsn))
-					batch[i] = NULL;
-
-				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
-					lowest_lsn = lsn;
-
-				/*
-				 * Update the index for the next lookup. Catch
-				 * overflows into the next AG range which can
-				 * occur if we have inodes in the last block of
-				 * the AG and we are currently pointing to the
-				 * last inode.
-				 *
-				 * Because we may see inodes that are from the
-				 * wrong AG due to RCU freeing and
-				 * reallocation, only update the index if it
-				 * lies in this AG. It was a race that lead us
-				 * to see this inode, so another lookup from
-				 * the same index will not find it again.
-				 */
-				if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
-								pag->pag_agno)
-					continue;
-				first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
-				if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
-					done = 1;
-			}
-
-			/* unlock now we've grabbed the inodes. */
-			rcu_read_unlock();
-
-			for (i = 0; i < nr_found; i++) {
-				if (!batch[i])
-					continue;
-				if (xfs_reclaim_inode(batch[i], pag, &lsn))
-					freed++;
-				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
-					lowest_lsn = lsn;
-			}
-
-			nr_to_scan -= XFS_LOOKUP_BATCH;
-			cond_resched();
-
-		} while (nr_found && !done && nr_to_scan > 0);
-
-		if (!done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
-		xfs_perag_put(pag);
-	}
-
-	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
-		xfs_ail_push_sync(mp->m_ail, lowest_lsn);
-
-	return freed;
-}
-
 enum lru_status
 xfs_inode_reclaim_isolate(
 	struct list_head	*item,
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index dadc69a30f33..0b4d06691275 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -25,9 +25,8 @@ struct xfs_eofblocks {
  */
 #define XFS_ICI_NO_TAG		(-1)	/* special flag for an untagged lookup
 					   in xfs_inode_ag_iterator */
-#define XFS_ICI_RECLAIM_TAG	0	/* inode is to be reclaimed */
-#define XFS_ICI_EOFBLOCKS_TAG	1	/* inode has blocks beyond EOF */
-#define XFS_ICI_COWBLOCKS_TAG	2	/* inode can have cow blocks to gc */
+#define XFS_ICI_EOFBLOCKS_TAG	0	/* inode has blocks beyond EOF */
+#define XFS_ICI_COWBLOCKS_TAG	1	/* inode can have cow blocks to gc */
 
 /*
  * Flags for xfs_iget()
@@ -60,8 +59,6 @@ enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
 void xfs_dispose_inodes(struct list_head *freeable);
 void xfs_reclaim_inodes(struct xfs_mount *mp);
 
-void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
-
 void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip);
 void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip);
 int xfs_icache_free_eofblocks(struct xfs_mount *, struct xfs_eofblocks *);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4a4ecbc22246..ef63357da7af 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -383,8 +383,6 @@ typedef struct xfs_perag {
 
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
-	int		pag_ici_reclaimable;	/* reclaimable inodes */
-	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
 	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e3e898a2896c..0559fb686e9d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -955,7 +955,6 @@ xfs_fs_destroy_inode(
 	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
 	list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
 	spin_unlock(&ip->i_flags_lock);
-	xfs_inode_set_reclaim_tag(ip);
 }
 
 static void
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/24] xfs: account for memory freed from metadata buffers
  2019-08-01  2:17 ` [PATCH 11/24] xfs: account for memory freed from metadata buffers Dave Chinner
@ 2019-08-01  8:16   ` Christoph Hellwig
  2019-08-01  9:21     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2019-08-01  8:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

> +
> +		/*
> +		 * Account for the buffer memory freed here so memory reclaim
> +		 * sees this and not just the xfs_buf slab entry being freed.
> +		 */
> +		if (current->reclaim_state)
> +			current->reclaim_state->reclaimed_pages += bp->b_page_count;
> +

I think this wants a mm-layer helper ala:

static inline void shrinker_mark_pages_reclaimed(unsigned long nr_pages)
{
	if (current->reclaim_state)
		current->reclaim_state->reclaimed_pages += nr_pages;
}

plus good documentation on when to use it.
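
For example, a documented version might read something like this (a
sketch only; the location - say include/linux/swap.h next to struct
reclaim_state - is an assumption, and reclaimed_pages is the counter
this series uses):

/**
 * shrinker_mark_pages_reclaimed - account heap pages freed by a shrinker
 * @nr_pages:	number of pages freed
 *
 * Shrinkers that free memory backed by heap pages rather than slab
 * objects should call this so that memory reclaim sees those pages as
 * progress. It is a no-op outside of a reclaim context, i.e. when
 * current->reclaim_state is not set.
 */
static inline void shrinker_mark_pages_reclaimed(unsigned long nr_pages)
{
	if (current->reclaim_state)
		current->reclaim_state->reclaimed_pages += nr_pages;
}

The call site quoted above would then collapse to
shrinker_mark_pages_reclaimed(bp->b_page_count).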

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 11/24] xfs: account for memory freed from metadata buffers
  2019-08-01  8:16   ` Christoph Hellwig
@ 2019-08-01  9:21     ` Dave Chinner
  2019-08-06  5:51       ` Christoph Hellwig
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-01  9:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 01:16:03AM -0700, Christoph Hellwig wrote:
> > +
> > +		/*
> > +		 * Account for the buffer memory freed here so memory reclaim
> > +		 * sees this and not just the xfs_buf slab entry being freed.
> > +		 */
> > +		if (current->reclaim_state)
> > +			current->reclaim_state->reclaimed_pages += bp->b_page_count;
> > +
> 
> I think this wants a mm-layer helper ala:
> 
> static inline void shrinker_mark_pages_reclaimed(unsigned long nr_pages)
> {
> 	if (current->reclaim_state)
> 		current->reclaim_state->reclaimed_pages += nr_pages;
> }
> 
> plus good documentation on when to use it.

Sure, but that's something for patch 6, not this one :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-01  2:17 ` [PATCH 09/24] xfs: don't allow log IO to be throttled Dave Chinner
@ 2019-08-01 13:39   ` Chris Mason
  2019-08-01 23:58     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Chris Mason @ 2019-08-01 13:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel, Jens Axboe

On 31 Jul 2019, at 22:17, Dave Chinner wrote:

> From: Dave Chinner <dchinner@redhat.com>
>
> Running metadata intensive workloads, I've been seeing the AIL
> pushing getting stuck on pinned buffers and triggering log forces.
> The log force is taking a long time to run because the log IO is
> getting throttled by wbt_wait() - the block layer writeback
> throttle. It's being throttled because there is a huge amount of
> metadata writeback going on which is filling the request queue.
>
> IOWs, we have a priority inversion problem here.
>
> Mark the log IO bios with REQ_IDLE so they don't get throttled
> by the block layer writeback throttle. When we are forcing the CIL,
> we are likely to need to do tens of log IOs, and they are issued as
> fast as they can be build and IO completed. Hence REQ_IDLE is
> appropriate - it's an indication that more IO will follow shortly.
>
> And because we also set REQ_SYNC, the writeback throttle will now
> treat log IO the same way it treats direct IO writes - it will not
> throttle them at all. Hence we solve the priority inversion problem
> caused by the writeback throttle being unable to distinguish between
> high priority log IO and background metadata writeback.
>
  [ cc Jens ]

We spent a lot of time getting rid of these inversions in io.latency 
(and the new io.cost), where REQ_META just blows through the throttling 
and goes into back charging instead.

It feels awkward to have one set of prio inversion workarounds for io.* 
and another for wbt.  Jens, should we make an explicit one that doesn't 
rely on magic side effects, or just decide that metadata is meta enough 
to break all the rules?

-chris

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-01 13:39   ` Chris Mason
@ 2019-08-01 23:58     ` Dave Chinner
  2019-08-02  8:12       ` Christoph Hellwig
  2019-08-02 14:11       ` Chris Mason
  0 siblings, 2 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-01 23:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs, linux-mm, linux-fsdevel, Jens Axboe

On Thu, Aug 01, 2019 at 01:39:34PM +0000, Chris Mason wrote:
> On 31 Jul 2019, at 22:17, Dave Chinner wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Running metadata intensive workloads, I've been seeing the AIL
> > pushing getting stuck on pinned buffers and triggering log forces.
> > The log force is taking a long time to run because the log IO is
> > getting throttled by wbt_wait() - the block layer writeback
> > throttle. It's being throttled because there is a huge amount of
> > metadata writeback going on which is filling the request queue.
> >
> > IOWs, we have a priority inversion problem here.
> >
> > Mark the log IO bios with REQ_IDLE so they don't get throttled
> > by the block layer writeback throttle. When we are forcing the CIL,
> > we are likely to need to do tens of log IOs, and they are issued as
> > fast as they can be build and IO completed. Hence REQ_IDLE is
> > appropriate - it's an indication that more IO will follow shortly.
> >
> > And because we also set REQ_SYNC, the writeback throttle will now
> > treat log IO the same way it treats direct IO writes - it will not
> > throttle them at all. Hence we solve the priority inversion problem
> > caused by the writeback throttle being unable to distinguish between
> > high priority log IO and background metadata writeback.
> >
>   [ cc Jens ]
> 
> We spent a lot of time getting rid of these inversions in io.latency 
> (and the new io.cost), where REQ_META just blows through the throttling 
> and goes into back charging instead.

Which simply reinforces the fact that request type based
throttling is a fundamentally broken architecture.

> It feels awkward to have one set of prio inversion workarounds for io.* 
> and another for wbt.  Jens, should we make an explicit one that doesn't 
> rely on magic side effects, or just decide that metadata is meta enough 
> to break all the rules?

The problem isn't that REQ_META blows through the throttling, the problem
is that different REQ_META IOs have different priority.

IOWs, the problem here is that we are trying to infer priority from
the request type rather than an actual priority assigned by the
submitter. There is no way direct IO has higher priority in a
filesystem than log IO tagged with REQ_META as direct IO can require
log IO to make progress. Priority is a policy determined by the
submitter, not the mechanism doing the throttling.

Can we please move this all over to priorities based on
bio->b_ioprio? And then document how the range of priorities are
managed, such as:

(99 = highest prio to 0 = lowest)

swap out
swap in				>90
User hard RT max		89
User hard RT min		80
filesystem max			79
ionice max			60
background data writeback	40
ionice min			20
filesystem min			10
idle				0

So that we can appropriately prioritise different types of kernel
internal IO w.r.t user controlled IO priorities? This way we can
still tag the bios with the type of data they contain, but we
no longer use that to determine whether to throttle that IO or not -
throttling/scheduling should be done entirely on a priority basis.
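
As a rough sketch of the submitter side (the struct bio field is
bi_ioprio; the helper name and the class/level picked for log IO here
are illustrative only, not a proposal for the final mapping):

#include <linux/bio.h>
#include <linux/ioprio.h>

/* hypothetical: tag log bios with an explicit, high priority */
static void xlog_bio_set_prio(struct bio *bio)
{
	/*
	 * Log IO sits at "filesystem max" in the table above, so give
	 * it a high priority and let the writeback throttle key off
	 * that rather than off REQ_META/REQ_IDLE side effects.
	 */
	bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
}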

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-01 23:58     ` Dave Chinner
@ 2019-08-02  8:12       ` Christoph Hellwig
  2019-08-02 14:11       ` Chris Mason
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2019-08-02  8:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chris Mason, linux-xfs, linux-mm, linux-fsdevel, Jens Axboe

On Fri, Aug 02, 2019 at 09:58:49AM +1000, Dave Chinner wrote:
> Which simply reinforces the fact that request type based
> throttling is a fundamentally broken architecture.
> 
> > It feels awkward to have one set of prio inversion workarounds for io.* 
> > and another for wbt.  Jens, should we make an explicit one that doesn't 
> > rely on magic side effects, or just decide that metadata is meta enough 
> > to break all the rules?
> 
> The problem isn't that REQ_META blows through the throttling, the problem
> is that different REQ_META IOs have different priority.
> 
> IOWs, the problem here is that we are trying to infer priority from
> the request type rather than an actual priority assigned by the
> submitter. There is no way direct IO has higher priority in a
> filesystem than log IO tagged with REQ_META as direct IO can require
> log IO to make progress. Priority is a policy determined by the
> submitter, not the mechanism doing the throttling.
> 
> Can we please move this all over to priorities based on
> bio->b_ioprio? And then document how the range of priorities are
> managed, such as:

Yes, we need to fix the magically deduced throttling behavior, especially
the odd REQ_IDLE that in its various incarnations has been a massive
source of trouble and confusion.  Not sure tons of priorities are
really helping, given that even hardware with priority level support
usually just supports about two priority levels.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-01 23:58     ` Dave Chinner
  2019-08-02  8:12       ` Christoph Hellwig
@ 2019-08-02 14:11       ` Chris Mason
  2019-08-02 18:34         ` Matthew Wilcox
  2019-08-02 23:28         ` Dave Chinner
  1 sibling, 2 replies; 87+ messages in thread
From: Chris Mason @ 2019-08-02 14:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel, Jens Axboe

On 1 Aug 2019, at 19:58, Dave Chinner wrote:

> On Thu, Aug 01, 2019 at 01:39:34PM +0000, Chris Mason wrote:
>> On 31 Jul 2019, at 22:17, Dave Chinner wrote:
>>
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> Running metadata intensive workloads, I've been seeing the AIL
>>> pushing getting stuck on pinned buffers and triggering log forces.
>>> The log force is taking a long time to run because the log IO is
>>> getting throttled by wbt_wait() - the block layer writeback
>>> throttle. It's being throttled because there is a huge amount of
>>> metadata writeback going on which is filling the request queue.
>>>
>>> IOWs, we have a priority inversion problem here.
>>>
>>> Mark the log IO bios with REQ_IDLE so they don't get throttled
>>> by the block layer writeback throttle. When we are forcing the CIL,
>>> we are likely to need to do tens of log IOs, and they are issued as
>>> fast as they can be build and IO completed. Hence REQ_IDLE is
>>> appropriate - it's an indication that more IO will follow shortly.
>>>
>>> And because we also set REQ_SYNC, the writeback throttle will now
>>> treat log IO the same way it treats direct IO writes - it will not
>>> throttle them at all. Hence we solve the priority inversion problem
>>> caused by the writeback throttle being unable to distinguish between
>>> high priority log IO and background metadata writeback.
>>>
>>   [ cc Jens ]
>>
>> We spent a lot of time getting rid of these inversions in io.latency
>> (and the new io.cost), where REQ_META just blows through the 
>> throttling
>> and goes into back charging instead.
>
> Which simply reinforces the fact that request type based
> throttling is a fundamentally broken architecture.
>
>> It feels awkward to have one set of prio inversion workarounds for 
>> io.*
>> and another for wbt.  Jens, should we make an explicit one that 
>> doesn't
>> rely on magic side effects, or just decide that metadata is meta 
>> enough
>> to break all the rules?
>
> The problem isn't that REQ_META blows through the throttling, the problem
> is that different REQ_META IOs have different priority.

Yes and no.  At some point important FS threads have the potential to 
wait on every single REQ_META IO on the box, so every single REQ_META IO 
has the potential to create priority inversions.

>
> IOWs, the problem here is that we are trying to infer priority from
> the request type rather than an actual priority assigned by the
> submitter. There is no way direct IO has higher priority in a
> filesystem than log IO tagged with REQ_META as direct IO can require
> log IO to make progress. Priority is a policy determined by the
> submitter, not the mechanism doing the throttling.
>
> Can we please move this all over to priorities based on
> bio->b_ioprio? And then document how the range of priorities are
> managed, such as:
>
> (99 = highest prio to 0 = lowest)
>
> swap out
> swap in				>90
> User hard RT max		89
> User hard RT min		80
> filesystem max			79
> ionice max			60
> background data writeback	40
> ionice min			20
> filesystem min			10
> idle				0
>
> So that we can appropriately prioritise different types of kernel
> internal IO w.r.t user controlled IO priorities? This way we can
> still tag the bios with the type of data they contain, but we
> no longer use that to determine whether to throttle that IO or not -
> throttling/scheduling should be done entirely on a priority basis.

I think you and I are describing solutions to different problems.  The 
reason the back charging works so well in io.latency and io.cost is 
because the IO controllers are able to remember that a given cgroup 
created X amount of IO, and then just make that cgroup wait at a safe 
time, instead of trying to assign priority to things that have infinite 
priority.

I can't really see bio->b_ioprio working without the rest of the IO 
controller logic creating a sensible system, and giving userland the 
framework to define weights etc.  My question is if it's worth trying 
inside of the wbt code, or if we should just let the metadata go 
through.

Tejun reminded me that in a lot of ways, swap is user IO and it's 
actually fine to have it prioritized at the same level as user IO.  We 
don't want to let a low prio app thrash the drive swapping things in and 
out all the time, and it's actually fine to make them wait as long as 
other higher priority processes aren't waiting for the memory.  This 
depends on the cgroup config, so wrt your current patches it probably 
sounds crazy, but we have a lot of data around this from the fleet.

-chris

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 03/24] mm: factor shrinker work calculations
  2019-08-01  2:17 ` [PATCH 03/24] mm: factor shrinker work calculations Dave Chinner
@ 2019-08-02 15:08   ` Nikolay Borisov
  2019-08-04  2:05     ` Dave Chinner
  2019-08-02 15:31   ` Brian Foster
  1 sibling, 1 reply; 87+ messages in thread
From: Nikolay Borisov @ 2019-08-02 15:08 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: linux-mm, linux-fsdevel



On 1.08.19 г. 5:17 ч., Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Start to clean up the shrinker code by factoring out the calculation
> that determines how much work to do. This separates the calculation
> from clamping and other adjustments that are done before the
> shrinker work is run.
> 
> Also convert the calculation for the amount of work to be done to
> use 64 bit logic so we don't have to keep jumping through hoops to
> keep calculations within 32 bits on 32 bit systems.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/vmscan.c | 74 ++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 47 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ae3035fe94bc..b7472953b0e6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -464,13 +464,45 @@ EXPORT_SYMBOL(unregister_shrinker);
>  
>  #define SHRINK_BATCH 128
>  
> +/*
> + * Calculate the number of new objects to scan this time around. Return
> + * the work to be done. If there are freeable objects, return that number in
> + * @freeable_objects.
> + */
> +static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
> +			    struct shrinker *shrinker, int priority,
> +			    int64_t *freeable_objects)

nit: make the return parm definition also uint64_t; also, we have u64 types.

> +{
> +	uint64_t delta;
> +	uint64_t freeable;
> +
> +	freeable = shrinker->count_objects(shrinker, shrinkctl);
> +	if (freeable == 0 || freeable == SHRINK_EMPTY)
> +		return freeable;
> +
> +	if (shrinker->seeks) {
> +		delta = freeable >> (priority - 2);
> +		do_div(delta, shrinker->seeks);

a comment about the reasoning behind this calculation would be nice.

> +	} else {
> +		/*
> +		 * These objects don't require any IO to create. Trim
> +		 * them aggressively under memory pressure to keep
> +		 * them from causing refetches in the IO caches.
> +		 */
> +		delta = freeable / 2;
> +	}
> +
> +	*freeable_objects = freeable;
> +	return delta > 0 ? delta : 0;
> +}
> +
>  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  				    struct shrinker *shrinker, int priority)
>  {
>  	unsigned long freed = 0;
> -	unsigned long long delta;
>  	long total_scan;
> -	long freeable;
> +	int64_t freeable_objects = 0;
> +	int64_t scan_count;

why int and not uint64? We can never have a negative object count, right?

>  	long nr;
>  	long new_nr;
>  	int nid = shrinkctl->nid;
> @@ -481,9 +513,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>  		nid = 0;
>  
> -	freeable = shrinker->count_objects(shrinker, shrinkctl);
> -	if (freeable == 0 || freeable == SHRINK_EMPTY)
> -		return freeable;
> +	scan_count = shrink_scan_count(shrinkctl, shrinker, priority,
> +					&freeable_objects);
> +	if (scan_count == 0 || scan_count == SHRINK_EMPTY)
> +		return scan_count;
>  
>  	/*
>  	 * copy the current shrinker scan count into a local variable
> @@ -492,25 +525,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 */
>  	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>  
> -	total_scan = nr;
> -	if (shrinker->seeks) {
> -		delta = freeable >> priority;
> -		delta *= 4;
> -		do_div(delta, shrinker->seeks);
> -	} else {
> -		/*
> -		 * These objects don't require any IO to create. Trim
> -		 * them aggressively under memory pressure to keep
> -		 * them from causing refetches in the IO caches.
> -		 */
> -		delta = freeable / 2;
> -	}
> -
> -	total_scan += delta;
> +	total_scan = nr + scan_count;
>  	if (total_scan < 0) {
>  		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
>  		       shrinker->scan_objects, total_scan);
> -		total_scan = freeable;
> +		total_scan = scan_count;
>  		next_deferred = nr;
>  	} else
>  		next_deferred = total_scan;
> @@ -527,19 +546,20 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * Hence only allow the shrinker to scan the entire cache when
>  	 * a large delta change is calculated directly.
>  	 */
> -	if (delta < freeable / 4)
> -		total_scan = min(total_scan, freeable / 2);
> +	if (scan_count < freeable_objects / 4)
> +		total_scan = min_t(long, total_scan, freeable_objects / 2);
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
>  	 * never try to free more than twice the estimate number of
>  	 * freeable entries.
>  	 */
> -	if (total_scan > freeable * 2)
> -		total_scan = freeable * 2;
> +	if (total_scan > freeable_objects * 2)
> +		total_scan = freeable_objects * 2;
>  
>  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> -				   freeable, delta, total_scan, priority);
> +				   freeable_objects, scan_count,
> +				   total_scan, priority);
>  
>  	/*
>  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> @@ -564,7 +584,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * possible.
>  	 */
>  	while (total_scan >= batch_size ||
> -	       total_scan >= freeable) {
> +	       total_scan >= freeable_objects) {
>  		unsigned long ret;
>  		unsigned long nr_to_scan = min(batch_size, total_scan);
>  
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/24] mm: directed shrinker work deferral
  2019-08-01  2:17 ` [PATCH 01/24] mm: directed shrinker work deferral Dave Chinner
@ 2019-08-02 15:27   ` Brian Foster
  2019-08-04  1:49     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-02 15:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Introduce a mechanism for ->count_objects() to indicate to the
> shrinker infrastructure that the reclaim context will not allow
> scanning work to be done and so the work it decides is necessary
> needs to be deferred.
> 
> This simplifies the code by separating out the accounting of
> deferred work from the actual doing of the work, and allows better
> decisions to be made by the shrinker control logic on what action it
> can take.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/shrinker.h | 7 +++++++
>  mm/vmscan.c              | 8 ++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 9443cafd1969..af78c475fc32 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -31,6 +31,13 @@ struct shrink_control {
>  
>  	/* current memcg being shrunk (for memcg aware shrinkers) */
>  	struct mem_cgroup *memcg;
> +
> +	/*
> +	 * set by ->count_objects if reclaim context prevents reclaim from
> +	 * occurring. This allows the shrinker to immediately defer all the
> +	 * work and not even attempt to scan the cache.
> +	 */
> +	bool will_defer;

Functionality wise this seems fairly straightforward. FWIW, I find the
'will_defer' name a little confusing because it implies to me that the
shrinker is telling the caller about something it would do if called as
opposed to explicitly telling the caller to defer. I'd just call it
'defer' I guess, but that's just my .02. ;P

>  };
>  
>  #define SHRINK_STOP (~0UL)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 44df66a98f2a..ae3035fe94bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
>  				   freeable, delta, total_scan, priority);
>  
> +	/*
> +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> +	 * defer the work to a context that can scan the cache.
> +	 */
> +	if (shrinkctl->will_defer)
> +		goto done;
> +

Who's responsible for clearing the flag? Perhaps we should do so here
once it's acted upon since we don't call into the shrinker again?

Note that I see this structure is reinitialized on every iteration in
the caller, but there already is the SHRINK_EMPTY case where we call
back into do_shrink_slab(). Granted the deferred state likely hasn't
changed, but the fact that we'd call back into the count callback to set
it again implies the logic could be a bit more explicit, particularly if
this will eventually be used for more dynamic shrinker state that might
change call to call (i.e., object dirty state, etc.).

BTW, do we need to care about the ->nr_cached_objects() call from the
generic superblock shrinker (super_cache_scan())?

Brian

>  	/*
>  	 * Normally, we should not scan less than batch_size objects in one
>  	 * pass to avoid too frequent shrinker calls, but if the slab has less
> @@ -575,6 +582,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		cond_resched();
>  	}
>  
> +done:
>  	if (next_deferred >= scanned)
>  		next_deferred -= scanned;
>  	else
> -- 
> 2.22.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers
  2019-08-01  2:17 ` [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers Dave Chinner
@ 2019-08-02 15:27   ` Brian Foster
  2019-08-04  1:50     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-02 15:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:30PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For shrinkers that currently avoid scanning when called under
> GFP_NOFS contexts, convert them to use the new ->will_defer flag
> rather than checking and returning errors during scans.
> 
> This makes it very clear that these shrinkers are not doing any work
> because of the context limitations, not because there is no work
> that can be done.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  drivers/staging/android/ashmem.c |  8 ++++----
>  fs/gfs2/glock.c                  |  5 +++--
>  fs/gfs2/quota.c                  |  6 +++---
>  fs/nfs/dir.c                     |  6 +++---
>  fs/super.c                       |  6 +++---
>  fs/xfs/xfs_buf.c                 |  4 ++++
>  fs/xfs/xfs_qm.c                  | 11 ++++++++---
>  net/sunrpc/auth.c                |  5 ++---
>  8 files changed, 30 insertions(+), 21 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index ca0849043f54..6e0f76532535 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1680,6 +1680,10 @@ xfs_buftarg_shrink_count(
>  {
>  	struct xfs_buftarg	*btp = container_of(shrink,
>  					struct xfs_buftarg, bt_shrinker);
> +
> +	if (!(sc->gfp_mask & __GFP_FS))
> +		sc->will_defer = true;
> +
>  	return list_lru_shrink_count(&btp->bt_lru, sc);
>  }

This hunk looks like a behavior change / bug fix..? The rest of the
patch converts existing logic to bail out of scans to use the new count
time defer mechanism. The change is probably fine, but I think we should
have a separate patch to introduce this behavior in the first place
(which BTW could be sent as a standalone patch and just picked up by
this on eventual rebase).

Brian

>  
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 5e7a37f0cf84..13c842e8f13b 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -502,9 +502,6 @@ xfs_qm_shrink_scan(
>  	unsigned long		freed;
>  	int			error;
>  
> -	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
> -		return 0;
> -
>  	INIT_LIST_HEAD(&isol.buffers);
>  	INIT_LIST_HEAD(&isol.dispose);
>  
> @@ -534,6 +531,14 @@ xfs_qm_shrink_count(
>  	struct xfs_quotainfo	*qi = container_of(shrink,
>  					struct xfs_quotainfo, qi_shrinker);
>  
> +	/*
> +	 * __GFP_DIRECT_RECLAIM is used here to avoid blocking kswapd
> +	 */
> +	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) !=
> +					(__GFP_FS|__GFP_DIRECT_RECLAIM)) {
> +		sc->will_defer = true;
> +	}
> +
>  	return list_lru_shrink_count(&qi->qi_lru, sc);
>  }
>  
> diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
> index cdb05b48de44..6babcbac4a00 100644
> --- a/net/sunrpc/auth.c
> +++ b/net/sunrpc/auth.c
> @@ -527,9 +527,6 @@ static unsigned long
>  rpcauth_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  
>  {
> -	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
> -		return SHRINK_STOP;
> -
>  	/* nothing left, don't come back */
>  	if (list_empty(&cred_unused))
>  		return SHRINK_STOP;
> @@ -541,6 +538,8 @@ static unsigned long
>  rpcauth_cache_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  
>  {
> +	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
> +		sc->will_defer = true;
>  	return number_cred_unused * sysctl_vfs_cache_pressure / 100;
>  }
>  
> -- 
> 2.22.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 03/24] mm: factor shrinker work calculations
  2019-08-01  2:17 ` [PATCH 03/24] mm: factor shrinker work calculations Dave Chinner
  2019-08-02 15:08   ` Nikolay Borisov
@ 2019-08-02 15:31   ` Brian Foster
  1 sibling, 0 replies; 87+ messages in thread
From: Brian Foster @ 2019-08-02 15:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:31PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Start to clean up the shrinker code by factoring out the calculation
> that determines how much work to do. This separates the calculation
> from clamping and other adjustments that are done before the
> shrinker work is run.
> 
> Also convert the calculation for the amount of work to be done to
> use 64 bit logic so we don't have to keep jumping through hoops to
> keep calculations within 32 bits on 32 bit systems.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/vmscan.c | 74 ++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 47 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ae3035fe94bc..b7472953b0e6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -464,13 +464,45 @@ EXPORT_SYMBOL(unregister_shrinker);
>  
>  #define SHRINK_BATCH 128
>  
> +/*
> + * Calculate the number of new objects to scan this time around. Return
> + * the work to be done. If there are freeable objects, return that number in
> + * @freeable_objects.
> + */
> +static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
> +			    struct shrinker *shrinker, int priority,
> +			    int64_t *freeable_objects)
> +{
> +	uint64_t delta;
> +	uint64_t freeable;
> +
> +	freeable = shrinker->count_objects(shrinker, shrinkctl);
> +	if (freeable == 0 || freeable == SHRINK_EMPTY)
> +		return freeable;
> +
> +	if (shrinker->seeks) {
> +		delta = freeable >> (priority - 2);
> +		do_div(delta, shrinker->seeks);
> +	} else {
> +		/*
> +		 * These objects don't require any IO to create. Trim
> +		 * them aggressively under memory pressure to keep
> +		 * them from causing refetches in the IO caches.
> +		 */
> +		delta = freeable / 2;
> +	}
> +
> +	*freeable_objects = freeable;
> +	return delta > 0 ? delta : 0;

I see Nikolay had some similar comments but FWIW delta is unsigned so
I'm not sure the point of the > 0 check.

Brian

> +}
> +
>  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  				    struct shrinker *shrinker, int priority)
>  {
>  	unsigned long freed = 0;
> -	unsigned long long delta;
>  	long total_scan;
> -	long freeable;
> +	int64_t freeable_objects = 0;
> +	int64_t scan_count;
>  	long nr;
>  	long new_nr;
>  	int nid = shrinkctl->nid;
> @@ -481,9 +513,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>  		nid = 0;
>  
> -	freeable = shrinker->count_objects(shrinker, shrinkctl);
> -	if (freeable == 0 || freeable == SHRINK_EMPTY)
> -		return freeable;
> +	scan_count = shrink_scan_count(shrinkctl, shrinker, priority,
> +					&freeable_objects);
> +	if (scan_count == 0 || scan_count == SHRINK_EMPTY)
> +		return scan_count;
>  
>  	/*
>  	 * copy the current shrinker scan count into a local variable
> @@ -492,25 +525,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 */
>  	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>  
> -	total_scan = nr;
> -	if (shrinker->seeks) {
> -		delta = freeable >> priority;
> -		delta *= 4;
> -		do_div(delta, shrinker->seeks);
> -	} else {
> -		/*
> -		 * These objects don't require any IO to create. Trim
> -		 * them aggressively under memory pressure to keep
> -		 * them from causing refetches in the IO caches.
> -		 */
> -		delta = freeable / 2;
> -	}
> -
> -	total_scan += delta;
> +	total_scan = nr + scan_count;
>  	if (total_scan < 0) {
>  		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
>  		       shrinker->scan_objects, total_scan);
> -		total_scan = freeable;
> +		total_scan = scan_count;

Why the change from the (now) freeable_objects value to scan_count?

Brian

>  		next_deferred = nr;
>  	} else
>  		next_deferred = total_scan;
> @@ -527,19 +546,20 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * Hence only allow the shrinker to scan the entire cache when
>  	 * a large delta change is calculated directly.
>  	 */
> -	if (delta < freeable / 4)
> -		total_scan = min(total_scan, freeable / 2);
> +	if (scan_count < freeable_objects / 4)
> +		total_scan = min_t(long, total_scan, freeable_objects / 2);
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
>  	 * never try to free more than twice the estimate number of
>  	 * freeable entries.
>  	 */
> -	if (total_scan > freeable * 2)
> -		total_scan = freeable * 2;
> +	if (total_scan > freeable_objects * 2)
> +		total_scan = freeable_objects * 2;
>  
>  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> -				   freeable, delta, total_scan, priority);
> +				   freeable_objects, scan_count,
> +				   total_scan, priority);
>  
>  	/*
>  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> @@ -564,7 +584,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * possible.
>  	 */
>  	while (total_scan >= batch_size ||
> -	       total_scan >= freeable) {
> +	       total_scan >= freeable_objects) {
>  		unsigned long ret;
>  		unsigned long nr_to_scan = min(batch_size, total_scan);
>  
> -- 
> 2.22.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 04/24] shrinker: defer work only to kswapd
  2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
@ 2019-08-02 15:34   ` Brian Foster
  2019-08-04 16:48   ` Nikolay Borisov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 87+ messages in thread
From: Brian Foster @ 2019-08-02 15:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:32PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Right now deferred work is picked up by whatever GFP_KERNEL context
> reclaimer that wins the race to empty the node's deferred work
> counter. However, if there are lots of direct reclaimers, that
> work might be continually picked up by contexts that can't do any
> work and so the opportunities to do the work are missed by contexts
> that could do them.
> 
> A further problem with the current code is that the deferred work
> can be picked up by a random direct reclaimer, resulting in that
> specific process having to do all the deferred reclaim work and
> hence can take extremely long latencies if the reclaim work blocks
> regularly. This is not good for direct reclaim fairness or for
> minimising long tail latency events.
> 
> To avoid these problems, simply limit deferred work to kswapd
> contexts. We know kswapd is a context that can always do reclaim
> work, and hence deferring work to kswapd allows the deferred work to
> be done in the background and not adversely affect any specific
> process context doing direct reclaim.
> 
> The advantage of this is that the amount of work to be done in direct
> reclaim is now bound and predictable - it is entirely based on
> the cache's freeable objects and the reclaim priority. Hence all
> direct reclaimers running at the same time should be doing
> relatively equal amounts of work, thereby reducing the incidence of
> long tail latencies due to uneven reclaim workloads.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/vmscan.c | 93 ++++++++++++++++++++++++++++-------------------------
>  1 file changed, 50 insertions(+), 43 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b7472953b0e6..c583b4efb9bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -500,15 +500,15 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  				    struct shrinker *shrinker, int priority)
>  {
>  	unsigned long freed = 0;
> -	long total_scan;
>  	int64_t freeable_objects = 0;
>  	int64_t scan_count;
> -	long nr;
> +	int64_t scanned_objects = 0;
> +	int64_t next_deferred = 0;
> +	int64_t deferred_count = 0;
>  	long new_nr;
>  	int nid = shrinkctl->nid;
>  	long batch_size = shrinker->batch ? shrinker->batch
>  					  : SHRINK_BATCH;
> -	long scanned = 0, next_deferred;
>  
>  	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>  		nid = 0;
> @@ -519,47 +519,53 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		return scan_count;
>  
>  	/*
> -	 * copy the current shrinker scan count into a local variable
> -	 * and zero it so that other concurrent shrinker invocations
> -	 * don't also do this scanning work.
> +	 * If kswapd, we take all the deferred work and do it here. We don't let
> +	 * direct reclaim do this, because then it means some poor sod is going
> +	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
> +	 * amount of reclaim work from concurrent kswapd operations. Hence we do
> +	 * the work in the wrong place, at the wrong time, and it's largely
> +	 * unpredictable.
> +	 *
> +	 * By doing the deferred work only in kswapd, we can schedule the work
> +	 * according to the reclaim priority - low priority reclaim will do
> +	 * less deferred work, hence we'll do more of the deferred work the more
> +	 * desperate we become for free memory. This avoids the need to
> +	 * specifically avoid deferred work windup, as low amounts of memory
> +	 * pressure won't excessively trim caches anymore.
>  	 */
> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +	if (current_is_kswapd()) {
> +		int64_t	deferred_scan;
>  
> -	total_scan = nr + scan_count;
> -	if (total_scan < 0) {
> -		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> -		       shrinker->scan_objects, total_scan);
> -		total_scan = scan_count;
> -		next_deferred = nr;
> -	} else
> -		next_deferred = total_scan;
> +		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
>  
> -	/*
> -	 * We need to avoid excessive windup on filesystem shrinkers
> -	 * due to large numbers of GFP_NOFS allocations causing the
> -	 * shrinkers to return -1 all the time. This results in a large
> -	 * nr being built up so when a shrink that can do some work
> -	 * comes along it empties the entire cache due to nr >>>
> -	 * freeable. This is bad for sustaining a working set in
> -	 * memory.
> -	 *
> -	 * Hence only allow the shrinker to scan the entire cache when
> -	 * a large delta change is calculated directly.
> -	 */
> -	if (scan_count < freeable_objects / 4)
> -		total_scan = min_t(long, total_scan, freeable_objects / 2);
> +		/* we want to scan 5-10% of the deferred work here at minimum */
> +		deferred_scan = deferred_count;
> +		if (priority)
> +			do_div(deferred_scan, priority);
> +		scan_count += deferred_scan;
> +
> +		/*
> +		 * If there is more deferred work than the number of freeable
> +		 * items in the cache, limit the amount of work we will carry
> +		 * over to the next kswapd run on this cache. This prevents
> +		 * deferred work windup.
> +		 */
> +		if (deferred_count > freeable_objects * 2)
> +			deferred_count = freeable_objects * 2;
> +

Hmm, what's the purpose of this check? Is this not handled once the
deferred count is absorbed into scan_count (where we apply the same
logic a few lines below)? Perhaps the latter prevents too much scanning
in a single call into the shrinker whereas this check prevents kswapd
from getting too far behind?

> +	}
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
>  	 * never try to free more than twice the estimate number of
>  	 * freeable entries.
>  	 */
> -	if (total_scan > freeable_objects * 2)
> -		total_scan = freeable_objects * 2;
> +	if (scan_count > freeable_objects * 2)
> +		scan_count = freeable_objects * 2;
>  
> -	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> +	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
>  				   freeable_objects, scan_count,
> -				   total_scan, priority);
> +				   scan_count, priority);
>  
>  	/*
>  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> @@ -583,10 +589,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * scanning at high prio and therefore should try to reclaim as much as
>  	 * possible.
>  	 */
> -	while (total_scan >= batch_size ||
> -	       total_scan >= freeable_objects) {
> +	while (scan_count >= batch_size ||
> +	       scan_count >= freeable_objects) {
>  		unsigned long ret;
> -		unsigned long nr_to_scan = min(batch_size, total_scan);
> +		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
>  
>  		shrinkctl->nr_to_scan = nr_to_scan;
>  		shrinkctl->nr_scanned = nr_to_scan;
> @@ -596,17 +602,17 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		freed += ret;
>  
>  		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
> -		total_scan -= shrinkctl->nr_scanned;
> -		scanned += shrinkctl->nr_scanned;
> +		scan_count -= shrinkctl->nr_scanned;
> +		scanned_objects += shrinkctl->nr_scanned;
>  
>  		cond_resched();
>  	}
>  
>  done:
> -	if (next_deferred >= scanned)
> -		next_deferred -= scanned;
> -	else
> -		next_deferred = 0;
> +	if (deferred_count)
> +		next_deferred = deferred_count - scanned_objects;
> +	else if (scan_count > 0)
> +		next_deferred = scan_count;

I was wondering why we dropped the >= scanned_objects check, but I see
that next_deferred is signed and we check for next_deferred > 0 below.
What is odd is that so is scan_count, yet we check > 0 here for
assignment to the same variable. Can we be a little more consistent here
one way or the other?

Brian

>  	/*
>  	 * move the unused scan count back into the shrinker in a
>  	 * manner that handles concurrent updates. If we exhausted the
> @@ -618,7 +624,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	else
>  		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>  
> -	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
> +	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
> +					scan_count);
>  	return freed;
>  }
>  
> -- 
> 2.22.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-02 14:11       ` Chris Mason
@ 2019-08-02 18:34         ` Matthew Wilcox
  2019-08-02 23:28         ` Dave Chinner
  1 sibling, 0 replies; 87+ messages in thread
From: Matthew Wilcox @ 2019-08-02 18:34 UTC (permalink / raw)
  To: Chris Mason; +Cc: Dave Chinner, linux-xfs, linux-mm, linux-fsdevel, Jens Axboe

On Fri, Aug 02, 2019 at 02:11:53PM +0000, Chris Mason wrote:
> Yes and no.  At some point important FS threads have the potential to 
> wait on every single REQ_META IO on the box, so every single REQ_META IO 
> has the potential to create priority inversions.

[...]

> Tejun reminded me that in a lot of ways, swap is user IO and it's 
> actually fine to have it prioritized at the same level as user IO.  We 
> don't want to let a low prio app thrash the drive swapping things in and 
> out all the time, and it's actually fine to make them wait as long as 
> other higher priority processes aren't waiting for the memory.  This 
> depends on the cgroup config, so wrt your current patches it probably 
> sounds crazy, but we have a lot of data around this from the fleet.

swap is only user IO if we're doing the swapping in response to an
allocation done on behalf of a user thread.  If one of the above-mentioned
important FS threads does a memory allocation which causes swapping,
that priority needs to be inherited by the IO.
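
Something along these lines at swap-out submission, as a sketch only:
bio->bi_ioprio is the real field, but the hook point and the assumption that
get_current_ioprio() is usable here are mine.

static void swap_out_set_prio(struct bio *bio)
{
	/*
	 * Direct reclaim runs in the allocating task's context, so the
	 * swap-out bio inherits whatever priority that task has - an
	 * important FS thread's swap-out is no longer tagged as lowly
	 * "user swap".
	 */
	bio->bi_ioprio = get_current_ioprio();
}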

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-02 14:11       ` Chris Mason
  2019-08-02 18:34         ` Matthew Wilcox
@ 2019-08-02 23:28         ` Dave Chinner
  2019-08-05 18:32           ` Chris Mason
  1 sibling, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-02 23:28 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs, linux-mm, linux-fsdevel, Jens Axboe

On Fri, Aug 02, 2019 at 02:11:53PM +0000, Chris Mason wrote:
> On 1 Aug 2019, at 19:58, Dave Chinner wrote:
> I can't really see bio->bi_ioprio working without the rest of the IO 
> controller logic creating a sensible system,

That's exactly the problem we need to solve. The current situation
is ... untenable. Regardless of whether the io.latency controller
works well, the fact is that the wbt subsystem is active on -all-
configurations and the way it "prioritises" is completely broken.

> framework to define weights etc.  My question is if it's worth trying 
> inside of the wbt code, or if we should just let the metadata go 
> through.

As I said, that doesn't solve the problem. We /want/ critical
journal IO to have higher priority than background metadata
writeback. Just ignoring REQ_META doesn't help us there - it just
moves the priority inversion to blocking on request queue tags.

> Tejun reminded me that in a lot of ways, swap is user IO and it's 
> actually fine to have it prioritized at the same level as user IO.  We 

I think that's wrong. Swap *in* could have user priority but swap
*out* is global as there is no guarantee that the page being swapped
belongs to the user context that is reclaiming memory.

Lots of other user and kernel reclaim contexts may be waiting on
that swap to complete, so it's important that swap out is not
arbitrarily delayed or susceptible to priority inversions. i.e. swap
out must take priority over swap-in and other user IO because that
IO may require allocation to make progress via swapping to free
"user" file data cached in memory....

> don't want to let a low prio app thrash the drive swapping things in and 
> out all the time,

Low priority apps will be throttled on *swap in* IO - i.e. by their
incoming memory demand. High priority apps should be swapping out
low priority app memory if there are shortages - that's what priority
defines....

> other higher priority processes aren't waiting for the memory.  This 
> depends on the cgroup config, so wrt your current patches it probably 
> sounds crazy, but we have a lot of data around this from the fleet.

I'm not using cgroups.

Core infrastructure needs to work without cgroups being configured
to confine everything in userspace to "safe" bounds, and right now
just running things in the root cgroup doesn't appear to work very
well at all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/24] mm: directed shrinker work deferral
  2019-08-02 15:27   ` Brian Foster
@ 2019-08-04  1:49     ` Dave Chinner
  2019-08-05 17:42       ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-04  1:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Fri, Aug 02, 2019 at 11:27:09AM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Introduce a mechanism for ->count_objects() to indicate to the
> > shrinker infrastructure that the reclaim context will not allow
> > scanning work to be done and so the work it decides is necessary
> > needs to be deferred.
> > 
> > This simplifies the code by separating out the accounting of
> > deferred work from the actual doing of the work, and allows better
> > decisions to be made by the shrinker control logic on what action it
> > can take.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  include/linux/shrinker.h | 7 +++++++
> >  mm/vmscan.c              | 8 ++++++++
> >  2 files changed, 15 insertions(+)
> > 
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 9443cafd1969..af78c475fc32 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -31,6 +31,13 @@ struct shrink_control {
> >  
> >  	/* current memcg being shrunk (for memcg aware shrinkers) */
> >  	struct mem_cgroup *memcg;
> > +
> > +	/*
> > +	 * set by ->count_objects if reclaim context prevents reclaim from
> > +	 * occurring. This allows the shrinker to immediately defer all the
> > +	 * work and not even attempt to scan the cache.
> > +	 */
> > +	bool will_defer;
> 
> Functionality wise this seems fairly straightforward. FWIW, I find the
> 'will_defer' name a little confusing because it implies to me that the
> shrinker is telling the caller about something it would do if called as
> opposed to explicitly telling the caller to defer. I'd just call it
> 'defer' I guess, but that's just my .02. ;P

Ok, I'll change it to something like "defer_work" or "defer_scan"
here.

> >  };
> >  
> >  #define SHRINK_STOP (~0UL)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 44df66a98f2a..ae3035fe94bc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> >  				   freeable, delta, total_scan, priority);
> >  
> > +	/*
> > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > +	 * defer the work to a context that can scan the cache.
> > +	 */
> > +	if (shrinkctl->will_defer)
> > +		goto done;
> > +
> 
> Who's responsible for clearing the flag? Perhaps we should do so here
> once it's acted upon since we don't call into the shrinker again?

Each shrinker invocation has its own shrink_control context - they
are not shared between shrinkers - the higher level is responsible
for setting up the control state of each individual shrinker
invocation...
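
i.e. roughly this, trimmed from shrink_slab() (so the exact surroundings may
differ a little):

	list_for_each_entry(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		/*
		 * A fresh, zero-initialised sc for every shrinker, so
		 * state like ->will_defer can't leak from one invocation
		 * to the next.
		 */
		ret = do_shrink_slab(&sc, shrinker, priority);
		if (ret == SHRINK_EMPTY)
			ret = 0;
		freed += ret;
	}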

> Note that I see this structure is reinitialized on every iteration in
> the caller, but there already is the SHRINK_EMPTY case where we call
> back into do_shrink_slab().

.... because there is external state tracking in memcgs that
determines which shrinkers get run. See shrink_slab_memcg().

i.e. The SHRINK_EMPTY return value is a special hack for memcg
shrinkers so it can track whether there are freeable objects in the
cache externally to try to avoid calling into shrinkers where no
work can be done.  Think about having hundreds of shrinkers and
hundreds of memcgs...

Anyway, the tracking of the freeable bit is racy, so the
SHRINK_EMPTY hack where it clears the bit and calls back into the
shrinker is handling the case where objects were freed between the
shrinker running and shrink_slab_memcg() clearing the freeable bit
from the slab. Hence it has to call back into the shrinker again -
if it gets anything other than SHRINK_EMPTY returned, then it will
set the bit again.

In reality, SHRINK_EMPTY and deferring work are mutually exclusive.
Work only gets deferred when there's work that can be done and in
that case SHRINK_EMPTY will not be returned - a value of "0 freed
objects" will be returned when we defer work. So if the first call
returns SHRINK_EMPTY, the "defer" state has not been touched and
so doesn't require resetting to zero here.
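
The count side then looks like the hunks in patch 02, e.g. this sketch
(example_lru is a stand-in name):

static unsigned long
example_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
	if (!(sc->gfp_mask & __GFP_FS))
		sc->will_defer = true;	/* work may exist, but not for this context */

	/*
	 * If the LRU is empty this returns 0 and do_shrink_slab() bails
	 * out before it ever looks at ->will_defer, so "empty" and
	 * "deferred work" can never be reported together.
	 */
	return list_lru_shrink_count(&example_lru, sc);
}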

> Granted the deferred state likely hasn't
> changed, but the fact that we'd call back into the count callback to set
> it again implies the logic could be a bit more explicit, particularly if
> this will eventually be used for more dynamic shrinker state that might
> change call to call (i.e., object dirty state, etc.).
> 
> BTW, do we need to care about the ->nr_cached_objects() call from the
> generic superblock shrinker (super_cache_scan())?

No, and we never had to because it is inside the superblock shrinker
and the superblock shrinker does the GFP_NOFS context checks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers
  2019-08-02 15:27   ` Brian Foster
@ 2019-08-04  1:50     ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-04  1:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Fri, Aug 02, 2019 at 11:27:37AM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:30PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > For shrinkers that currently avoid scanning when called under
> > GFP_NOFS contexts, convert them to use the new ->will_defer flag
> > rather than checking and returning errors during scans.
> > 
> > This makes it very clear that these shrinkers are not doing any work
> > because of the context limitations, not because there is no work
> > that can be done.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  drivers/staging/android/ashmem.c |  8 ++++----
> >  fs/gfs2/glock.c                  |  5 +++--
> >  fs/gfs2/quota.c                  |  6 +++---
> >  fs/nfs/dir.c                     |  6 +++---
> >  fs/super.c                       |  6 +++---
> >  fs/xfs/xfs_buf.c                 |  4 ++++
> >  fs/xfs/xfs_qm.c                  | 11 ++++++++---
> >  net/sunrpc/auth.c                |  5 ++---
> >  8 files changed, 30 insertions(+), 21 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index ca0849043f54..6e0f76532535 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1680,6 +1680,10 @@ xfs_buftarg_shrink_count(
> >  {
> >  	struct xfs_buftarg	*btp = container_of(shrink,
> >  					struct xfs_buftarg, bt_shrinker);
> > +
> > +	if (!(sc->gfp_mask & __GFP_FS))
> > +		sc->will_defer = true;
> > +
> >  	return list_lru_shrink_count(&btp->bt_lru, sc);
> >  }
> 
> This hunk looks like a behavior change / bug fix..? The rest of the

Yeah, forgot to move that to the patch that fixes the accounting for
the xfs_buf cache later on in the series. Will fix.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 03/24] mm: factor shrinker work calculations
  2019-08-02 15:08   ` Nikolay Borisov
@ 2019-08-04  2:05     ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-04  2:05 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Fri, Aug 02, 2019 at 06:08:37PM +0300, Nikolay Borisov wrote:
> 
> 
> On 1.08.19 г. 5:17 ч., Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Start to clean up the shrinker code by factoring out the calculation
> > that determines how much work to do. This separates the calculation
> > from clamping and other adjustments that are done before the
> > shrinker work is run.
> > 
> > Also convert the calculation for the amount of work to be done to
> > use 64 bit logic so we don't have to keep jumping through hoops to
> > keep calculations within 32 bits on 32 bit systems.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  mm/vmscan.c | 74 ++++++++++++++++++++++++++++++++++-------------------
> >  1 file changed, 47 insertions(+), 27 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ae3035fe94bc..b7472953b0e6 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -464,13 +464,45 @@ EXPORT_SYMBOL(unregister_shrinker);
> >  
> >  #define SHRINK_BATCH 128
> >  
> > +/*
> > + * Calculate the number of new objects to scan this time around. Return
> > + * the work to be done. If there are freeable objects, return that number in
> > + * @freeable_objects.
> > + */
> > +static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
> > +			    struct shrinker *shrinker, int priority,
> > +			    int64_t *freeable_objects)
> 
> nit: make the return parm definition also uint64_t; also, we have u64 types.

SHRINK_EMPTY is actually a negative number (-2), and it gets whacked
back into a signed long value in the caller. So returning a signed
integer is actually correct.

> > +{
> > +	uint64_t delta;
> > +	uint64_t freeable;
> > +
> > +	freeable = shrinker->count_objects(shrinker, shrinkctl);
> > +	if (freeable == 0 || freeable == SHRINK_EMPTY)
> > +		return freeable;
> > +
> > +	if (shrinker->seeks) {
> > +		delta = freeable >> (priority - 2);
> > +		do_div(delta, shrinker->seeks);
> 
> a comment about the reasoning behind this calculation would be nice.

I'm just moving code here.

The reason for this calculation requires an awfully long description
that isn't actually appropriate here or in this patch set.

If there should be any comment describing how shrinker work biasing
should be configured, it needs to be in include/linux/shrinker.h
around the definition of DEFAULT_SEEKS and shrinker->seeks as this
code requires shrinker->seeks to be configured appropriately by the
code that registers the shrinker.
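
i.e. the knob lives entirely at registration time, something like this
(example_count/example_scan are made-up stand-ins):

static struct shrinker example_shrinker = {
	.count_objects	= example_count,
	.scan_objects	= example_scan,
	.seeks		= DEFAULT_SEEKS,	/* relative cost (in "seeks") to recreate an object */
	.batch		= 0,			/* 0 means use SHRINK_BATCH */
};

static int __init example_init(void)
{
	return register_shrinker(&example_shrinker);
}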

> >  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  				    struct shrinker *shrinker, int priority)
> >  {
> >  	unsigned long freed = 0;
> > -	unsigned long long delta;
> >  	long total_scan;
> > -	long freeable;
> > +	int64_t freeable_objects = 0;
> > +	int64_t scan_count;
> 
> why int and not uint64? We can never have a negative object count, right?

SHRINK_STOP, SHRINK_EMPTY are negative numbers, and the higher level
interface uses longs, not unsigned longs. So we have to treat
numbers greater than LONG_MAX as invalid for object/scan counts.
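
For reference, the definitions in include/linux/shrinker.h:

#define SHRINK_STOP	(~0UL)		/* -1 when read as a signed long */
#define SHRINK_EMPTY	(~0UL - 1)	/* -2 when read as a signed long */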

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 04/24] shrinker: defer work only to kswapd
  2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
  2019-08-02 15:34   ` Brian Foster
@ 2019-08-04 16:48   ` Nikolay Borisov
  2019-08-04 21:37     ` Dave Chinner
  2019-08-07 16:12   ` kbuild test robot
  2019-08-07 18:00   ` kbuild test robot
  3 siblings, 1 reply; 87+ messages in thread
From: Nikolay Borisov @ 2019-08-04 16:48 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: linux-mm, linux-fsdevel



On 1.08.19 г. 5:17 ч., Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Right now deferred work is picked up by whatever GFP_KERNEL context
> reclaimer that wins the race to empty the node's deferred work
> counter. However, if there are lots of direct reclaimers, that
> work might be continually picked up by contexts that can't do any
> work and so the opportunities to do the work are missed by contexts
> that could do them.
> 
> A further problem with the current code is that the deferred work
> can be picked up by a random direct reclaimer, resulting in that
> specific process having to do all the deferred reclaim work and
> hence can take extremely long latencies if the reclaim work blocks
> regularly. This is not good for direct reclaim fairness or for
> minimising long tail latency events.
> 
> To avoid these problems, simply limit deferred work to kswapd
> contexts. We know kswapd is a context that can always do reclaim
> work, and hence deferring work to kswapd allows the deferred work to
> be done in the background and not adversely affect any specific
> process context doing direct reclaim.
> 
> The advantage of this is that amount of work to be done in direct
> reclaim is now bound and predictable - it is entirely based on
> the cache's freeable objects and the reclaim priority. hence all
> direct reclaimers running at the same time should be doing
> relatively equal amounts of work, thereby reducing the incidence of
> long tail latencies due to uneven reclaim workloads.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/vmscan.c | 93 ++++++++++++++++++++++++++++-------------------------
>  1 file changed, 50 insertions(+), 43 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b7472953b0e6..c583b4efb9bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -500,15 +500,15 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  				    struct shrinker *shrinker, int priority)
>  {
>  	unsigned long freed = 0;
> -	long total_scan;
>  	int64_t freeable_objects = 0;
>  	int64_t scan_count;
> -	long nr;
> +	int64_t scanned_objects = 0;
> +	int64_t next_deferred = 0;
> +	int64_t deferred_count = 0;
>  	long new_nr;
>  	int nid = shrinkctl->nid;
>  	long batch_size = shrinker->batch ? shrinker->batch
>  					  : SHRINK_BATCH;
> -	long scanned = 0, next_deferred;
>  
>  	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>  		nid = 0;
> @@ -519,47 +519,53 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		return scan_count;
>  
>  	/*
> -	 * copy the current shrinker scan count into a local variable
> -	 * and zero it so that other concurrent shrinker invocations
> -	 * don't also do this scanning work.
> +	 * If kswapd, we take all the deferred work and do it here. We don't let
> +	 * direct reclaim do this, because then it means some poor sod is going
> +	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
> +	 * amount of reclaim work from concurrent kswapd operations. Hence we do
> +	 * the work in the wrong place, at the wrong time, and it's largely
> +	 * unpredictable.
> +	 *
> +	 * By doing the deferred work only in kswapd, we can schedule the work
> +	 * according to the reclaim priority - low priority reclaim will do
> +	 * less deferred work, hence we'll do more of the deferred work the more
> +	 * desperate we become for free memory. This avoids the need to
> +	 * specifically avoid deferred work windup, as low amounts of memory
> +	 * pressure won't excessively trim caches anymore.
>  	 */
> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +	if (current_is_kswapd()) {
> +		int64_t	deferred_scan;
>  
> -	total_scan = nr + scan_count;
> -	if (total_scan < 0) {
> -		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> -		       shrinker->scan_objects, total_scan);
> -		total_scan = scan_count;
> -		next_deferred = nr;
> -	} else
> -		next_deferred = total_scan;
> +		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
>  
> -	/*
> -	 * We need to avoid excessive windup on filesystem shrinkers
> -	 * due to large numbers of GFP_NOFS allocations causing the
> -	 * shrinkers to return -1 all the time. This results in a large
> -	 * nr being built up so when a shrink that can do some work
> -	 * comes along it empties the entire cache due to nr >>>
> -	 * freeable. This is bad for sustaining a working set in
> -	 * memory.
> -	 *
> -	 * Hence only allow the shrinker to scan the entire cache when
> -	 * a large delta change is calculated directly.
> -	 */
> -	if (scan_count < freeable_objects / 4)
> -		total_scan = min_t(long, total_scan, freeable_objects / 2);
> +		/* we want to scan 5-10% of the deferred work here at minimum */
> +		deferred_scan = deferred_count;
> +		if (priority)
> +			do_div(deferred_scan, priority);
> +		scan_count += deferred_scan;
> +
> +		/*
> +		 * If there is more deferred work than the number of freeable
> +		 * items in the cache, limit the amount of work we will carry
> +		 * over to the next kswapd run on this cache. This prevents
> +		 * deferred work windup.
> +		 */
> +		if (deferred_count > freeable_objects * 2)
> +			deferred_count = freeable_objects * 2;

nit: deferred_count = min(deferred_count, freeable_objects * 2).

How can we have more deferred objects than are currently on the LRU?
Aren't deferred objects always some part of freeable objects? Shouldn't
this mean that for a particular shrinker deferred_count <= freeable_objects?

> +
> +	}
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
>  	 * never try to free more than twice the estimate number of
>  	 * freeable entries.
>  	 */
> -	if (total_scan > freeable_objects * 2)
> -		total_scan = freeable_objects * 2;
> +	if (scan_count > freeable_objects * 2)
> +		scan_count = freeable_objects * 2;

nit: scan_count = min(scan_count, freeable_objects * 2);
>  
> -	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> +	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
>  				   freeable_objects, scan_count,
> -				   total_scan, priority);
> +				   scan_count, priority);
>  
>  	/*
>  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> @@ -583,10 +589,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * scanning at high prio and therefore should try to reclaim as much as
>  	 * possible.
>  	 */
> -	while (total_scan >= batch_size ||
> -	       total_scan >= freeable_objects) {
> +	while (scan_count >= batch_size ||
> +	       scan_count >= freeable_objects) {
>  		unsigned long ret;
> -		unsigned long nr_to_scan = min(batch_size, total_scan);
> +		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
>  
>  		shrinkctl->nr_to_scan = nr_to_scan;
>  		shrinkctl->nr_scanned = nr_to_scan;
> @@ -596,17 +602,17 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		freed += ret;
>  
>  		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
> -		total_scan -= shrinkctl->nr_scanned;
> -		scanned += shrinkctl->nr_scanned;
> +		scan_count -= shrinkctl->nr_scanned;
> +		scanned_objects += shrinkctl->nr_scanned;
>  
>  		cond_resched();
>  	}
>  
>  done:
> -	if (next_deferred >= scanned)
> -		next_deferred -= scanned;
> -	else
> -		next_deferred = 0;
> +	if (deferred_count)
> +		next_deferred = deferred_count - scanned_objects;
> +	else if (scan_count > 0)
> +		next_deferred = scan_count;
>  	/*
>  	 * move the unused scan count back into the shrinker in a
>  	 * manner that handles concurrent updates. If we exhausted the
> @@ -618,7 +624,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	else
>  		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>  
> -	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
> +	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
> +					scan_count);
>  	return freed;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 16/24] xfs: Lower CIL flush limit for large logs
  2019-08-01  2:17 ` [PATCH 16/24] xfs: Lower CIL flush limit for large logs Dave Chinner
@ 2019-08-04 17:12   ` Nikolay Borisov
  0 siblings, 0 replies; 87+ messages in thread
From: Nikolay Borisov @ 2019-08-04 17:12 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: linux-mm, linux-fsdevel



On 1.08.19 г. 5:17 ч., Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The current CIL size aggregation limit is 1/8th the log size. This
> means for large logs we might be aggregating at least 250MB of dirty objects
> in memory before the CIL is flushed to the journal. With CIL shadow
> buffers sitting around, this means the CIL is often consuming >500MB
> of temporary memory that is all allocated under GFP_NOFS conditions.
> 
> Flushing the CIL can take some time to do if there is other IO
> ongoing, and can introduce substantial log force latency by itself.
> It also pins the memory until the objects are in the AIL and can be
> written back and reclaimed by shrinkers. Hence this threshold also
> tends to determine the minimum amount of memory XFS can operate in
> under heavy modification without triggering the OOM killer.
> 
> Modify the CIL space limit to prevent such huge amounts of pinned
> metadata from aggregating. We can 2MB of log IO in flight at once,
There is a word missing between 'can' and '2MB'
> so limit aggregation to 8x this size (arbitrary). This has some
> impact on performance (5-10% decrease on 16-way fsmark) and
> increases the amount of log traffic (~50% on same workload) but it
> is necessary to prevent rampant OOM killing under workloads that
> modify large amounts of metadata under heavy memory pressure.
> 
> This was found via trace analysis or AIL behaviour. e.g. insertion
s/or/of/

> from a single CIL flush:

<snip>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 04/24] shrinker: defer work only to kswapd
  2019-08-04 16:48   ` Nikolay Borisov
@ 2019-08-04 21:37     ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-04 21:37 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Sun, Aug 04, 2019 at 07:48:01PM +0300, Nikolay Borisov wrote:
> 
> 
> On 1.08.19 г. 5:17 ч., Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Right now deferred work is picked up by whatever GFP_KERNEL context
> > reclaimer that wins the race to empty the node's deferred work
> > counter. However, if there are lots of direct reclaimers, that
> > work might be continually picked up by contexts that can't do any
> > work and so the opportunities to do the work are missed by contexts
> > that could do them.
> > 
> > A further problem with the current code is that the deferred work
> > can be picked up by a random direct reclaimer, resulting in that
> > specific process having to do all the deferred reclaim work and
> > hence can take extremely long latencies if the reclaim work blocks
> > regularly. This is not good for direct reclaim fairness or for
> > minimising long tail latency events.
> > 
> > To avoid these problems, simply limit deferred work to kswapd
> > contexts. We know kswapd is a context that can always do reclaim
> > work, and hence deferring work to kswapd allows the deferred work to
> > be done in the background and not adversely affect any specific
> > process context doing direct reclaim.
> > 
> > The advantage of this is that amount of work to be done in direct
> > reclaim is now bound and predictable - it is entirely based on
> > the cache's freeable objects and the reclaim priority. hence all
> > direct reclaimers running at the same time should be doing
> > relatively equal amounts of work, thereby reducing the incidence of
> > long tail latencies due to uneven reclaim workloads.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  mm/vmscan.c | 93 ++++++++++++++++++++++++++++-------------------------
> >  1 file changed, 50 insertions(+), 43 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index b7472953b0e6..c583b4efb9bf 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -500,15 +500,15 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  				    struct shrinker *shrinker, int priority)
> >  {
> >  	unsigned long freed = 0;
> > -	long total_scan;
> >  	int64_t freeable_objects = 0;
> >  	int64_t scan_count;
> > -	long nr;
> > +	int64_t scanned_objects = 0;
> > +	int64_t next_deferred = 0;
> > +	int64_t deferred_count = 0;
> >  	long new_nr;
> >  	int nid = shrinkctl->nid;
> >  	long batch_size = shrinker->batch ? shrinker->batch
> >  					  : SHRINK_BATCH;
> > -	long scanned = 0, next_deferred;
> >  
> >  	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> >  		nid = 0;
> > @@ -519,47 +519,53 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  		return scan_count;
> >  
> >  	/*
> > -	 * copy the current shrinker scan count into a local variable
> > -	 * and zero it so that other concurrent shrinker invocations
> > -	 * don't also do this scanning work.
> > +	 * If kswapd, we take all the deferred work and do it here. We don't let
> > +	 * direct reclaim do this, because then it means some poor sod is going
> > +	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
> > +	 * amount of reclaim work from concurrent kswapd operations. Hence we do
> > +	 * the work in the wrong place, at the wrong time, and it's largely
> > +	 * unpredictable.
> > +	 *
> > +	 * By doing the deferred work only in kswapd, we can schedule the work
> > +	 * according to the reclaim priority - low priority reclaim will do
> > +	 * less deferred work, hence we'll do more of the deferred work the more
> > +	 * desperate we become for free memory. This avoids the need to
> > +	 * specifically avoid deferred work windup, as low amounts of memory
> > +	 * pressure won't excessively trim caches anymore.
> >  	 */
> > -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> > +	if (current_is_kswapd()) {
> > +		int64_t	deferred_scan;
> >  
> > -	total_scan = nr + scan_count;
> > -	if (total_scan < 0) {
> > -		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> > -		       shrinker->scan_objects, total_scan);
> > -		total_scan = scan_count;
> > -		next_deferred = nr;
> > -	} else
> > -		next_deferred = total_scan;
> > +		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
> >  
> > -	/*
> > -	 * We need to avoid excessive windup on filesystem shrinkers
> > -	 * due to large numbers of GFP_NOFS allocations causing the
> > -	 * shrinkers to return -1 all the time. This results in a large
> > -	 * nr being built up so when a shrink that can do some work
> > -	 * comes along it empties the entire cache due to nr >>>
> > -	 * freeable. This is bad for sustaining a working set in
> > -	 * memory.
> > -	 *
> > -	 * Hence only allow the shrinker to scan the entire cache when
> > -	 * a large delta change is calculated directly.
> > -	 */
> > -	if (scan_count < freeable_objects / 4)
> > -		total_scan = min_t(long, total_scan, freeable_objects / 2);
> > +		/* we want to scan 5-10% of the deferred work here at minimum */
> > +		deferred_scan = deferred_count;
> > +		if (priority)
> > +			do_div(deferred_scan, priority);
> > +		scan_count += deferred_scan;
> > +
> > +		/*
> > +		 * If there is more deferred work than the number of freeable
> > +		 * items in the cache, limit the amount of work we will carry
> > +		 * over to the next kswapd run on this cache. This prevents
> > +		 * deferred work windup.
> > +		 */
> > +		if (deferred_count > freeable_objects * 2)
> > +			deferred_count = freeable_objects * 2;
> 
> nit : deferred_count = min(deferred_count, freeable_objects * 2).

*nod*

> How can we have more deferred objects than are currently on the LRU?

deferred work is aggregated. Put enough direct reclaimers in action
in GFP_NOFS context (e.g. fsmark create workload) and it will wind
up the deferred count much faster than kswapd can drain it.

> Aren't deferred objects always some part of freeable objects.

For a single scan, yes. In aggregate, no.
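
To put rough (purely illustrative) numbers on that: if 32 direct
reclaimers running in GFP_NOFS context each defer ~64k objects per
pass, nr_deferred grows by ~2M per round, while a single kswapd pass
at priority 12 only drains deferred_count / 12 of whatever has
accumulated. A cache with, say, 500k freeable objects can therefore
end up with an aggregate deferred count several times larger than
freeable_objects within a few rounds, which is what the clamp to
freeable_objects * 2 is bounding.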

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 01/24] mm: directed shrinker work deferral
  2019-08-04  1:49     ` Dave Chinner
@ 2019-08-05 17:42       ` Brian Foster
  2019-08-05 23:43         ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-05 17:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Sun, Aug 04, 2019 at 11:49:30AM +1000, Dave Chinner wrote:
> On Fri, Aug 02, 2019 at 11:27:09AM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Introduce a mechanism for ->count_objects() to indicate to the
> > > shrinker infrastructure that the reclaim context will not allow
> > > scanning work to be done and so the work it decides is necessary
> > > needs to be deferred.
> > > 
> > > This simplifies the code by separating out the accounting of
> > > deferred work from the actual doing of the work, and allows better
> > > decisions to be made by the shrinker control logic on what action it
> > > can take.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  include/linux/shrinker.h | 7 +++++++
> > >  mm/vmscan.c              | 8 ++++++++
> > >  2 files changed, 15 insertions(+)
> > > 
> > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > index 9443cafd1969..af78c475fc32 100644
> > > --- a/include/linux/shrinker.h
> > > +++ b/include/linux/shrinker.h
> > > @@ -31,6 +31,13 @@ struct shrink_control {
> > >  
> > >  	/* current memcg being shrunk (for memcg aware shrinkers) */
> > >  	struct mem_cgroup *memcg;
> > > +
> > > +	/*
> > > +	 * set by ->count_objects if reclaim context prevents reclaim from
> > > +	 * occurring. This allows the shrinker to immediately defer all the
> > > +	 * work and not even attempt to scan the cache.
> > > +	 */
> > > +	bool will_defer;
> > 
> > Functionality wise this seems fairly straightforward. FWIW, I find the
> > 'will_defer' name a little confusing because it implies to me that the
> > shrinker is telling the caller about something it would do if called as
> > opposed to explicitly telling the caller to defer. I'd just call it
> > 'defer' I guess, but that's just my .02. ;P
> 
> Ok, I'll change it to something like "defer_work" or "defer_scan"
> here.
> 

Either sounds better to me, thanks.

> > >  };
> > >  
> > >  #define SHRINK_STOP (~0UL)
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 44df66a98f2a..ae3035fe94bc 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > >  				   freeable, delta, total_scan, priority);
> > >  
> > > +	/*
> > > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > > +	 * defer the work to a context that can scan the cache.
> > > +	 */
> > > +	if (shrinkctl->will_defer)
> > > +		goto done;
> > > +
> > 
> > Who's responsible for clearing the flag? Perhaps we should do so here
> > once it's acted upon since we don't call into the shrinker again?
> 
> Each shrinker invocation has it's own shrink_control context - they
> are not shared between shrinkers - the higher level is responsible
> for setting up the control state of each individual shrinker
> invocation...
> 

Yes, but more specifically, it appears to me that each level is
responsible for setting up control state managed by that level. E.g.,
shrink_slab_memcg() initializes the unchanging state per iteration and
do_shrink_slab() (re)sets the scan state prior to ->scan_objects().

> > Note that I see this structure is reinitialized on every iteration in
> > the caller, but there already is the SHRINK_EMPTY case where we call
> > back into do_shrink_slab().
> 
> .... because there is external state tracking in memcgs that
> determine what shrinkers get run. See shrink_slab_memcg().
> 
> i.e. The SHRINK_EMPTY return value is a special hack for memcg
> shrinkers so it can track whether there are freeable objects in the
> cache externally to try to avoid calling into shrinkers where no
> work can be done.  Think about having hundreds of shrinkers and
> hundreds of memcgs...
> 
> Anyway, the tracking of the freeable bit is racy, so the
> SHRINK_EMPTY hack where it clears the bit and calls back into the
> shrinker is handling the case where objects were freed between the
> shrinker running and shrink_slab_memcg() clearing the freeable bit
> from the slab. Hence it has to call back into the shrinker again -
> if it gets anything other than SHRINK_EMPTY returned, then it will
> set the bit again.
> 

Yeah, I grokked most of that from the code. The current implementation
looks fine to me, but I could easily see how changes in the higher level
do_shrink_slab() caller(s) or lower level shrinker callbacks could
quietly break this in the future. IOW, once this code hits the tree any
shrinker across the kernel is free to try and defer slab reclaim work
for any reason.

> In reality, SHRINK_EMPTY and deferring work are mutually exclusive.
> Work only gets deferred when there's work that can be done and in
> that case SHRINK_EMPTY will not be returned - a value of "0 freed
> objects" will be returned when we defer work. So if the first call
> returns SHRINK_EMPTY, the "defer" state has not been touched and
> so doesn't require resetting to zero here.
> 

Yep. The high level semantics make sense, but note that the generic
superblock shrinker can now set ->will_defer true and return
SHRINK_EMPTY, so the last bit about defer state not being touched is not
technically true.

> > Granted the deferred state likely hasn't
> > changed, but the fact that we'd call back into the count callback to set
> > it again implies the logic could be a bit more explicit, particularly if
> > this will eventually be used for more dynamic shrinker state that might
> > change call to call (i.e., object dirty state, etc.).
> > 
> > BTW, do we need to care about the ->nr_cached_objects() call from the
> > generic superblock shrinker (super_cache_scan())?
> 
> No, and we never had to because it is inside the superblock shrinker
> and the superblock shrinker does the GFP_NOFS context checks.
> 

Ok. Though tbh this topic has me wondering whether a shrink_control
boolean is the right approach here. Do you envision ->will_defer being
used for anything other than allocation context restrictions? If not,
perhaps we should do something like optionally set alloc flags required
for direct scanning in the struct shrinker itself and let the core
shrinker code decide when to defer to kswapd based on the shrink_control
flags and the current shrinker. That way an arbitrary shrinker can't
muck around with core behavior in unintended ways. Hm?
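
For illustration only, a minimal sketch of that alternative (the
scan_gfp_mask field and the helper are hypothetical, not existing API):

	/*
	 * Hypothetical: the shrinker declares the allocation context it
	 * needs for direct scanning (e.g. GFP_FS for filesystem
	 * shrinkers), and the core code decides whether to defer to
	 * kswapd, instead of each ->count_objects() callback setting a
	 * flag in the shrink_control.
	 */
	static bool shrinker_should_defer(struct shrink_control *sc,
					  struct shrinker *shrinker)
	{
		/* scan_gfp_mask would be a new field in struct shrinker */
		return (sc->gfp_mask & shrinker->scan_gfp_mask) !=
						shrinker->scan_gfp_mask;
	}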

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 13/24] xfs: synchronous AIL pushing
  2019-08-01  2:17 ` [PATCH 13/24] xfs: synchronous AIL pushing Dave Chinner
@ 2019-08-05 17:51   ` Brian Foster
  2019-08-05 23:21     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-05 17:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:41PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Provide an interface to push the AIL to a target LSN and wait for
> the tail of the log to move past that LSN. This is used to wait for
> all items older than a specific LSN to either be cleaned (written
> back) or relogged to a higher LSN in the AIL. The primary use for
> this is to allow IO free inode reclaim throttling.
> 
> Factor the common AIL deletion code that does all the wakeups into a
> helper so we only have one copy of this somewhat tricky code to
> interface with all the wakeups necessary when the LSN of the log
> tail changes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode_item.c | 12 +------
>  fs/xfs/xfs_trans_ail.c  | 69 +++++++++++++++++++++++++++++++++--------
>  fs/xfs/xfs_trans_priv.h |  6 +++-
>  3 files changed, 62 insertions(+), 25 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 6ccfd75d3c24..9e3102179221 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -654,6 +654,37 @@ xfs_ail_push_all(
>  		xfs_ail_push(ailp, threshold_lsn);
>  }
>  
> +/*
> + * Push the AIL to a specific lsn and wait for it to complete.
> + */
> +void
> +xfs_ail_push_sync(
> +	struct xfs_ail		*ailp,
> +	xfs_lsn_t		threshold_lsn)
> +{
> +	struct xfs_log_item	*lip;
> +	DEFINE_WAIT(wait);
> +
> +	spin_lock(&ailp->ail_lock);
> +	while ((lip = xfs_ail_min(ailp)) != NULL) {
> +		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> +		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> +		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
> +			break;
> +		/* XXX: cmpxchg? */
> +		while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
> +			xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn);

Why the need to repeatedly copy the ail_target like this? If the push
target only ever moves forward, we should only need to do this once at
the start of the function. In fact I'm kind of wondering why we can't
just call xfs_ail_push(). If we check the tail item after grabbing the
spin lock, we should be able to avoid any races with the waker, no?

> +		wake_up_process(ailp->ail_task);
> +		spin_unlock(&ailp->ail_lock);
> +		schedule();
> +		spin_lock(&ailp->ail_lock);
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +
> +	finish_wait(&ailp->ail_push, &wait);
> +}
> +
> +
>  /*
>   * Push out all items in the AIL immediately and wait until the AIL is empty.
>   */
> @@ -764,6 +795,28 @@ xfs_ail_delete_one(
>  	return mlip == lip;
>  }
>  
> +void
> +xfs_ail_delete_finish(
> +	struct xfs_ail		*ailp,
> +	bool			do_tail_update) __releases(ailp->ail_lock)
> +{
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +
> +	if (!do_tail_update) {
> +		spin_unlock(&ailp->ail_lock);
> +		return;
> +	}
> +

Hmm.. so while what we really care about here are tail updates, this
logic is currently driven by removing the min ail log item. That seems
like a lot of potential churn if we're waking the pusher on every object
written back covered by a single log record / checkpoint. Perhaps we
should implement a bit more coarse wakeup logic such as only when the
tail lsn actually changes, for example?

FWIW, it also doesn't look like you've handled the case of relogged
items moving the tail forward anywhere that I can see, so we might be
missing some wakeups here. See xfs_trans_ail_update_bulk() for
additional AIL manipulation.

> +	if (!XFS_FORCED_SHUTDOWN(mp))
> +		xlog_assign_tail_lsn_locked(mp);
> +
> +	wake_up_all(&ailp->ail_push);
> +	if (list_empty(&ailp->ail_head))
> +		wake_up_all(&ailp->ail_empty);
> +	spin_unlock(&ailp->ail_lock);
> +	xfs_log_space_wake(mp);
> +}
> +
>  /**
>   * Remove a log items from the AIL
>   *
> @@ -789,10 +842,9 @@ void
>  xfs_trans_ail_delete(
>  	struct xfs_ail		*ailp,
>  	struct xfs_log_item	*lip,
> -	int			shutdown_type) __releases(ailp->ail_lock)
> +	int			shutdown_type)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
> -	bool			mlip_changed;
>  
>  	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
>  		spin_unlock(&ailp->ail_lock);
> @@ -805,17 +857,7 @@ xfs_trans_ail_delete(
>  		return;
>  	}
>  
> -	mlip_changed = xfs_ail_delete_one(ailp, lip);
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(mp))
> -			xlog_assign_tail_lsn_locked(mp);
> -		if (list_empty(&ailp->ail_head))
> -			wake_up_all(&ailp->ail_empty);
> -	}
> -
> -	spin_unlock(&ailp->ail_lock);
> -	if (mlip_changed)
> -		xfs_log_space_wake(ailp->ail_mount);
> +	xfs_ail_delete_finish(ailp, xfs_ail_delete_one(ailp, lip));

Nit, but I'm not a fan of the function call buried in a function call
parameter pattern. I tend to read over it at a glance so to me it's not
worth the line of code it saves.

Brian

>  }
>  
>  int
> @@ -834,6 +876,7 @@ xfs_trans_ail_init(
>  	spin_lock_init(&ailp->ail_lock);
>  	INIT_LIST_HEAD(&ailp->ail_buf_list);
>  	init_waitqueue_head(&ailp->ail_empty);
> +	init_waitqueue_head(&ailp->ail_push);
>  
>  	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
>  			ailp->ail_mount->m_fsname);
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614..5ab70b9b896f 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -61,6 +61,7 @@ struct xfs_ail {
>  	int			ail_log_flush;
>  	struct list_head	ail_buf_list;
>  	wait_queue_head_t	ail_empty;
> +	wait_queue_head_t	ail_push;
>  };
>  
>  /*
> @@ -92,8 +93,10 @@ xfs_trans_ail_update(
>  }
>  
>  bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_delete_finish(struct xfs_ail *ailp, bool do_tail_update)
> +			__releases(ailp->ail_lock);
>  void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
> -		int shutdown_type) __releases(ailp->ail_lock);
> +		int shutdown_type);
>  
>  static inline void
>  xfs_trans_ail_remove(
> @@ -111,6 +114,7 @@ xfs_trans_ail_remove(
>  }
>  
>  void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
> +void			xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t);
>  void			xfs_ail_push_all(struct xfs_ail *);
>  void			xfs_ail_push_all_sync(struct xfs_ail *);
>  struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
> -- 
> 2.22.0
> 


* Re: [PATCH 14/24] xfs: tail updates only need to occur when LSN changes
  2019-08-01  2:17 ` [PATCH 14/24] xfs: tail updates only need to occur when LSN changes Dave Chinner
@ 2019-08-05 17:53   ` Brian Foster
  2019-08-05 23:28     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-05 17:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:42PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently wake anything waiting on the log tail to move whenever
> the log item at the tail of the log is removed. Historically this
> was fine behaviour because there were very few items at any given
> LSN. But with delayed logging, there may be thousands of items at
> any given LSN, and we can't move the tail until they are all gone.
> 
> Hence if we are removing them in near tail-first order, we might be
> waking up processes waiting on the tail LSN to change (e.g. log
> space waiters) repeatedly without them being able to make progress.
> This also occurs with the new sync push waiters, and can result in
> thousands of spurious wakeups every second when under heavy direct
> reclaim pressure.
> 
> To fix this, check that the tail LSN has actually changed on the
> AIL before triggering wakeups. This will reduce the number of
> spurious wakeups when doing bulk AIL removal and make this code much
> more efficient.
> 
> XXX: occasionally get a temporary hang in xfs_ail_push_sync() with
> this change - log force from log worker gets things moving again.
> Only happens under extreme memory pressure - possibly push racing
> with a tail update on an empty log. Needs further investigation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Ok, this addresses the wakeup granularity issue mentioned in the
previous patch. Note that I was kind of wondering why we wouldn't base
this on the l_tail_lsn update in xlog_assign_tail_lsn_locked() as
opposed to the current approach.

For example, xlog_assign_tail_lsn_locked() could simply check the
current min item against the current l_tail_lsn before it does the
assignment and use that to trigger tail change events. If we wanted to
also filter out the other wakeups (as this patch does) then we could
just pass a bool pointer or something that returns whether the tail
actually changed.
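
Purely as a sketch of that idea (the bool out-parameter is an
assumption for illustration; the body paraphrases the existing tail
assignment logic rather than quoting it):

	xfs_lsn_t
	xlog_assign_tail_lsn_locked(
		struct xfs_mount	*mp,
		bool			*tail_changed)
	{
		struct xlog		*log = mp->m_log;
		struct xfs_log_item	*lip = xfs_ail_min(mp->m_ail);
		xfs_lsn_t		tail_lsn;

		/* an empty AIL falls back to the last sync lsn */
		if (lip)
			tail_lsn = lip->li_lsn;
		else
			tail_lsn = atomic64_read(&log->l_last_sync_lsn);

		/* report whether the published tail actually moves */
		if (tail_changed)
			*tail_changed =
				(tail_lsn != atomic64_read(&log->l_tail_lsn));
		atomic64_set(&log->l_tail_lsn, tail_lsn);
		return tail_lsn;
	}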

Brian

>  fs/xfs/xfs_inode_item.c | 18 +++++++++++++-----
>  fs/xfs/xfs_trans_ail.c  | 37 ++++++++++++++++++++++++++++---------
>  fs/xfs/xfs_trans_priv.h |  4 ++--
>  3 files changed, 43 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 7b942a63e992..16a7d6f752c9 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -731,19 +731,27 @@ xfs_iflush_done(
>  	 * holding the lock before removing the inode from the AIL.
>  	 */
>  	if (need_ail) {
> -		bool			mlip_changed = false;
> +		xfs_lsn_t	tail_lsn = 0;
>  
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(blip, &tmp, li_bio_list) {
>  			if (INODE_ITEM(blip)->ili_logged &&
> -			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
> -				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> -			else {
> +			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
> +				/*
> +				 * xfs_ail_delete_finish() only cares about the
> +				 * lsn of the first tail item removed, any others
> +				 * will be at the same or higher lsn so we just
> +				 * ignore them.
> +				 */
> +				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
> +				if (!tail_lsn && lsn)
> +					tail_lsn = lsn;
> +			} else {
>  				xfs_clear_li_failed(blip);
>  			}
>  		}
> -		xfs_ail_delete_finish(ailp, mlip_changed);
> +		xfs_ail_delete_finish(ailp, tail_lsn);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 9e3102179221..00d66175f41a 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -108,17 +108,25 @@ xfs_ail_next(
>   * We need the AIL lock in order to get a coherent read of the lsn of the last
>   * item in the AIL.
>   */
> +static xfs_lsn_t
> +__xfs_ail_min_lsn(
> +	struct xfs_ail		*ailp)
> +{
> +	struct xfs_log_item	*lip = xfs_ail_min(ailp);
> +
> +	if (lip)
> +		return lip->li_lsn;
> +	return 0;
> +}
> +
>  xfs_lsn_t
>  xfs_ail_min_lsn(
>  	struct xfs_ail		*ailp)
>  {
> -	xfs_lsn_t		lsn = 0;
> -	struct xfs_log_item	*lip;
> +	xfs_lsn_t		lsn;
>  
>  	spin_lock(&ailp->ail_lock);
> -	lip = xfs_ail_min(ailp);
> -	if (lip)
> -		lsn = lip->li_lsn;
> +	lsn = __xfs_ail_min_lsn(ailp);
>  	spin_unlock(&ailp->ail_lock);
>  
>  	return lsn;
> @@ -779,12 +787,20 @@ xfs_trans_ail_update_bulk(
>  	}
>  }
>  
> -bool
> +/*
> + * Delete one log item from the AIL.
> + *
> + * If this item was at the tail of the AIL, return the LSN of the log item so
> + * that we can use it to check if the LSN of the tail of the log has moved
> + * when finishing up the AIL delete process in xfs_ail_delete_finish().
> + */
> +xfs_lsn_t
>  xfs_ail_delete_one(
>  	struct xfs_ail		*ailp,
>  	struct xfs_log_item	*lip)
>  {
>  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
> +	xfs_lsn_t		lsn = lip->li_lsn;
>  
>  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
>  	xfs_ail_delete(ailp, lip);
> @@ -792,17 +808,20 @@ xfs_ail_delete_one(
>  	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
>  	lip->li_lsn = 0;
>  
> -	return mlip == lip;
> +	if (mlip == lip)
> +		return lsn;
> +	return 0;
>  }
>  
>  void
>  xfs_ail_delete_finish(
>  	struct xfs_ail		*ailp,
> -	bool			do_tail_update) __releases(ailp->ail_lock)
> +	xfs_lsn_t		old_lsn) __releases(ailp->ail_lock)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
>  
> -	if (!do_tail_update) {
> +	/* if the tail lsn hasn't changed, don't do updates or wakeups. */
> +	if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) {
>  		spin_unlock(&ailp->ail_lock);
>  		return;
>  	}

> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 5ab70b9b896f..db589bb7468d 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -92,8 +92,8 @@ xfs_trans_ail_update(
>  	xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn);
>  }
>  
> -bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> -void xfs_ail_delete_finish(struct xfs_ail *ailp, bool do_tail_update)
> +xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_delete_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
>  			__releases(ailp->ail_lock);
>  void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
>  		int shutdown_type);
> -- 
> 2.22.0
> 


* Re: [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint
  2019-08-01  2:17 ` [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint Dave Chinner
@ 2019-08-05 18:03   ` Brian Foster
  2019-08-05 23:33     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-05 18:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:43PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The CIL can pin a lot of memory and effectively defines the lower
> free memory boundary of operation for XFS. The way we hang onto
> log item shadow buffers "just in case" effectively doubles the
> memory footprint of the CIL for dubious reasons.
> 
> That is, we hang onto the old shadow buffer in case the next time
> we log the item it will fit into the shadow buffer and we won't have
> to allocate a new one. However, we only ever tend to grow dirty
> objects in the CIL through relogging, so once we've allocated a
> larger buffer the old buffer we set as a shadow buffer will never
> get reused as the amount we log never decreases until the item is
> clean. And then for buffer items we free the log item and the shadow
> buffers, anyway. Inode items will hold onto their shadow buffer
> until they are reclaimed - this could double the inode's memory
> footprint for it's lifetime...
> 
> Hence we should just free the old log item buffer when we replace it
> with a new shadow buffer rather than storing it for later use. It's
> not useful, get rid of it as early as possible.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index fa5602d0fd7f..1863a9bdf4a9 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -238,9 +238,7 @@ xfs_cil_prepare_item(
>  	/*
>  	 * If there is no old LV, this is the first time we've seen the item in
>  	 * this CIL context and so we need to pin it. If we are replacing the
> -	 * old_lv, then remove the space it accounts for and make it the shadow
> -	 * buffer for later freeing. In both cases we are now switching to the
> -	 * shadow buffer, so update the the pointer to it appropriately.
> +	 * old_lv, then remove the space it accounts for and free it.
>  	 */

The comment above xlog_cil_alloc_shadow_bufs() needs a similar update
around how we handle the old buffer when the shadow buffer is used.

>  	if (!old_lv) {
>  		if (lv->lv_item->li_ops->iop_pin)
> @@ -251,7 +249,8 @@ xfs_cil_prepare_item(
>  
>  		*diff_len -= old_lv->lv_bytes;
>  		*diff_iovecs -= old_lv->lv_niovecs;
> -		lv->lv_item->li_lv_shadow = old_lv;
> +		kmem_free(old_lv);
> +		lv->lv_item->li_lv_shadow = NULL;
>  	}

So IIUC this is the case where we allocated a shadow buffer, the item
was already pinned (so old_lv is still around) but we ended up using the
shadow buffer for this relog. Instead of keeping the old buffer around
as a new shadow, we toss it. That makes sense, but if the objective is
to not leave dangling shadow buffers around as such, what about the case
where we allocated a shadow buffer but didn't end up using it because
old_lv was reusable? It looks like we still keep the shadow buffer
around in that scenario with a similar lifetime as the swapout scenario
this patch removes. Hm?

Brian

>  
>  	/* attach new log vector to log item */
> -- 
> 2.22.0
> 


* Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-02 23:28         ` Dave Chinner
@ 2019-08-05 18:32           ` Chris Mason
  2019-08-05 23:09             ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Chris Mason @ 2019-08-05 18:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel, Jens Axboe

On 2 Aug 2019, at 19:28, Dave Chinner wrote:

> On Fri, Aug 02, 2019 at 02:11:53PM +0000, Chris Mason wrote:
>> On 1 Aug 2019, at 19:58, Dave Chinner wrote:
>> I can't really see bio->b_ioprio working without the rest of the IO
>> controller logic creating a sensible system,
>
> That's exactly the problem we need to solve. The current situation
> is ... untenable. Regardless of whether the io.latency controller
> works well, the fact is that the wbt subsystem is active on -all-
> configurations and the way it "prioritises" is completely broken.

Completely broken is probably a little strong.   Before wbt, it was 
impossible to do buffered IO without periodically saturating the drive 
in unexpected ways.  We've got a lot of data showing it helping, and 
it's pretty easy to setup a new A/B experiment to demonstrate it's 
usefulness in current kernels.  But that doesn't mean it's perfect.

>
>> framework to define weights etc.  My question is if it's worth trying
>> inside of the wbt code, or if we should just let the metadata go
>> through.
>
> As I said, that doesn't  solve the problem. We /want/ critical
> journal IO to have higher priority than background metadata
> writeback. Just ignoring REQ_META doesn't help us there - it just
> moves the priority inversion to blocking on request queue tags.

Does XFS background metadata IO ever get waited on by critical journal 
threads?  My understanding is that all of the filesystems do this from 
time to time.  Without a way to bump the priority of throttled 
background metadata IO, I can't see how to avoid prio inversions without 
running background metadata at the same prio as all of the critical 
journal IO.

>
>> Tejun reminded me that in a lot of ways, swap is user IO and it's
>> actually fine to have it prioritized at the same level as user IO.  
>> We
>
> I think that's wrong. Swap *in* could have user priority but swap
> *out* is global as there is no guarantee that the page being swapped
> belongs to the user context that is reclaiming memory.
>
> Lots of other user and kernel reclaim contexts may be waiting on
> that swap to complete, so it's important that swap out is not
> arbitrarily delayed or susceptible to priority inversions. i.e. swap
> out must take priority over swap-in and other user IO because that
> IO may require allocation to make progress via swapping to free
> "user" file data cached in memory....
>
>> don't want to let a low prio app thrash the drive swapping things in 
>> and
>> out all the time,
>
> Low priority apps will be throttled on *swap in* IO - i.e. by their
> incoming memory demand. High priority apps should be swapping out
> low priority app memory if there are shortages - that's what priority
> defines....
>
>> other higher priority processes aren't waiting for the memory.  This
>> depends on the cgroup config, so wrt your current patches it probably
>> sounds crazy, but we have a lot of data around this from the fleet.
>
> I'm not using cgroups.
>
> Core infrastructure needs to work without cgroups being configured
> to confine everything in userspace to "safe" bounds, and right now
> just running things in the root cgroup doesn't appear to work very
> well at all.

I'm not disagreeing with this part, my real point is there isn't a 
single answer.  It's possible for swap to be critical to the running of 
the box in some workloads, and totally unimportant in others.

-chris


* Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
  2019-08-05 18:32           ` Chris Mason
@ 2019-08-05 23:09             ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-05 23:09 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs, linux-mm, linux-fsdevel, Jens Axboe

On Mon, Aug 05, 2019 at 06:32:51PM +0000, Chris Mason wrote:
> On 2 Aug 2019, at 19:28, Dave Chinner wrote:
> 
> > On Fri, Aug 02, 2019 at 02:11:53PM +0000, Chris Mason wrote:
> >> On 1 Aug 2019, at 19:58, Dave Chinner wrote:
> >> I can't really see bio->b_ioprio working without the rest of the IO
> >> controller logic creating a sensible system,
> >
> > That's exactly the problem we need to solve. The current situation
> > is ... untenable. Regardless of whether the io.latency controller
> > works well, the fact is that the wbt subsystem is active on -all-
> > configurations and the way it "prioritises" is completely broken.
> 
> Completely broken is probably a little strong.   Before wbt, it was 
> impossible to do buffered IO without periodically saturating the drive 
> in unexpected ways.  We've got a lot of data showing it helping, and 
> it's pretty easy to setup a new A/B experiment to demonstrate it's 
> usefulness in current kernels.  But that doesn't mean it's perfect.

I'm not arguing that wbt is useless, I'm just saying that its
design w.r.t. IO prioritisation is fundamentally broken. Using
request types to try to infer priority just doesn't work, as I've
been trying to explain.

> >> framework to define weights etc.  My question is if it's worth trying
> >> inside of the wbt code, or if we should just let the metadata go
> >> through.
> >
> > As I said, that doesn't  solve the problem. We /want/ critical
> > journal IO to have higher priority than background metadata
> > writeback. Just ignoring REQ_META doesn't help us there - it just
> > moves the priority inversion to blocking on request queue tags.
> 
> Does XFS background metadata IO ever get waited on by critical journal 
> threads?

No. Background writeback (which, with this series, is the only way
metadata gets written in XFS) is almost entirely non-blocking until
IO submission occurs. It will force the log if pinned items are
preventing the log tail from moving (hence blocking on log IO) but
largely it doesn't block on anything except IO submission.

The only thing that blocks on journal IO is CIL flushing and,
subsequently, anything that is waiting on a journal flush to
complete. CIL flushing happens in its own workqueue, so it doesn't
block anything directly. The only operations that wait for log IO
require items to be stable in the journal (e.g. fsync()).

Starting a transactional change may block on metadata writeback. If
there isn't space in the log for the new transaction, it will kick
and wait for background metadata writeback to make progress and push
the tail of the log forwards.  And this may wait on journal IO if
pinned items need to be flushed to the log before writeback can
occur.

This is the way we prevent transactions requiring journal IO from
blocking on metadata writeback to make progress - we don't allow a
transaction to start until it is guaranteed that it can complete
without requiring journal IO to flush other metadata to the journal.
That way there is always space available in the log for all pending
journal IO to complete without a dependency on metadata writeback
making progress.

This "block on metadata writeback at transaction start" design means
data writeback can block on metadata writeback because we do
allocation transactions in the IO path. Which means data IO can
block behind metadata IO, which can block behind log IO, and that
largely defines the IO hierarchy in XFS.

Hence the IO priority order is very clear in XFS - it was designed
this way because you can't support things like guaranteed rate IO
storage applications (one of the prime use cases XFS was originally
designed for) without having a clear model for avoiding priority
inversions between data, metadata and the journal.

I'm not guessing about any of this - I know how all this is supposed
to work because I spent years at SGI working with people far smarter
than me supporting real-time IO applications working along with
real-time IO schedulers in a real time kernel (i.e.  Irix). I don't
make this stuff up for fun or to argue, I say stuff because I know
how it's supposed to work.

And, FWIW, Irix also had a block layer writeback throttling
mechanism to prevent bulk data writeback from thrashing disks and
starving higher priority IO. It was also fully IO priority aware -
this stuff isn't rocket science, and Linux is not the first OS to
ever implement this sort of functionality. Linux was not my first
rodeo....

> My understanding is that all of the filesystems do this from 
> time to time.  Without a way to bump the priority of throttled 
> background metadata IO, I can't see how to avoid prio inversions without 
> running background metadata at the same prio as all of the critical 
> journal IO.

Perhaps you just haven't thought about it enough. :)

> > Core infrastructure needs to work without cgroups being configured
> > to confine everything in userspace to "safe" bounds, and right now
> > just running things in the root cgroup doesn't appear to work very
> > well at all.
> 
> I'm not disagreeing with this part, my real point is there isn't a 
> single answer.  It's possible for swap to be critical to the running of 
> the box in some workloads, and totally unimportant in others.

Sure, but that only indicates that we need to be able to adjust the
priority of IO within certain bounds.

The problem right now is that the default behaviour is pretty
nasty and core functionality is non-functional. It doesn't matter if
swap priority is adjustable or not, users should not have to tune
the kernel to use an esoteric cgroup configuration in order for the
kernel to function correctly out of the box.

I'm not sure when we lost sight of the fact we need to make the
default configurations work correctly first, and only then do we
worry about how tunable something is when the default behaviour has
been proven to be insufficient. Hiding bad behaviour behind custom
cgroup configuration does nobody any favours.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 13/24] xfs: synchronous AIL pushing
  2019-08-05 17:51   ` Brian Foster
@ 2019-08-05 23:21     ` Dave Chinner
  2019-08-06 12:29       ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-05 23:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Mon, Aug 05, 2019 at 01:51:53PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:41PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Provide an interface to push the AIL to a target LSN and wait for
> > the tail of the log to move past that LSN. This is used to wait for
> > all items older than a specific LSN to either be cleaned (written
> > back) or relogged to a higher LSN in the AIL. The primary use for
> > this is to allow IO free inode reclaim throttling.
> > 
> > Factor the common AIL deletion code that does all the wakeups into a
> > helper so we only have one copy of this somewhat tricky code to
> > interface with all the wakeups necessary when the LSN of the log
> > tail changes.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_inode_item.c | 12 +------
> >  fs/xfs/xfs_trans_ail.c  | 69 +++++++++++++++++++++++++++++++++--------
> >  fs/xfs/xfs_trans_priv.h |  6 +++-
> >  3 files changed, 62 insertions(+), 25 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 6ccfd75d3c24..9e3102179221 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -654,6 +654,37 @@ xfs_ail_push_all(
> >  		xfs_ail_push(ailp, threshold_lsn);
> >  }
> >  
> > +/*
> > + * Push the AIL to a specific lsn and wait for it to complete.
> > + */
> > +void
> > +xfs_ail_push_sync(
> > +	struct xfs_ail		*ailp,
> > +	xfs_lsn_t		threshold_lsn)
> > +{
> > +	struct xfs_log_item	*lip;
> > +	DEFINE_WAIT(wait);
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	while ((lip = xfs_ail_min(ailp)) != NULL) {
> > +		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> > +		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> > +		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
> > +			break;
> > +		/* XXX: cmpxchg? */
> > +		while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
> > +			xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn);
> 
> Why the need to repeatedly copy the ail_target like this? If the push

It's a hack because the other updates are done unlocked and this
doesn't contain the memory barriers needed to make it correct
and race free.

Hence the comment "XXX: cmpxchg" to ensure that:

	a) we only ever move the target forwards;
	b) we resolve update races in an obvious, simple manner; and
	c) we can get rid of the possibly incorrect memory
	   barriers around this (unlocked) update.

RFC. WIP. :)
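
For illustration, a minimal sketch of what the cmpxchg version might
look like (assuming ail_target stays an xfs_lsn_t that can be updated
with cmpxchg64(); not the final code):

	/*
	 * Advance the push target monotonically. The cmpxchg resolves
	 * races with other threads trying to move the target at the
	 * same time and provides the ordering the unlocked copy lacks.
	 */
	xfs_lsn_t	old_target = ailp->ail_target;

	while (XFS_LSN_CMP(threshold_lsn, old_target) > 0) {
		xfs_lsn_t	prev;

		prev = cmpxchg64(&ailp->ail_target, old_target,
				 threshold_lsn);
		if (prev == old_target)
			break;		/* we moved the target forwards */
		old_target = prev;	/* lost the race, re-evaluate */
	}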

> target only ever moves forward, we should only need to do this once at
> the start of the function. In fact I'm kind of wondering why we can't
> just call xfs_ail_push(). If we check the tail item after grabbing the
> spin lock, we should be able to avoid any races with the waker, no?

I didn't use xfs_ail_push() because of having to prepare to wait
between determining if the AIL is empty and checking if we need
to update the target.

I also didn't want to affect the existing xfs_ail_push() as I was
modifying the xfs_ail_push_sync() code to do what was needed.
Eventually they can probably come back together, but for now I'm not
100% sure that the code is correct and race free.

> > +void
> > +xfs_ail_delete_finish(
> > +	struct xfs_ail		*ailp,
> > +	bool			do_tail_update) __releases(ailp->ail_lock)
> > +{
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> > +
> > +	if (!do_tail_update) {
> > +		spin_unlock(&ailp->ail_lock);
> > +		return;
> > +	}
> > +
> 
> Hmm.. so while what we really care about here are tail updates, this
> logic is currently driven by removing the min ail log item. That seems
> like a lot of potential churn if we're waking the pusher on every object
> written back covered by a single log record / checkpoint. Perhaps we
> should implement a bit more coarse wakeup logic such as only when the
> tail lsn actually changes, for example?

You mean the next patch?

> FWIW, it also doesn't look like you've handled the case of relogged
> items moving the tail forward anywhere that I can see, so we might be
> missing some wakeups here. See xfs_trans_ail_update_bulk() for
> additional AIL manipulation.

Good catch. That might be the race the next patch exposes :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/24] xfs: tail updates only need to occur when LSN changes
  2019-08-05 17:53   ` Brian Foster
@ 2019-08-05 23:28     ` Dave Chinner
  2019-08-06  5:33       ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-05 23:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Mon, Aug 05, 2019 at 01:53:26PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:42PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We currently wake anything waiting on the log tail to move whenever
> > the log item at the tail of the log is removed. Historically this
> > was fine behaviour because there were very few items at any given
> > LSN. But with delayed logging, there may be thousands of items at
> > any given LSN, and we can't move the tail until they are all gone.
> > 
> > Hence if we are removing them in near tail-first order, we might be
> > waking up processes waiting on the tail LSN to change (e.g. log
> > space waiters) repeatedly without them being able to make progress.
> > This also occurs with the new sync push waiters, and can result in
> > thousands of spurious wakeups every second when under heavy direct
> > reclaim pressure.
> > 
> > To fix this, check that the tail LSN has actually changed on the
> > AIL before triggering wakeups. This will reduce the number of
> > spurious wakeups when doing bulk AIL removal and make this code much
> > more efficient.
> > 
> > XXX: occasionally get a temporary hang in xfs_ail_push_sync() with
> > this change - log force from log worker gets things moving again.
> > Only happens under extreme memory pressure - possibly push racing
> > with a tail update on an empty log. Needs further investigation.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Ok, this addresses the wakeup granularity issue mentioned in the
> previous patch. Note that I was kind of wondering why we wouldn't base
> this on the l_tail_lsn update in xlog_assign_tail_lsn_locked() as
> opposed to the current approach.

Because I didn't think of it? :)

There's so much other stuff in this patch set I didn't spend a
lot of time thinking about other alternatives. This was a simple
code transformation that did what I wanted, and I went on to burn
brain cells on other, more complex issues that need to be solved...

> For example, xlog_assign_tail_lsn_locked() could simply check the
> current min item against the current l_tail_lsn before it does the
> assignment and use that to trigger tail change events. If we wanted to
> also filter out the other wakeups (as this patch does) then we could
> just pass a bool pointer or something that returns whether the tail
> actually changed.

Yeah, I'll have a look at this - I might rework it as additional
patches now the code is looking at decisions based on LSN rather
than if the tail log item changed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint
  2019-08-05 18:03   ` Brian Foster
@ 2019-08-05 23:33     ` Dave Chinner
  2019-08-06 12:57       ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-05 23:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Mon, Aug 05, 2019 at 02:03:01PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:43PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The CIL can pin a lot of memory and effectively defines the lower
> > free memory boundary of operation for XFS. The way we hang onto
> > log item shadow buffers "just in case" effectively doubles the
> > memory footprint of the CIL for dubious reasons.
> > 
> > That is, we hang onto the old shadow buffer in case the next time
> > we log the item it will fit into the shadow buffer and we won't have
> > to allocate a new one. However, we only ever tend to grow dirty
> > objects in the CIL through relogging, so once we've allocated a
> > larger buffer the old buffer we set as a shadow buffer will never
> > get reused as the amount we log never decreases until the item is
> > clean. And then for buffer items we free the log item and the shadow
> > buffers, anyway. Inode items will hold onto their shadow buffer
> > until they are reclaimed - this could double the inode's memory
> > footprint for it's lifetime...
> > 
> > Hence we should just free the old log item buffer when we replace it
> > with a new shadow buffer rather than storing it for later use. It's
> > not useful, get rid of it as early as possible.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index fa5602d0fd7f..1863a9bdf4a9 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -238,9 +238,7 @@ xfs_cil_prepare_item(
> >  	/*
> >  	 * If there is no old LV, this is the first time we've seen the item in
> >  	 * this CIL context and so we need to pin it. If we are replacing the
> > -	 * old_lv, then remove the space it accounts for and make it the shadow
> > -	 * buffer for later freeing. In both cases we are now switching to the
> > -	 * shadow buffer, so update the the pointer to it appropriately.
> > +	 * old_lv, then remove the space it accounts for and free it.
> >  	 */
> 
> The comment above xlog_cil_alloc_shadow_bufs() needs a similar update
> around how we handle the old buffer when the shadow buffer is used.

*nod*

> 
> >  	if (!old_lv) {
> >  		if (lv->lv_item->li_ops->iop_pin)
> > @@ -251,7 +249,8 @@ xfs_cil_prepare_item(
> >  
> >  		*diff_len -= old_lv->lv_bytes;
> >  		*diff_iovecs -= old_lv->lv_niovecs;
> > -		lv->lv_item->li_lv_shadow = old_lv;
> > +		kmem_free(old_lv);
> > +		lv->lv_item->li_lv_shadow = NULL;
> >  	}
> 
> So IIUC this is the case where we allocated a shadow buffer, the item
> was already pinned (so old_lv is still around) but we ended up using the
> shadow buffer for this relog. Instead of keeping the old buffer around
> as a new shadow, we toss it. That makes sense, but if the objective is
> to not leave dangling shadow buffers around as such, what about the case
> where we allocated a shadow buffer but didn't end up using it because
> old_lv was reusable? It looks like we still keep the shadow buffer
> around in that scenario with a similar lifetime as the swapout scenario
> this patch removes. Hm?

Off the top of my head, we shouldn't allocate a new shadow buffer in
that case (see xlog_cil_alloc_shadow_bufs()). i.e. we check up front
if the formatted size of the item will fit in the existing buffer,
and if it does we do not allocate a new shadow buffer as we just
reuse the existing one. So we should only have to free a shadow
buffer when we switch them, not when we overwrite.
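
In other words (a rough paraphrase of the check being described, with
approximate variable names, not the actual xlog_cil_alloc_shadow_bufs()
source):

	/* only allocate a new shadow buffer if the item has outgrown it */
	if (!lip->li_lv_shadow ||
	    formatted_size > lip->li_lv_shadow->lv_size) {
		kmem_free(lip->li_lv_shadow);
		lip->li_lv_shadow = kmem_alloc_large(formatted_size,
						     KM_SLEEP | KM_NOFS);
		/* ... initialise the new log vector here ... */
	}
	/* otherwise the existing shadow buffer fits and is reused as is */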

I'll recheck this, but I'm pretty sure overwrite won't leave a
shadow buffer around.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 01/24] mm: directed shrinker work deferral
  2019-08-05 17:42       ` Brian Foster
@ 2019-08-05 23:43         ` Dave Chinner
  2019-08-06 12:27           ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-05 23:43 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Mon, Aug 05, 2019 at 01:42:26PM -0400, Brian Foster wrote:
> On Sun, Aug 04, 2019 at 11:49:30AM +1000, Dave Chinner wrote:
> > On Fri, Aug 02, 2019 at 11:27:09AM -0400, Brian Foster wrote:
> > > On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> > > >  };
> > > >  
> > > >  #define SHRINK_STOP (~0UL)
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 44df66a98f2a..ae3035fe94bc 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > > >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > > >  				   freeable, delta, total_scan, priority);
> > > >  
> > > > +	/*
> > > > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > > > +	 * defer the work to a context that can scan the cache.
> > > > +	 */
> > > > +	if (shrinkctl->will_defer)
> > > > +		goto done;
> > > > +
> > > 
> > > Who's responsible for clearing the flag? Perhaps we should do so here
> > > once it's acted upon since we don't call into the shrinker again?
> > 
> > Each shrinker invocation has it's own shrink_control context - they
> > are not shared between shrinkers - the higher level is responsible
> > for setting up the control state of each individual shrinker
> > invocation...
> > 
> 
> Yes, but more specifically, it appears to me that each level is
> responsible for setting up control state managed by that level. E.g.,
> shrink_slab_memcg() initializes the unchanging state per iteration and
> do_shrink_slab() (re)sets the scan state prior to ->scan_objects().

do_shrink_slab() is responsible for iterating the scan in
shrinker->batch sizes, that's all it's doing there. We have to do
some accounting work from scan to scan. However, if ->will_defer is
set, we skip that entire loop, so it's largely irrelevant IMO.

> > > Granted the deferred state likely hasn't
> > > changed, but the fact that we'd call back into the count callback to set
> > > it again implies the logic could be a bit more explicit, particularly if
> > > this will eventually be used for more dynamic shrinker state that might
> > > change call to call (i.e., object dirty state, etc.).
> > > 
> > > BTW, do we need to care about the ->nr_cached_objects() call from the
> > > generic superblock shrinker (super_cache_scan())?
> > 
> > No, and we never had to because it is inside the superblock shrinker
> > and the superblock shrinker does the GFP_NOFS context checks.
> > 
> 
> Ok. Though tbh this topic has me wondering whether a shrink_control
> boolean is the right approach here. Do you envision ->will_defer being
> used for anything other than allocation context restrictions? If not,

Not at this point. If there are other control flags needed, we can
add them in the future - I don't like the idea of having a single control
flag mean different things in different contexts.

> perhaps we should do something like optionally set alloc flags required
> for direct scanning in the struct shrinker itself and let the core
> shrinker code decide when to defer to kswapd based on the shrink_control
> flags and the current shrinker. That way an arbitrary shrinker can't
> muck around with core behavior in unintended ways. Hm?

Arbitrary shrinkers can't "muck about" with the core behaviour any
more than they already could with this code. If you want to screw up
the core reclaim by always returning SHRINK_STOP to ->scan_objects
instead of doing work, then there is nothing stopping you from doing
that right now. Formalising their work deferral into a flag in the
shrink_control doesn't really change that at all, and as such I
don't see any need for over-complicating the mechanism here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/24] xfs: tail updates only need to occur when LSN changes
  2019-08-05 23:28     ` Dave Chinner
@ 2019-08-06  5:33       ` Dave Chinner
  2019-08-06 12:53         ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-06  5:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 09:28:26AM +1000, Dave Chinner wrote:
> On Mon, Aug 05, 2019 at 01:53:26PM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:42PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > We currently wake anything waiting on the log tail to move whenever
> > > the log item at the tail of the log is removed. Historically this
> > > was fine behaviour because there were very few items at any given
> > > LSN. But with delayed logging, there may be thousands of items at
> > > any given LSN, and we can't move the tail until they are all gone.
> > > 
> > > Hence if we are removing them in near tail-first order, we might be
> > > waking up processes waiting on the tail LSN to change (e.g. log
> > > space waiters) repeatedly without them being able to make progress.
> > > This also occurs with the new sync push waiters, and can result in
> > > thousands of spurious wakeups every second when under heavy direct
> > > reclaim pressure.
> > > 
> > > To fix this, check that the tail LSN has actually changed on the
> > > AIL before triggering wakeups. This will reduce the number of
> > > spurious wakeups when doing bulk AIL removal and make this code much
> > > more efficient.
> > > 
> > > XXX: occasionally get a temporary hang in xfs_ail_push_sync() with
> > > this change - log force from log worker gets things moving again.
> > > Only happens under extreme memory pressure - possibly push racing
> > > with a tail update on an empty log. Needs further investigation.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > Ok, this addresses the wakeup granularity issue mentioned in the
> > previous patch. Note that I was kind of wondering why we wouldn't base
> > this on the l_tail_lsn update in xlog_assign_tail_lsn_locked() as
> > opposed to the current approach.
> 
> Because I didn't think of it? :)
> 
> There's so much other stuff in this patch set I didn't spend a
> lot of time thinking about other alternatives. this was a simple
> code transformation that did what I wanted, and I went on to burning
> brain cells on other more complex issues that needs to be solved...
> 
> > For example, xlog_assign_tail_lsn_locked() could simply check the
> > current min item against the current l_tail_lsn before it does the
> > assignment and use that to trigger tail change events. If we wanted to
> > also filter out the other wakeups (as this patch does) then we could
> > just pass a bool pointer or something that returns whether the tail
> > actually changed.
> 
> Yeah, I'll have a look at this - I might rework it as additional
> patches now the code is looking at decisions based on LSN rather
> than if the tail log item changed...

Ok, this is not worth the complexity. The wakeup code has to be able
to tell the difference between a changed tail lsn and an empty AIL
so that wakeups can be issued when the AIL is finally emptied.
Unmount (xfs_ail_push_all_sync()) relies on this, and
xlog_assign_tail_lsn_locked() hides the empty AIL from the caller
by returning log->l_last_sync_lsn to the caller.

Hence the wakeup code still has to check for an empty AIL even when the
tail has changed if we use the return value of
xlog_assign_tail_lsn_locked() as the tail LSN. At which point, the
logic becomes somewhat convoluted, and it's far simpler to use
__xfs_ail_min_lsn() as it returns 0 when the AIL is empty.

So, nice idea, but it doesn't make the code simpler or easier to
understand....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 11/24] xfs:: account for memory freed from metadata buffers
  2019-08-01  9:21     ` Dave Chinner
@ 2019-08-06  5:51       ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2019-08-06  5:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 07:21:33PM +1000, Dave Chinner wrote:
> > static inline void shrinker_mark_pages_reclaimed(unsigned long nr_pages)
> > {
> > 	if (current->reclaim_state)
> > 		current->reclaim_state->reclaimed_pages += nr_pages;
> > }
> > 
> > plus good documentation on when to use it.
> 
> Sure, but that's something for patch 6, not this one :)

Sounds good, I just skimmed through the XFS patches.  While we are at
it:  there is a double : in the subject line.


* Re: [PATCH 12/24] xfs: correctly acount for reclaimable slabs
  2019-08-01  2:17 ` [PATCH 12/24] xfs: correctly acount for reclaimable slabs Dave Chinner
@ 2019-08-06  5:52   ` Christoph Hellwig
  2019-08-06 21:05     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2019-08-06  5:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:40PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The XFS inode item slab is actually reclaimed by inode shrinker
> callbacks from the memory reclaim subsystem. These should be marked
> as reclaimable so the mm subsystem has the full picture of how much
> memory it can actually reclaim from the XFS slab caches.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

Btw, I wonder if we should just kill off our KM_ZONE_* defines.  They
just make it a little harder to figure out what is actually going on
without a real benefit.
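
i.e. (sketch only - from memory the KM_ZONE_* values are just 1:1
aliases for the SLAB_* flags) instead of:

	xfs_ili_zone = kmem_zone_init_flags(sizeof(struct xfs_inode_log_item),
				"xfs_ili",
				KM_ZONE_SPREAD | KM_ZONE_RECLAIM, NULL);

we'd just spell out what is actually going on:

	xfs_ili_zone = kmem_zone_init_flags(sizeof(struct xfs_inode_log_item),
				"xfs_ili",
				SLAB_MEM_SPREAD | SLAB_RECLAIM_ACCOUNT, NULL);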

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim
  2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (23 preceding siblings ...)
  2019-08-01  2:17 ` [PATCH 24/24] xfs: remove unusued old inode reclaim code Dave Chinner
@ 2019-08-06  5:57 ` Christoph Hellwig
  2019-08-06 21:37   ` Dave Chinner
  24 siblings, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2019-08-06  5:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

Hi Dave,

do you have a git tree available to look over the whole series?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/24] mm: directed shrinker work deferral
  2019-08-05 23:43         ` Dave Chinner
@ 2019-08-06 12:27           ` Brian Foster
  2019-08-06 22:22             ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-06 12:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 09:43:18AM +1000, Dave Chinner wrote:
> On Mon, Aug 05, 2019 at 01:42:26PM -0400, Brian Foster wrote:
> > On Sun, Aug 04, 2019 at 11:49:30AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 02, 2019 at 11:27:09AM -0400, Brian Foster wrote:
> > > > On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> > > > >  };
> > > > >  
> > > > >  #define SHRINK_STOP (~0UL)
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 44df66a98f2a..ae3035fe94bc 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > > > >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > > > >  				   freeable, delta, total_scan, priority);
> > > > >  
> > > > > +	/*
> > > > > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > > > > +	 * defer the work to a context that can scan the cache.
> > > > > +	 */
> > > > > +	if (shrinkctl->will_defer)
> > > > > +		goto done;
> > > > > +
> > > > 
> > > > Who's responsible for clearing the flag? Perhaps we should do so here
> > > > once it's acted upon since we don't call into the shrinker again?
> > > 
> > > Each shrinker invocation has its own shrink_control context - they
> > > are not shared between shrinkers - the higher level is responsible
> > > for setting up the control state of each individual shrinker
> > > invocation...
> > > 
> > 
> > Yes, but more specifically, it appears to me that each level is
> > responsible for setting up control state managed by that level. E.g.,
> > shrink_slab_memcg() initializes the unchanging state per iteration and
> > do_shrink_slab() (re)sets the scan state prior to ->scan_objects().
> 
> do_shrink_slab() is responsible for iterating the scan in
> shrinker->batch sizes, that's all it's doing there. We have to do
> some accounting work from scan to scan. However, if ->will_defer is
> set, we skip that entire loop, so it's largely irrelevant IMO.
> 

The point is very simply that there are scenarios where ->will_defer
might be true or might be false on do_shrink_slab() entry and I'm just
noting it as a potential landmine. It's not a bug in the current code
from what I can tell. I can't imagine why we wouldn't just reset the
flag prior to the ->count_objects() call, but alas I'm not a maintainer
of this code so I'll leave it to other reviewers/maintainers at this
point..
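
IOW, all I'm really wondering about is something like this in
do_shrink_slab() (untested):

	/* defer state is per invocation, don't inherit anything stale */
	shrinkctl->will_defer = false;

	freeable = shrinker->count_objects(shrinker, shrinkctl);
	if (freeable == 0 || freeable == SHRINK_EMPTY)
		return freeable;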

> > > > Granted the deferred state likely hasn't
> > > > changed, but the fact that we'd call back into the count callback to set
> > > > it again implies the logic could be a bit more explicit, particularly if
> > > > this will eventually be used for more dynamic shrinker state that might
> > > > change call to call (i.e., object dirty state, etc.).
> > > > 
> > > > BTW, do we need to care about the ->nr_cached_objects() call from the
> > > > generic superblock shrinker (super_cache_scan())?
> > > 
> > > No, and we never had to because it is inside the superblock shrinker
> > > and the superblock shrinker does the GFP_NOFS context checks.
> > > 
> > 
> > Ok. Though tbh this topic has me wondering whether a shrink_control
> > boolean is the right approach here. Do you envision ->will_defer being
> > used for anything other than allocation context restrictions? If not,
> 
> Not at this point. If there are other control flags needed, we can
> add them in future - I don't like the idea of having a single control
> flag mean different things in different contexts.
> 

I don't think we're talking about the same thing here..

> > perhaps we should do something like optionally set alloc flags required
> > for direct scanning in the struct shrinker itself and let the core
> > shrinker code decide when to defer to kswapd based on the shrink_control
> > flags and the current shrinker. That way an arbitrary shrinker can't
> > muck around with core behavior in unintended ways. Hm?
> 
> Arbitrary shrinkers can't "muck about" with the core behaviour any
> more than they already could with this code. If you want to screw up
> the core reclaim by always returning SHRINK_STOP to ->scan_objects
> instead of doing work, then there is nothing stopping you from doing
> that right now. Formalising the work deferral into a flag in the
> shrink_control doesn't really change that at all, and as such I
> don't see any need for over-complicating the mechanism here....
> 

If you add a generic "defer work" knob to the shrinker mechanism, but
only process it as an "allocation context" check, I expect it could be
easily misused. For example, some shrinkers may decide to set the
flag dynamically based on in-core state. This will work when called from
some contexts but not from others (unrelated to allocation context),
which is confusing. Therefore, what I'm saying is that if the only
current use case is to defer work from shrinkers that currently skip
work due to allocation context restraints, this might be better codified
with something like the appended (untested) example patch. This may or
may not be a preferable interface to the flag, but it's certainly not an
overcomplication...

Brian

--- 8< ---

diff --git a/fs/super.c b/fs/super.c
index 113c58f19425..4e05ed9d6154 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -69,13 +69,6 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
-	/*
-	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
-	 * to recurse into the FS that called us in clear_inode() and friends..
-	 */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	if (!trylock_super(sb))
 		return SHRINK_STOP;
 
@@ -264,6 +257,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
 	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+	s->s_shrink.direct_mask = __GFP_FS;
 	if (prealloc_shrinker(&s->s_shrink))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..e94e4edf7f1e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -75,6 +75,8 @@ struct shrinker {
 #endif
 	/* objs pending delete, per node */
 	atomic_long_t *nr_deferred;
+
+	gfp_t	direct_mask;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44df66a98f2a..fb339399e26a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -541,6 +541,15 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 				   freeable, delta, total_scan, priority);
 
+	/*
+	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
+	 * defer the work to a context that can scan the cache.
+	 */
+	if (shrinker->direct_mask &&
+	    ((shrinkctl->gfp_mask & shrinker->direct_mask) !=
+	     shrinker->direct_mask))
+		goto done;
+
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
 	 * pass to avoid too frequent shrinker calls, but if the slab has less
@@ -575,6 +584,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 
+done:
 	if (next_deferred >= scanned)
 		next_deferred -= scanned;
 	else

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH 13/24] xfs: synchronous AIL pushing
  2019-08-05 23:21     ` Dave Chinner
@ 2019-08-06 12:29       ` Brian Foster
  0 siblings, 0 replies; 87+ messages in thread
From: Brian Foster @ 2019-08-06 12:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 09:21:32AM +1000, Dave Chinner wrote:
> On Mon, Aug 05, 2019 at 01:51:53PM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:41PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Provide an interface to push the AIL to a target LSN and wait for
> > > the tail of the log to move past that LSN. This is used to wait for
> > > all items older than a specific LSN to either be cleaned (written
> > > back) or relogged to a higher LSN in the AIL. The primary use for
> > > this is to allow IO free inode reclaim throttling.
> > > 
> > > Factor the common AIL deletion code that does all the wakeups into a
> > > helper so we only have one copy of this somewhat tricky code to
> > > interface with all the wakeups necessary when the LSN of the log
> > > tail changes.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_inode_item.c | 12 +------
> > >  fs/xfs/xfs_trans_ail.c  | 69 +++++++++++++++++++++++++++++++++--------
> > >  fs/xfs/xfs_trans_priv.h |  6 +++-
> > >  3 files changed, 62 insertions(+), 25 deletions(-)
> > > 
> > ...
> > > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > > index 6ccfd75d3c24..9e3102179221 100644
> > > --- a/fs/xfs/xfs_trans_ail.c
> > > +++ b/fs/xfs/xfs_trans_ail.c
> > > @@ -654,6 +654,37 @@ xfs_ail_push_all(
> > >  		xfs_ail_push(ailp, threshold_lsn);
> > >  }
> > >  
> > > +/*
> > > + * Push the AIL to a specific lsn and wait for it to complete.
> > > + */
> > > +void
> > > +xfs_ail_push_sync(
> > > +	struct xfs_ail		*ailp,
> > > +	xfs_lsn_t		threshold_lsn)
> > > +{
> > > +	struct xfs_log_item	*lip;
> > > +	DEFINE_WAIT(wait);
> > > +
> > > +	spin_lock(&ailp->ail_lock);
> > > +	while ((lip = xfs_ail_min(ailp)) != NULL) {
> > > +		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> > > +		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> > > +		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
> > > +			break;
> > > +		/* XXX: cmpxchg? */
> > > +		while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
> > > +			xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn);
> > 
> > Why the need to repeatedly copy the ail_target like this? If the push
> 
> It's a hack because the other updates are done unlocked and this
> doesn't contain the memory barriers needed to make it correct
> and race free.
> 
> Hence the comment "XXX: cmpxchg" to ensure that:
> 
> 	a) we only ever move the target forwards;
> 	b) we resolve update races in an obvious, simple manner; and
> 	c) we can get rid of the possibly incorrect memory
> 	   barriers around this (unlocked) update.
> 
> RFC. WIP. :)
> 

Ack..
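
FWIW, I'd expect the cmpxchg variant to look something like the
following (untested sketch, assuming cmpxchg64() is usable on
->ail_target):

	xfs_lsn_t	old = ailp->ail_target;

	while (XFS_LSN_CMP(threshold_lsn, old) > 0) {
		xfs_lsn_t	prev;

		prev = cmpxchg64(&ailp->ail_target, old, threshold_lsn);
		if (prev == old)
			break;		/* we moved the target forwards */
		old = prev;		/* lost a race, reevaluate */
	}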

> > target only ever moves forward, we should only need to do this once at
> > the start of the function. In fact I'm kind of wondering why we can't
> > just call xfs_ail_push(). If we check the tail item after grabbing the
> > spin lock, we should be able to avoid any races with the waker, no?
> 
> I didn't use xfs_ail_push() because of having to prepare to wait
> between determining if the AIL is empty and checking if we need
> to update the target.
> 
> I also didn't want to affect the existing xfs_ail_push() as I was
> modifying the xfs_ail_push_sync() code to do what was needed.
> Eventually they can probably come back together, but for now I'm not
> 100% sure that the code is correct and race free.
> 

Ok, just chalk this up to general feedback around seeing if we can
improve the factoring between the several AIL pushing functions we now
have once this mechanism is working/correct.

> > > +void
> > > +xfs_ail_delete_finish(
> > > +	struct xfs_ail		*ailp,
> > > +	bool			do_tail_update) __releases(ailp->ail_lock)
> > > +{
> > > +	struct xfs_mount	*mp = ailp->ail_mount;
> > > +
> > > +	if (!do_tail_update) {
> > > +		spin_unlock(&ailp->ail_lock);
> > > +		return;
> > > +	}
> > > +
> > 
> > Hmm.. so while what we really care about here are tail updates, this
> > logic is currently driven by removing the min ail log item. That seems
> > like a lot of potential churn if we're waking the pusher on every object
> > written back covered by a single log record / checkpoint. Perhaps we
> > should implement a bit more coarse wakeup logic such as only when the
> > tail lsn actually changes, for example?
> 
> You mean the next patch?
> 

Yep, hadn't got to that point yet. FWIW, I think the next patch should
probably come before this one so there isn't a transient state with
whatever behavior results from this patch by itself.

Brian

> > FWIW, it also doesn't look like you've handled the case of relogged
> > items moving the tail forward anywhere that I can see, so we might be
> > missing some wakeups here. See xfs_trans_ail_update_bulk() for
> > additional AIL manipulation.
> 
> Good catch. That might be the race the next patch exposes :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 14/24] xfs: tail updates only need to occur when LSN changes
  2019-08-06  5:33       ` Dave Chinner
@ 2019-08-06 12:53         ` Brian Foster
  2019-08-06 21:11           ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-06 12:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 03:33:38PM +1000, Dave Chinner wrote:
> On Tue, Aug 06, 2019 at 09:28:26AM +1000, Dave Chinner wrote:
> > On Mon, Aug 05, 2019 at 01:53:26PM -0400, Brian Foster wrote:
> > > On Thu, Aug 01, 2019 at 12:17:42PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > We currently wake anything waiting on the log tail to move whenever
> > > > the log item at the tail of the log is removed. Historically this
> > > > was fine behaviour because there were very few items at any given
> > > > LSN. But with delayed logging, there may be thousands of items at
> > > > any given LSN, and we can't move the tail until they are all gone.
> > > > 
> > > > Hence if we are removing them in near tail-first order, we might be
> > > > waking up processes waiting on the tail LSN to change (e.g. log
> > > > space waiters) repeatedly without them being able to make progress.
> > > > This also occurs with the new sync push waiters, and can result in
> > > > thousands of spurious wakeups every second when under heavy direct
> > > > reclaim pressure.
> > > > 
> > > > To fix this, check that the tail LSN has actually changed on the
> > > > AIL before triggering wakeups. This will reduce the number of
> > > > spurious wakeups when doing bulk AIL removal and make this code much
> > > > more efficient.
> > > > 
> > > > XXX: occasionally get a temporary hang in xfs_ail_push_sync() with
> > > > this change - log force from log worker gets things moving again.
> > > > Only happens under extreme memory pressure - possibly push racing
> > > > with a tail update on an empty log. Needs further investigation.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > 
> > > Ok, this addresses the wakeup granularity issue mentioned in the
> > > previous patch. Note that I was kind of wondering why we wouldn't base
> > > this on the l_tail_lsn update in xlog_assign_tail_lsn_locked() as
> > > opposed to the current approach.
> > 
> > Because I didn't think of it? :)
> > 
> > There's so much other stuff in this patch set I didn't spend a
> > lot of time thinking about other alternatives. This was a simple
> > code transformation that did what I wanted, and I went on to burning
> > brain cells on other more complex issues that need to be solved...
> > 
> > > For example, xlog_assign_tail_lsn_locked() could simply check the
> > > current min item against the current l_tail_lsn before it does the
> > > assignment and use that to trigger tail change events. If we wanted to
> > > also filter out the other wakeups (as this patch does) then we could
> > > just pass a bool pointer or something that returns whether the tail
> > > actually changed.
> > 
> > Yeah, I'll have a look at this - I might rework it as additional
> > patches now the code is looking at decisions based on LSN rather
> > than if the tail log item changed...
> 
> Ok, this is not worth the complexity. The wakeup code has to be able
> to tell the difference between a changed tail lsn and an empty AIL
> so that wakeups can be issued when the AIL is finally emptied.
> Unmount (xfs_ail_push_all_sync()) relies on this, and
> xlog_assign_tail_lsn_locked() hides the empty AIL from the caller
> by returning log->l_last_sync_lsn to the caller.
> 

Wouldn't either case just be a wakeup from xlog_assign_tail_lsn_locked()
(which should probably be renamed if we took that approach)? It's called
when we've removed the min item from the AIL and so potentially need to
update the tail lsn. 
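
Roughly this, i.e. do the comparison against the current l_tail_lsn
before the assignment (untested sketch, ignoring the rename and
whether we'd rather return a bool than wake from here directly):

	lip = xfs_ail_min(mp->m_ail);
	if (lip)
		tail_lsn = lip->li_lsn;
	else
		tail_lsn = atomic64_read(&log->l_last_sync_lsn);

	/* covers both a moved tail and a newly emptied AIL */
	if (tail_lsn != atomic64_read(&log->l_tail_lsn))
		wake_up_all(&mp->m_ail->ail_push);

	atomic64_set(&log->l_tail_lsn, tail_lsn);
	return tail_lsn;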

> Hence the wakeup code still has to check for an empty AIL if the
> tail has changed if we use the return value of
> xlog_assign_tail_lsn_locked() as the tail LSN. At which point, the
> logic becomes somewhat convoluted, and it's far simpler to use
> __xfs_ail_min_lsn as it returns 0 when the log is empty.
> 
> So, nice idea, but it doesn't make the code simpler or easier to
> understand....
> 

It's not that big of a deal either way. BTW on another quick look, I
think something like xfs_ail_update_tail(ailp, old_tail) is a bit more
self-documenting than xfs_ail_delete_finish(ailp, old_lsn).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint
  2019-08-05 23:33     ` Dave Chinner
@ 2019-08-06 12:57       ` Brian Foster
  2019-08-06 21:21         ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-06 12:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 09:33:26AM +1000, Dave Chinner wrote:
> On Mon, Aug 05, 2019 at 02:03:01PM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:43PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The CIL can pin a lot of memory and effectively defines the lower
> > > free memory boundary of operation for XFS. The way we hang onto
> > > log item shadow buffers "just in case" effectively doubles the
> > > memory footprint of the CIL for dubious reasons.
> > > 
> > > That is, we hang onto the old shadow buffer in case the next time
> > > we log the item it will fit into the shadow buffer and we won't have
> > > to allocate a new one. However, we only ever tend to grow dirty
> > > objects in the CIL through relogging, so once we've allocated a
> > > larger buffer the old buffer we set as a shadow buffer will never
> > > get reused as the amount we log never decreases until the item is
> > > clean. And then for buffer items we free the log item and the shadow
> > > buffers, anyway. Inode items will hold onto their shadow buffer
> > > until they are reclaimed - this could double the inode's memory
> > > footprint for its lifetime...
> > > 
> > > Hence we should just free the old log item buffer when we replace it
> > > with a new shadow buffer rather than storing it for later use. It's
> > > not useful, get rid of it as early as possible.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c | 7 +++----
> > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index fa5602d0fd7f..1863a9bdf4a9 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -238,9 +238,7 @@ xfs_cil_prepare_item(
> > >  	/*
> > >  	 * If there is no old LV, this is the first time we've seen the item in
> > >  	 * this CIL context and so we need to pin it. If we are replacing the
> > > -	 * old_lv, then remove the space it accounts for and make it the shadow
> > > -	 * buffer for later freeing. In both cases we are now switching to the
> > > -	 * shadow buffer, so update the the pointer to it appropriately.
> > > +	 * old_lv, then remove the space it accounts for and free it.
> > >  	 */
> > 
> > The comment above xlog_cil_alloc_shadow_bufs() needs a similar update
> > around how we handle the old buffer when the shadow buffer is used.
> 
> *nod*
> 
> > 
> > >  	if (!old_lv) {
> > >  		if (lv->lv_item->li_ops->iop_pin)
> > > @@ -251,7 +249,8 @@ xfs_cil_prepare_item(
> > >  
> > >  		*diff_len -= old_lv->lv_bytes;
> > >  		*diff_iovecs -= old_lv->lv_niovecs;
> > > -		lv->lv_item->li_lv_shadow = old_lv;
> > > +		kmem_free(old_lv);
> > > +		lv->lv_item->li_lv_shadow = NULL;
> > >  	}
> > 
> > So IIUC this is the case where we allocated a shadow buffer, the item
> > was already pinned (so old_lv is still around) but we ended up using the
> > shadow buffer for this relog. Instead of keeping the old buffer around
> > as a new shadow, we toss it. That makes sense, but if the objective is
> > to not leave dangling shadow buffers around as such, what about the case
> > where we allocated a shadow buffer but didn't end up using it because
> > old_lv was reusable? It looks like we still keep the shadow buffer
> > around in that scenario with a similar lifetime as the swapout scenario
> > this patch removes. Hm?
> 
> Off the top of my head, we shouldn't allocate a new shadow buffer in
> that case (see xlog_cil_alloc_shadow_bufs()). i.e. we check up front
> if the formatted size of the item will fit in the existing buffer,
> and if it does we do not allocate a new shadow buffer as we just
> reuse the existing one. So we should only have to free a shadow
> buffer when we switch them, not when we overwrite.
> 

We have such a check in xlog_cil_insert_format_items(), so we'd reuse
->li_lv if it will suffice even if we have a shadow buffer available.

> I'll recheck this, but I'm pretty sure overwrite won't leave a
> shadow buffer around.
> 

But before that we have the following logic:

static void
xlog_cil_alloc_shadow_bufs(
	...

	if (!lip->li_lv_shadow ||
	    buf_size > lip->li_lv_shadow->lv_size) {
		...
		lv = kmem_alloc_large(buf_size, KM_SLEEP | KM_NOFS);
		...
		lip->li_lv_shadow = lv;
	} else {
		<reuse shadow>
	}
	...
}

... which always allocates a shadow buffer if one doesn't exist. We
don't look at the currently used (lip->li_lv) buffer at all here. IIUC,
that has to do with the TOCTOU race described in the big comment above
the function.. hm?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 17/24] xfs: don't block kswapd in inode reclaim
  2019-08-01  2:17 ` [PATCH 17/24] xfs: don't block kswapd in inode reclaim Dave Chinner
@ 2019-08-06 18:21   ` Brian Foster
  2019-08-06 21:27     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-06 18:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:45PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We have a number of reasons for blocking kswapd in XFS inode
> reclaim, mainly all to do with the fact that memory reclaim has no
> feedback mechanisms to throttle on dirty slab objects that need IO
> to reclaim.
> 
> As a result, we currently throttle inode reclaim by issuing IO in
> the reclaim context. The unfortunate side effect of this is that it
> can cause long tail latencies in reclaim and for some workloads this
> can be a problem.
> 
> Now that the shrinkers finally have a method of telling kswapd to
> back off, we can start the process of making inode reclaim in XFS
> non-blocking. The first thing we need to do is not block kswapd, but
> so that doesn't cause immediate serious problems, make sure inode
> writeback is always underway when kswapd is running.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0b0fd10a36d4..2fa2f8dcf86b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1378,11 +1378,22 @@ xfs_reclaim_inodes_nr(
>  	struct xfs_mount	*mp,
>  	int			nr_to_scan)
>  {
> -	/* kick background reclaimer and push the AIL */
> +	int			sync_mode = SYNC_TRYLOCK;
> +
> +	/* kick background reclaimer */
>  	xfs_reclaim_work_queue(mp);
> -	xfs_ail_push_all(mp->m_ail);
>  
> -	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> +	/*
> +	 * For kswapd, we kick background inode writeback. For direct
> +	 * reclaim, we issue and wait on inode writeback to throttle
> +	 * reclaim rates and avoid shouty OOM-death.
> +	 */
> +	if (current_is_kswapd())
> +		xfs_ail_push_all(mp->m_ail);

So we're unblocking kswapd from dirty items, but we already kick the AIL
regardless of kswapd or not in inode reclaim. Why the change to no
longer kick the AIL in the !kswapd case? Whatever the reasoning, a
mention in the commit log would be helpful...

Brian

> +	else
> +		sync_mode |= SYNC_WAIT;
> +
> +	return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
>  }
>  
>  /*
> -- 
> 2.22.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 18/24] xfs: reduce kswapd blocking on inode locking.
  2019-08-01  2:17 ` [PATCH 18/24] xfs: reduce kswapd blocking on inode locking Dave Chinner
@ 2019-08-06 18:22   ` Brian Foster
  2019-08-06 21:33     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-06 18:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:46PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When doing async node reclaiming, we grab a batch of inodes that we
> are likely able to reclaim and ignore those that are already
> flushing. However, when we actually go to reclaim them, the first
> thing we do is lock the inode. If we are racing with something
> else reclaiming the inode or flushing it because it is dirty,
> we block on the inode lock. Hence we can still block kswapd here.
> 
> Further, if we flush an inode, we also cluster all the other dirty
> inodes in that cluster into the same IO, flush locking them all.
> However, if the workload is operating on sequential inodes (e.g.
> created by a tarball extraction) most of these inodes will be
> sequential in the cache and so in the same batch
> we've already grabbed for reclaim scanning.
> 
> As a result, it is common for all the inodes in the batch to be
> dirty and it is common for the first inode flushed to also flush all
> the inodes in the reclaim batch. In which case, they are now all
> going to be flush locked and we do not want to block on them.
> 

Hmm... I think I'm missing something with this description. For dirty
inodes that are flushed in a cluster via reclaim as described, aren't we
already blocking on all of the flush locks by virtue of the synchronous
I/O associated with the flush of the first dirty inode in that
particular cluster?

Brian

> Hence, for async reclaim (SYNC_TRYLOCK) make sure we always use
> trylock semantics and abort reclaim of an inode as quickly as we can
> without blocking kswapd.
> 
> Found via tracing and finding big batches of repeated lock/unlock
> runs on inodes that we just flushed by write clustering during
> reclaim.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 2fa2f8dcf86b..e6b9030875b9 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1104,11 +1104,23 @@ xfs_reclaim_inode(
>  
>  restart:
>  	error = 0;
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	if (!xfs_iflock_nowait(ip)) {
> -		if (!(sync_mode & SYNC_WAIT))
> +	/*
> +	 * Don't try to flush the inode if another inode in this cluster has
> +	 * already flushed it after we did the initial checks in
> +	 * xfs_reclaim_inode_grab().
> +	 */
> +	if (sync_mode & SYNC_TRYLOCK) {
> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
>  			goto out;
> -		xfs_iflock(ip);
> +		if (!xfs_iflock_nowait(ip))
> +			goto out_unlock;
> +	} else {
> +		xfs_ilock(ip, XFS_ILOCK_EXCL);
> +		if (!xfs_iflock_nowait(ip)) {
> +			if (!(sync_mode & SYNC_WAIT))
> +				goto out_unlock;
> +			xfs_iflock(ip);
> +		}
>  	}
>  
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> @@ -1215,9 +1227,10 @@ xfs_reclaim_inode(
>  
>  out_ifunlock:
>  	xfs_ifunlock(ip);
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  out:
>  	xfs_iflags_clear(ip, XFS_IRECLAIM);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	/*
>  	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
>  	 * a short while. However, this just burns CPU time scanning the tree
> -- 
> 2.22.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 12/24] xfs: correctly acount for reclaimable slabs
  2019-08-06  5:52   ` Christoph Hellwig
@ 2019-08-06 21:05     ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-06 21:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Mon, Aug 05, 2019 at 10:52:49PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 01, 2019 at 12:17:40PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The XFS inode item slab is actually reclaimed by inode shrinker
> > callbacks from the memory reclaim subsystem. These should be marked
> > as reclaimable so the mm subsystem has the full picture of how much
> > memory it can actually reclaim from the XFS slab caches.
> 
> Looks good,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Btw, I wonder if we should just kill off our KM_ZONE_* defines.  They
> just make it a little harder to figure out what is actually going on
> without a real benefit.

Yeah, they don't serve much purpose now, it might be worth cleaning
up.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 14/24] xfs: tail updates only need to occur when LSN changes
  2019-08-06 12:53         ` Brian Foster
@ 2019-08-06 21:11           ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-06 21:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 08:53:21AM -0400, Brian Foster wrote:
> On Tue, Aug 06, 2019 at 03:33:38PM +1000, Dave Chinner wrote:
> > On Tue, Aug 06, 2019 at 09:28:26AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 05, 2019 at 01:53:26PM -0400, Brian Foster wrote:
> > > > On Thu, Aug 01, 2019 at 12:17:42PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > We currently wake anything waiting on the log tail to move whenever
> > > > > the log item at the tail of the log is removed. Historically this
> > > > > was fine behaviour because there were very few items at any given
> > > > > LSN. But with delayed logging, there may be thousands of items at
> > > > > any given LSN, and we can't move the tail until they are all gone.
> > > > > 
> > > > > Hence if we are removing them in near tail-first order, we might be
> > > > > waking up processes waiting on the tail LSN to change (e.g. log
> > > > > space waiters) repeatedly without them being able to make progress.
> > > > > This also occurs with the new sync push waiters, and can result in
> > > > > thousands of spurious wakeups every second when under heavy direct
> > > > > reclaim pressure.
> > > > > 
> > > > > To fix this, check that the tail LSN has actually changed on the
> > > > > AIL before triggering wakeups. This will reduce the number of
> > > > > spurious wakeups when doing bulk AIL removal and make this code much
> > > > > more efficient.
> > > > > 
> > > > > XXX: occasionally get a temporary hang in xfs_ail_push_sync() with
> > > > > this change - log force from log worker gets things moving again.
> > > > > Only happens under extreme memory pressure - possibly push racing
> > > > > with a tail update on an empty log. Needs further investigation.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > 
> > > > Ok, this addresses the wakeup granularity issue mentioned in the
> > > > previous patch. Note that I was kind of wondering why we wouldn't base
> > > > this on the l_tail_lsn update in xlog_assign_tail_lsn_locked() as
> > > > opposed to the current approach.
> > > 
> > > Because I didn't think of it? :)
> > > 
> > > There's so much other stuff in this patch set I didn't spend a
> > > lot of time thinking about other alternatives. This was a simple
> > > code transformation that did what I wanted, and I went on to burning
> > > brain cells on other more complex issues that need to be solved...
> > > 
> > > > For example, xlog_assign_tail_lsn_locked() could simply check the
> > > > current min item against the current l_tail_lsn before it does the
> > > > assignment and use that to trigger tail change events. If we wanted to
> > > > also filter out the other wakeups (as this patch does) then we could
> > > > just pass a bool pointer or something that returns whether the tail
> > > > actually changed.
> > > 
> > > Yeah, I'll have a look at this - I might rework it as additional
> > > patches now the code is looking at decisions based on LSN rather
> > > than if the tail log item changed...
> > 
> > Ok, this is not worth the complexity. The wakeup code has to be able
> > to tell the difference between a changed tail lsn and an empty AIL
> > so that wakeups can be issued when the AIL is finally emptied.
> > Unmount (xfs_ail_push_all_sync()) relies on this, and
> > xlog_assign_tail_lsn_locked() hides the empty AIL from the caller
> > by returning log->l_last_sync_lsn to the caller.
> > 
> 
> Wouldn't either case just be a wakeup from xlog_assign_tail_lsn_locked()
> (which should probably be renamed if we took that approach)? It's called
> when we've removed the min item from the AIL and so potentially need to
> update the tail lsn. 

Not easily, because xlog_assign_tail_lsn_locked() is also used to
grab the current tail when we are formatting the log header during a
CIL checkpoint. We do not want to be doing wakeups there.
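
IIRC the caller I'm worried about is xlog_state_release_iclog(), which
does roughly:

	tail_lsn = xlog_assign_tail_lsn(log->l_mp);
	...
	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);

i.e. the LSN we return there gets stamped straight into the log record
header we are about to write.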

And, to tell the truth, I don't really want to screw with a function
that provides on-disk information for log recovery in this
series. That brings a whole new level of jeopardy to this patch set
I'd prefer to avoid....

> > Hence the wakeup code still has to check for an empty AIL if the
> > tail has changed if we use the return value of
> > xlog_assign_tail_lsn_locked() as the tail LSN. At which point, the
> > logic becomes somewhat convoluted, and it's far simpler to use
> > __xfs_ail_min_lsn as it returns 0 when the log is empty.
> > 
> > So, nice idea, but it doesn't make the code simpler or easier to
> > understand....
> 
> It's not that big of a deal either way. BTW on another quick look, I
> think something like xfs_ail_update_tail(ailp, old_tail) is a bit more
> self-documenting than xfs_ail_delete_finish(ailp, old_lsn).

I had already renamed it to xfs_ail_update_finish() when I updated
the last patch to include the bulk update case.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint
  2019-08-06 12:57       ` Brian Foster
@ 2019-08-06 21:21         ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-06 21:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 08:57:27AM -0400, Brian Foster wrote:
> On Tue, Aug 06, 2019 at 09:33:26AM +1000, Dave Chinner wrote:
> > I'll recheck this, but I'm pretty sure overwrite won't leave a
> > shadow buffer around.
> > 
> 
> But before that we have the following logic:
> 
> static void
> xlog_cil_alloc_shadow_bufs(
> 	...
> 
> 	if (!lip->li_lv_shadow ||
> 	    buf_size > lip->li_lv_shadow->lv_size) {
> 		...
> 		lv = kmem_alloc_large(buf_size, KM_SLEEP | KM_NOFS);
> 		...
> 		lip->li_lv_shadow = lv;
> 	} else {
> 		<reuse shadow>
> 	}
> 	...
> }
> 
> ... which always allocates a shadow buffer if one doesn't exist. We
> don't look at the currently used (lip->li_lv) buffer at all here. IIUC,
> that has to do with the TOCTOU race described in the big comment above
> the function.. hm?

You might be right there. I haven't had a chance to follow up on
this from yesterday yet, so I'll keep this in mind when I look at it
again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 17/24] xfs: don't block kswapd in inode reclaim
  2019-08-06 18:21   ` Brian Foster
@ 2019-08-06 21:27     ` Dave Chinner
  2019-08-07 11:14       ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-06 21:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 02:21:31PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:45PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We have a number of reasons for blocking kswapd in XFS inode
> > reclaim, mainly all to do with the fact that memory reclaim has no
> > feedback mechanisms to throttle on dirty slab objects that need IO
> > to reclaim.
> > 
> > As a result, we currently throttle inode reclaim by issuing IO in
> > the reclaim context. The unfortunate side effect of this is that it
> > can cause long tail latencies in reclaim and for some workloads this
> > can be a problem.
> > 
> > Now that the shrinkers finally have a method of telling kswapd to
> > back off, we can start the process of making inode reclaim in XFS
> > non-blocking. The first thing we need to do is not block kswapd, but
> > so that doesn't cause immediate serious problems, make sure inode
> > writeback is always underway when kswapd is running.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_icache.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 0b0fd10a36d4..2fa2f8dcf86b 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1378,11 +1378,22 @@ xfs_reclaim_inodes_nr(
> >  	struct xfs_mount	*mp,
> >  	int			nr_to_scan)
> >  {
> > -	/* kick background reclaimer and push the AIL */
> > +	int			sync_mode = SYNC_TRYLOCK;
> > +
> > +	/* kick background reclaimer */
> >  	xfs_reclaim_work_queue(mp);
> > -	xfs_ail_push_all(mp->m_ail);
> >  
> > -	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> > +	/*
> > +	 * For kswapd, we kick background inode writeback. For direct
> > +	 * reclaim, we issue and wait on inode writeback to throttle
> > +	 * reclaim rates and avoid shouty OOM-death.
> > +	 */
> > +	if (current_is_kswapd())
> > +		xfs_ail_push_all(mp->m_ail);
> 
> So we're unblocking kswapd from dirty items, but we already kick the AIL
> regardless of kswapd or not in inode reclaim. Why the change to no
> longer kick the AIL in the !kswapd case? Whatever the reasoning, a
> mention in the commit log would be helpful...

Because we used to block reclaim, we never knew how long it would be
before it came back (say it had to write 1024 inode buffers), so
every time we entered reclaim here we kicked the AIL in case we did
get blocked for a long time.

Now kswapd doesn't block at all, we know it's going to enter this
code repeatedly while direct reclaim is blocked, and so we only need
to kick background inode writeback from kswapd rather than from all
reclaim contexts now.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 18/24] xfs: reduce kswapd blocking on inode locking.
  2019-08-06 18:22   ` Brian Foster
@ 2019-08-06 21:33     ` Dave Chinner
  2019-08-07 11:30       ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-06 21:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 02:22:13PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:46PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When doing async node reclaiming, we grab a batch of inodes that we
> > are likely able to reclaim and ignore those that are already
> > flushing. However, when we actually go to reclaim them, the first
> > thing we do is lock the inode. If we are racing with something
> > else reclaiming the inode or flushing it because it is dirty,
> > we block on the inode lock. Hence we can still block kswapd here.
> > 
> > Further, if we flush an inode, we also cluster all the other dirty
> > inodes in that cluster into the same IO, flush locking them all.
> > However, if the workload is operating on sequential inodes (e.g.
> > created by a tarball extraction) most of these inodes will be
> > sequential in the cache and so in the same batch
> > we've already grabbed for reclaim scanning.
> > 
> > As a result, it is common for all the inodes in the batch to be
> > dirty and it is common for the first inode flushed to also flush all
> > the inodes in the reclaim batch. In which case, they are now all
> > going to be flush locked and we do not want to block on them.
> > 
> 
> Hmm... I think I'm missing something with this description. For dirty
> inodes that are flushed in a cluster via reclaim as described, aren't we
> already blocking on all of the flush locks by virtue of the synchronous
> I/O associated with the flush of the first dirty inode in that
> particular cluster?

Currently we end up issuing IO and waiting for it, so by the time we
get to the next inode in the cluster, it's already been cleaned and
unlocked.

However, as we go to non-blocking scanning, if we hit one
flush-locked inode in a batch, it's entirely likely that the rest of
the inodes in the batch are also flush locked, and so we should
always try to skip over them in non-blocking reclaim.

This is really just a stepping stone towards the way the LRU
isolation function works - it's entirely non-blocking and full
of lock order inversions, so everything has to run under try-lock
semantics. This is essentially starting that restructuring, based on
the observation that sequential inodes are flushed in batches...
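
To give you an idea of where this ends up, the isolation function gets
built from patterns like this (hand-waving sketch, not the real code -
it assumes the inode grows an i_lru list head and that a dispose list
is passed through the walk callback):

	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		return LRU_SKIP;
	if (!xfs_iflock_nowait(ip)) {
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		return LRU_SKIP;
	}

	/* locked and not flushing - move to the dispose list to reclaim */
	list_lru_isolate_move(lru, &ip->i_lru, dispose);
	return LRU_REMOVED;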

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim
  2019-08-06  5:57 ` [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Christoph Hellwig
@ 2019-08-06 21:37   ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-06 21:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Mon, Aug 05, 2019 at 10:57:44PM -0700, Christoph Hellwig wrote:
> Hi Dave,
> 
> do you have a git tree available to look over the whole series?

Not yet, I'll get one up for the next version of the patchset. I
have done some page cache vs inode cache balance testing. That,
FWIW, is not looking good - the vanilla 5.3-rc3 kernel is unable to
maintain a balanced page cache/inode cache working set under steady
state tarball-extraction workloads. Memory reclaim looks to have
been completely borked from a system balance perspective since I
last looked at it maybe a year ago....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/24] mm: directed shrinker work deferral
  2019-08-06 12:27           ` Brian Foster
@ 2019-08-06 22:22             ` Dave Chinner
  2019-08-07 11:13               ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-06 22:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Tue, Aug 06, 2019 at 08:27:54AM -0400, Brian Foster wrote:
> If you add a generic "defer work" knob to the shrinker mechanism, but
> only process it as an "allocation context" check, I expect it could be
> easily misused. For example, some shrinkers may decide to set the
> flag dynamically based on in-core state.

Which is already the case. e.g. There are shrinkers that don't do
anything because a try-lock fails.  I haven't attempted to change
them, but they are a clear example of how the shrinker context can
change even from one ->scan_objects call to the next.
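
e.g. the typical pattern looks something like this (generic sketch,
the names are made up):

	static unsigned long
	foo_cache_scan(
		struct shrinker		*shrink,
		struct shrink_control	*sc)
	{
		unsigned long		freed = 0;

		if (!mutex_trylock(&foo_cache_lock))
			return SHRINK_STOP;

		/* walk the cache, freeing up to sc->nr_to_scan objects */

		mutex_unlock(&foo_cache_lock);
		return freed;
	}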

> This will work when called from
> some contexts but not from others (unrelated to allocation context),
> which is confusing. Therefore, what I'm saying is that if the only
> current use case is to defer work from shrinkers that currently skip
> work due to allocation context restraints, this might be better codified
> with something like the appended (untested) example patch. This may or
> may not be a preferable interface to the flag, but it's certainly not an
> overcomplication...

I don't think this is the right way to go.

I want the filesystem shrinkers to become entirely non-blocking so
that we can dynamically decide on an object-by-object basis whether
we can reclaim the object in GFP_NOFS context.

That is, a clean XFS inode that requires no special cleanup can be
reclaimed even in GFP_NOFS context. The problem we have is that
dentry reclaim can drop the last reference to an inode, causing
inactivation and hence modification. However, if it's only going to
move to the inode LRU and not evict the inode, we can reclaim that
dentry. Similarly for inodes - if evicting the inode is not going to
block or modify the inode, we can reclaim the inode even under
> GFP_NOFS constraints. And the same for XFS inodes - if it's clean
we can reclaim it, GFP_NOFS context or not.

IMO, that's the direction we need to be heading in, and in those
cases the "deferred work" tends towards a count of objects we could
not reclaim during the scan because they require blocking work to be
done. i.e. deferred work is a boolean now because the GFP_NOFS
decision is boolean, but it lays the groundwork for deferred work
to be integrated at a much finer-grained level in the shrinker
scanning routines in future...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/24] mm: directed shrinker work deferral
  2019-08-06 22:22             ` Dave Chinner
@ 2019-08-07 11:13               ` Brian Foster
  0 siblings, 0 replies; 87+ messages in thread
From: Brian Foster @ 2019-08-07 11:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Wed, Aug 07, 2019 at 08:22:20AM +1000, Dave Chinner wrote:
> On Tue, Aug 06, 2019 at 08:27:54AM -0400, Brian Foster wrote:
> > If you add a generic "defer work" knob to the shrinker mechanism, but
> > only process it as an "allocation context" check, I expect it could be
> > easily misused. For example, some shrinkers may decide to set the the
> > flag dynamically based on in-core state.
> 
> Which is already the case. e.g. There are shrinkers that don't do
> anything because a try-lock fails.  I haven't attempted to change
> them, but they are a clear example of how the shrinker context can
> change even from one ->scan_objects call to the next.
> 

That's a similar point to what I'm trying to make wrt
->count_objects() and the new defer state..

> > This will work when called from
> > some contexts but not from others (unrelated to allocation context),
> > which is confusing. Therefore, what I'm saying is that if the only
> > current use case is to defer work from shrinkers that currently skip
> > work due to allocation context restraints, this might be better codified
> > with something like the appended (untested) example patch. This may or
> > may not be a preferable interface to the flag, but it's certainly not an
> > overcomplication...
> 
> I don't think this is the right way to go.
> 
> I want the filesystem shrinkers to become entirely non-blocking so
> that we can dynamically decide on an object-by-object basis whether
> we can reclaim the object in GFP_NOFS context.
> 

This is why I was asking about whether/how you envisioned the defer flag
looking in the future. Though I think this is somewhat orthogonal to the
discussion between having a bool or internal alloc mask set, because
both are of the same granularity and would need to change to operate on
a per-object basis.

> That is, a clean XFS inode that requires no special cleanup can be
> reclaimed even in GFP_NOFS context. The problem we have is that
> dentry reclaim can drop the last reference to an inode, causing
> inactivation and hence modification. However, if it's only going to
> move to the inode LRU and not evict the inode, we can reclaim that
> dentry. Similarly for inodes - if evicting the inode is not going to
> block or modify the inode, we can reclaim the inode even under
> > GFP_NOFS constraints. And the same for XFS inodes - if it's clean
> we can reclaim it, GFP_NOFS context or not.
> 
> IMO, that's the direction we need to be heading in, and in those
> cases the "deferred work" tends towards a count of objects we could
> not reclaim during the scan because they require blocking work to be
> done. i.e. deferred work is a boolean now because the GFP_NOFS
> decision is boolean, but it lays the groundwork for deferred work
> to be integrated at a much finer-grained level in the shrinker
> scanning routines in future...
> 

Yeah, this sounds more like it warrants a ->nr_deferred field or some
such, which could ultimately replace either of the previously discussed
options for deferring the entire instance. BTW, ISTM we could use that
kind of interface now for exactly what this patch is trying to
accomplish by changing those shrinkers with allocation context
restrictions to just transfer the entire scan count to the deferred
count in ->scan_objects() instead of setting the flag. That's somewhat
less churn in the long run because we aren't shifting the defer logic
back and forth between the count and scan callbacks unnecessarily. IMO,
it's also a cleaner interface than both options above.
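
i.e. for the allocation context case, something like this in the sb
shrinker instead of the current SHRINK_STOP return (sketch only, the
->nr_deferred field on the shrink_control is hypothetical):

	static unsigned long super_cache_scan(struct shrinker *shrink,
					      struct shrink_control *sc)
	{
		...
		if (!(sc->gfp_mask & __GFP_FS)) {
			/* can't run here, punt the whole scan to kswapd */
			sc->nr_deferred += sc->nr_to_scan;
			return 0;
		}
		...
	}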

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 17/24] xfs: don't block kswapd in inode reclaim
  2019-08-06 21:27     ` Dave Chinner
@ 2019-08-07 11:14       ` Brian Foster
  0 siblings, 0 replies; 87+ messages in thread
From: Brian Foster @ 2019-08-07 11:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Wed, Aug 07, 2019 at 07:27:04AM +1000, Dave Chinner wrote:
> On Tue, Aug 06, 2019 at 02:21:31PM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:45PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > We have a number of reasons for blocking kswapd in XFS inode
> > > reclaim, mainly all to do with the fact that memory reclaim has no
> > > feedback mechanisms to throttle on dirty slab objects that need IO
> > > to reclaim.
> > > 
> > > As a result, we currently throttle inode reclaim by issuing IO in
> > > the reclaim context. The unfortunate side effect of this is that it
> > > can cause long tail latencies in reclaim and for some workloads this
> > > can be a problem.
> > > 
> > > Now that the shrinkers finally have a method of telling kswapd to
> > > back off, we can start the process of making inode reclaim in XFS
> > > non-blocking. The first thing we need to do is not block kswapd, but
> > > so that doesn't cause immediate serious problems, make sure inode
> > > writeback is always underway when kswapd is running.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_icache.c | 17 ++++++++++++++---
> > >  1 file changed, 14 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 0b0fd10a36d4..2fa2f8dcf86b 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -1378,11 +1378,22 @@ xfs_reclaim_inodes_nr(
> > >  	struct xfs_mount	*mp,
> > >  	int			nr_to_scan)
> > >  {
> > > -	/* kick background reclaimer and push the AIL */
> > > +	int			sync_mode = SYNC_TRYLOCK;
> > > +
> > > +	/* kick background reclaimer */
> > >  	xfs_reclaim_work_queue(mp);
> > > -	xfs_ail_push_all(mp->m_ail);
> > >  
> > > -	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> > > +	/*
> > > +	 * For kswapd, we kick background inode writeback. For direct
> > > +	 * reclaim, we issue and wait on inode writeback to throttle
> > > +	 * reclaim rates and avoid shouty OOM-death.
> > > +	 */
> > > +	if (current_is_kswapd())
> > > +		xfs_ail_push_all(mp->m_ail);
> > 
> > So we're unblocking kswapd from dirty items, but we already kick the AIL
> > regardless of kswapd or not in inode reclaim. Why the change to no
> > longer kick the AIL in the !kswapd case? Whatever the reasoning, a
> > mention in the commit log would be helpful...
> 
> Because we used to block reclaim, we never knew how long it would be
> before it came back (say it had to write 1024 inode buffers), so
> every time we entered reclaim here we kicked the AIL in case we did
> get blocked for a long time.
> 
> Now kswapd doesn't block at all, we know it's going to enter this
> code repeatedly while direct reclaim is blocked, and so we only need
> to kick background inode writeback from kswapd rather than from all
> reclaim contexts now.
> 

Got it. Can you include this reasoning in the commit log description
please?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 18/24] xfs: reduce kswapd blocking on inode locking.
  2019-08-06 21:33     ` Dave Chinner
@ 2019-08-07 11:30       ` Brian Foster
  2019-08-07 23:16         ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-07 11:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Wed, Aug 07, 2019 at 07:33:53AM +1000, Dave Chinner wrote:
> On Tue, Aug 06, 2019 at 02:22:13PM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:46PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When doing async node reclaiming, we grab a batch of inodes that we
> > > are likely able to reclaim and ignore those that are already
> > > flushing. However, when we actually go to reclaim them, the first
> > > thing we do is lock the inode. If we are racing with something
> > > else reclaiming the inode or flushing it because it is dirty,
> > > we block on the inode lock. Hence we can still block kswapd here.
> > > 
> > > Further, if we flush an inode, we also cluster all the other dirty
> > > inodes in that cluster into the same IO, flush locking them all.
> > > However, if the workload is operating on sequential inodes (e.g.
> > > created by a tarball extraction) most of these inodes will be
> > > > sequential in the cache and so in the same batch
> > > we've already grabbed for reclaim scanning.
> > > 
> > > As a result, it is common for all the inodes in the batch to be
> > > dirty and it is common for the first inode flushed to also flush all
> > > the inodes in the reclaim batch. In which case, they are now all
> > > going to be flush locked and we do not want to block on them.
> > > 
> > 
> > Hmm... I think I'm missing something with this description. For dirty
> > inodes that are flushed in a cluster via reclaim as described, aren't we
> > already blocking on all of the flush locks by virtue of the synchronous
> > I/O associated with the flush of the first dirty inode in that
> > particular cluster?
> 
> Currently we end up issuing IO and waiting for it, so by the time we
> get to the next inode in the cluster, it's already been cleaned and
> unlocked.
> 

Right..

> However, as we go to non-blocking scanning, if we hit one
> flush-locked inode in a batch, it's entirely likely that the rest of
> the inodes in the batch are also flush locked, and so we should
> always try to skip over them in non-blocking reclaim.
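As a rough illustration of that skipping (a sketch of the idea, not code lifted from the patch), every lock taken in the per-inode reclaim step becomes a trylock and a flush-locked inode is passed over instead of waited on:

/* sketch: one inode of a non-blocking reclaim batch */
static bool
try_reclaim_one(
	struct xfs_inode	*ip)
{
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		return false;		/* locked elsewhere, skip it */
	if (!xfs_iflock_nowait(ip)) {
		/* mid-flush, and likely the whole cluster is too: skip */
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		return false;
	}
	/* ... reclaim the clean inode, or note its LSN and move on ... */
	xfs_ifunlock(ip);
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	return true;
}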
> 

This makes more sense. Note that the description is confusing because it
assumes context that doesn't exist in the code as of yet (i.e., no
mention of the nonblocking mode) and so isn't clear to the reader. If
the purpose is preparation for a future change, please note that clearly
in the commit log.

Second (and not necessarily caused by this patch), the ireclaim flag
semantics are kind of a mess. As you've already noted, we currently
block on some locks even with SYNC_TRYLOCK, yet the cluster flushing
code has no concept of these flags (so we always trylock, never wait on
unpin, for some reason use the shared ilock vs. the exclusive ilock,
etc.). Further, with this patch TRYLOCK|WAIT basically means that if we
happen to get the lock, we flush and wait on I/O so we can free the
inode(s), but if somebody else has flushed the inode (we don't get the
flush lock) we decide not to wait on the I/O that might (or might not)
already be in progress. I find that a bit inconsistent. It also makes me
slightly concerned that we're (ab)using flag semantics for a bug fix
(waiting on inodes we've just flushed from the same task), but it looks
like this is all going to change quite a bit still so I'm not going to
worry too much about this mostly existing mess until I grok the bigger
picture changes... :P

Brian

> This is really just a stepping stone in the logic to the way the
> LRU isolation function works - it's entirely non-blocking and full
> of lock order inversions, so everything has to run under try-lock
> semantics. This is essentially starting that restructuring, based on
> the observation that sequential inodes are flushed in batches...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 04/24] shrinker: defer work only to kswapd
  2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
  2019-08-02 15:34   ` Brian Foster
  2019-08-04 16:48   ` Nikolay Borisov
@ 2019-08-07 16:12   ` kbuild test robot
  2019-08-07 18:00   ` kbuild test robot
  3 siblings, 0 replies; 87+ messages in thread
From: kbuild test robot @ 2019-08-07 16:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kbuild-all, linux-xfs, linux-mm, linux-fsdevel

Hi Dave,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[cannot apply to v5.3-rc3 next-20190807]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dave-Chinner/mm-xfs-non-blocking-inode-reclaim/20190804-042311
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.1-rc1-7-g2b96cd8-dirty
        make ARCH=x86_64 allmodconfig
        make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> mm/vmscan.c:539:70: sparse: sparse: incorrect type in argument 1 (different base types) @@    expected struct atomic64_t [usertype] *v @@    got ruct atomic64_t [usertype] *v @@
>> mm/vmscan.c:539:70: sparse:    expected struct atomic64_t [usertype] *v
>> mm/vmscan.c:539:70: sparse:    got struct atomic_t [usertype] *
   arch/x86/include/asm/irqflags.h:54:9: sparse: sparse: context imbalance in 'check_move_unevictable_pages' - unexpected unlock

vim +539 mm/vmscan.c

   498	
   499	static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
   500					    struct shrinker *shrinker, int priority)
   501	{
   502		unsigned long freed = 0;
   503		int64_t freeable_objects = 0;
   504		int64_t scan_count;
   505		int64_t scanned_objects = 0;
   506		int64_t next_deferred = 0;
   507		int64_t deferred_count = 0;
   508		long new_nr;
   509		int nid = shrinkctl->nid;
   510		long batch_size = shrinker->batch ? shrinker->batch
   511						  : SHRINK_BATCH;
   512	
   513		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
   514			nid = 0;
   515	
   516		scan_count = shrink_scan_count(shrinkctl, shrinker, priority,
   517						&freeable_objects);
   518		if (scan_count == 0 || scan_count == SHRINK_EMPTY)
   519			return scan_count;
   520	
   521		/*
   522		 * If kswapd, we take all the deferred work and do it here. We don't let
   523		 * direct reclaim do this, because then it means some poor sod is going
   524		 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
   525		 * amount of reclaim work from concurrent kswapd operations. Hence we do
   526		 * the work in the wrong place, at the wrong time, and it's largely
   527		 * unpredictable.
   528		 *
   529		 * By doing the deferred work only in kswapd, we can schedule the work
   530		 * according to the reclaim priority - low priority reclaim will do
   531		 * less deferred work, hence we'll do more of the deferred work the more
   532		 * desperate we become for free memory. This avoids the need to
   533		 * specifically avoid deferred work windup, as low amounts of memory
   534		 * pressure won't excessively trim caches anymore.
   535		 */
   536		if (current_is_kswapd()) {
   537			int64_t	deferred_scan;
   538	
 > 539			deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
   540	
   541			/* we want to scan 5-10% of the deferred work here at minimum */
   542			deferred_scan = deferred_count;
   543			if (priority)
   544				do_div(deferred_scan, priority);
   545			scan_count += deferred_scan;
   546	
   547			/*
   548			 * If there is more deferred work than the number of freeable
   549			 * items in the cache, limit the amount of work we will carry
   550			 * over to the next kswapd run on this cache. This prevents
   551			 * deferred work windup.
   552			 */
   553			if (deferred_count > freeable_objects * 2)
   554				deferred_count = freeable_objects * 2;
   555	
   556		}
   557	
   558		/*
   559		 * Avoid risking looping forever due to too large nr value:
   560		 * never try to free more than twice the estimated number of
   561		 * freeable entries.
   562		 */
   563		if (scan_count > freeable_objects * 2)
   564			scan_count = freeable_objects * 2;
   565	
   566		trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
   567					   freeable_objects, scan_count,
   568					   scan_count, priority);
   569	
   570		/*
   571		 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
   572		 * defer the work to a context that can scan the cache.
   573		 */
   574		if (shrinkctl->will_defer)
   575			goto done;
   576	
   577		/*
   578		 * Normally, we should not scan less than batch_size objects in one
   579		 * pass to avoid too frequent shrinker calls, but if the slab has less
   580		 * than batch_size objects in total and we are really tight on memory,
   581		 * we will try to reclaim all available objects, otherwise we can end
   582		 * up failing allocations although there are plenty of reclaimable
   583		 * objects spread over several slabs with usage less than the
   584		 * batch_size.
   585		 *
   586		 * We detect the "tight on memory" situations by looking at the total
   587		 * number of objects we want to scan (total_scan). If it is greater
   588		 * than the total number of objects on slab (freeable), we must be
   589		 * scanning at high prio and therefore should try to reclaim as much as
   590		 * possible.
   591		 */
   592		while (scan_count >= batch_size ||
   593		       scan_count >= freeable_objects) {
   594			unsigned long ret;
   595			unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
   596	
   597			shrinkctl->nr_to_scan = nr_to_scan;
   598			shrinkctl->nr_scanned = nr_to_scan;
   599			ret = shrinker->scan_objects(shrinker, shrinkctl);
   600			if (ret == SHRINK_STOP)
   601				break;
   602			freed += ret;
   603	
   604			count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
   605			scan_count -= shrinkctl->nr_scanned;
   606			scanned_objects += shrinkctl->nr_scanned;
   607	
   608			cond_resched();
   609		}
   610	
   611	done:
   612		if (deferred_count)
   613			next_deferred = deferred_count - scanned_objects;
   614		else if (scan_count > 0)
   615			next_deferred = scan_count;
   616		/*
   617		 * move the unused scan count back into the shrinker in a
   618		 * manner that handles concurrent updates. If we exhausted the
   619		 * scan, there is no need to do an update.
   620		 */
   621		if (next_deferred > 0)
   622			new_nr = atomic_long_add_return(next_deferred,
   623							&shrinker->nr_deferred[nid]);
   624		else
   625			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
   626	
   627		trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
   628						scan_count);
   629		return freed;
   630	}
   631	
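As an aside, the deferral arithmetic in the kswapd branch above is easier to see in isolation. Below is a small userspace model with made-up numbers; only the priority scaling and the twice-freeable clamp mirror the listing:

#include <stdio.h>
#include <stdint.h>

/* model of the kswapd deferred-work scaling in do_shrink_slab() above */
static int64_t kswapd_scan_count(int64_t scan_count, int64_t deferred,
				 int64_t freeable, int priority)
{
	int64_t deferred_scan = deferred;

	/* take a priority-sized slice of the deferred work */
	if (priority)
		deferred_scan /= priority;
	scan_count += deferred_scan;

	/* never scan more than twice the freeable objects */
	if (scan_count > freeable * 2)
		scan_count = freeable * 2;
	return scan_count;
}

int main(void)
{
	/* light pressure (priority 12): adds 8000/12 = 666 deferred objects */
	printf("%lld\n", (long long)kswapd_scan_count(100, 8000, 1000, 12));
	/* desperate (priority 1): clamped to 2 * freeable = 2000 */
	printf("%lld\n", (long long)kswapd_scan_count(100, 8000, 1000, 1));
	return 0;
}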

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 04/24] shrinker: defer work only to kswapd
  2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
                     ` (2 preceding siblings ...)
  2019-08-07 16:12   ` kbuild test robot
@ 2019-08-07 18:00   ` kbuild test robot
  3 siblings, 0 replies; 87+ messages in thread
From: kbuild test robot @ 2019-08-07 18:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kbuild-all, linux-xfs, linux-mm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 7795 bytes --]

Hi Dave,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[cannot apply to v5.3-rc3 next-20190807]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dave-Chinner/mm-xfs-non-blocking-inode-reclaim/20190804-042311
config: i386-randconfig-a003-201931 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.2-10+deb8u1) 4.9.2
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   mm/vmscan.c: In function 'do_shrink_slab':
>> mm/vmscan.c:539:34: warning: passing argument 1 of 'atomic64_xchg' from incompatible pointer type
      deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
                                     ^
   In file included from arch/x86/include/asm/atomic.h:265:0,
                    from include/linux/atomic.h:7,
                    from include/linux/jump_label.h:249,
                    from include/linux/static_key.h:1,
                    from arch/x86/include/asm/nospec-branch.h:6,
                    from arch/x86/include/asm/paravirt_types.h:46,
                    from arch/x86/include/asm/ptrace.h:94,
                    from arch/x86/include/asm/math_emu.h:5,
                    from arch/x86/include/asm/processor.h:12,
                    from arch/x86/include/asm/cpufeature.h:5,
                    from arch/x86/include/asm/thread_info.h:53,
                    from include/linux/thread_info.h:38,
                    from arch/x86/include/asm/preempt.h:7,
                    from include/linux/preempt.h:78,
                    from include/linux/spinlock.h:51,
                    from include/linux/mmzone.h:8,
                    from include/linux/gfp.h:6,
                    from include/linux/mm.h:10,
                    from mm/vmscan.c:17:
   include/asm-generic/atomic-instrumented.h:1421:1: note: expected 'struct atomic64_t *' but argument is of type 'struct atomic_long_t *'
    atomic64_xchg(atomic64_t *v, s64 i)
    ^

vim +/atomic64_xchg +539 mm/vmscan.c

   498	
   499	static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
   500					    struct shrinker *shrinker, int priority)
   501	{
   502		unsigned long freed = 0;
   503		int64_t freeable_objects = 0;
   504		int64_t scan_count;
   505		int64_t scanned_objects = 0;
   506		int64_t next_deferred = 0;
   507		int64_t deferred_count = 0;
   508		long new_nr;
   509		int nid = shrinkctl->nid;
   510		long batch_size = shrinker->batch ? shrinker->batch
   511						  : SHRINK_BATCH;
   512	
   513		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
   514			nid = 0;
   515	
   516		scan_count = shrink_scan_count(shrinkctl, shrinker, priority,
   517						&freeable_objects);
   518		if (scan_count == 0 || scan_count == SHRINK_EMPTY)
   519			return scan_count;
   520	
   521		/*
   522		 * If kswapd, we take all the deferred work and do it here. We don't let
   523		 * direct reclaim do this, because then it means some poor sod is going
   524		 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
   525		 * amount of reclaim work from concurrent kswapd operations. Hence we do
   526		 * the work in the wrong place, at the wrong time, and it's largely
   527		 * unpredictable.
   528		 *
   529		 * By doing the deferred work only in kswapd, we can schedule the work
   530		 * according to the reclaim priority - low priority reclaim will do
   531		 * less deferred work, hence we'll do more of the deferred work the more
   532		 * desperate we become for free memory. This avoids the need to
   533		 * specifically avoid deferred work windup, as low amounts of memory
   534		 * pressure won't excessively trim caches anymore.
   535		 */
   536		if (current_is_kswapd()) {
   537			int64_t	deferred_scan;
   538	
 > 539			deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
   540	
   541			/* we want to scan 5-10% of the deferred work here at minimum */
   542			deferred_scan = deferred_count;
   543			if (priority)
   544				do_div(deferred_scan, priority);
   545			scan_count += deferred_scan;
   546	
   547			/*
   548			 * If there is more deferred work than the number of freeable
   549			 * items in the cache, limit the amount of work we will carry
   550			 * over to the next kswapd run on this cache. This prevents
   551			 * deferred work windup.
   552			 */
   553			if (deferred_count > freeable_objects * 2)
   554				deferred_count = freeable_objects * 2;
   555	
   556		}
   557	
   558		/*
   559		 * Avoid risking looping forever due to too large nr value:
   560		 * never try to free more than twice the estimated number of
   561		 * freeable entries.
   562		 */
   563		if (scan_count > freeable_objects * 2)
   564			scan_count = freeable_objects * 2;
   565	
   566		trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
   567					   freeable_objects, scan_count,
   568					   scan_count, priority);
   569	
   570		/*
   571		 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
   572		 * defer the work to a context that can scan the cache.
   573		 */
   574		if (shrinkctl->will_defer)
   575			goto done;
   576	
   577		/*
   578		 * Normally, we should not scan less than batch_size objects in one
   579		 * pass to avoid too frequent shrinker calls, but if the slab has less
   580		 * than batch_size objects in total and we are really tight on memory,
   581		 * we will try to reclaim all available objects, otherwise we can end
   582		 * up failing allocations although there are plenty of reclaimable
   583		 * objects spread over several slabs with usage less than the
   584		 * batch_size.
   585		 *
   586		 * We detect the "tight on memory" situations by looking at the total
   587		 * number of objects we want to scan (total_scan). If it is greater
   588		 * than the total number of objects on slab (freeable), we must be
   589		 * scanning at high prio and therefore should try to reclaim as much as
   590		 * possible.
   591		 */
   592		while (scan_count >= batch_size ||
   593		       scan_count >= freeable_objects) {
   594			unsigned long ret;
   595			unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
   596	
   597			shrinkctl->nr_to_scan = nr_to_scan;
   598			shrinkctl->nr_scanned = nr_to_scan;
   599			ret = shrinker->scan_objects(shrinker, shrinkctl);
   600			if (ret == SHRINK_STOP)
   601				break;
   602			freed += ret;
   603	
   604			count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
   605			scan_count -= shrinkctl->nr_scanned;
   606			scanned_objects += shrinkctl->nr_scanned;
   607	
   608			cond_resched();
   609		}
   610	
   611	done:
   612		if (deferred_count)
   613			next_deferred = deferred_count - scanned_objects;
   614		else if (scan_count > 0)
   615			next_deferred = scan_count;
   616		/*
   617		 * move the unused scan count back into the shrinker in a
   618		 * manner that handles concurrent updates. If we exhausted the
   619		 * scan, there is no need to do an update.
   620		 */
   621		if (next_deferred > 0)
   622			new_nr = atomic_long_add_return(next_deferred,
   623							&shrinker->nr_deferred[nid]);
   624		else
   625			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
   626	
   627		trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
   628						scan_count);
   629		return freed;
   630	}
   631	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33595 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO
  2019-08-01  2:17 ` [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO Dave Chinner
@ 2019-08-07 18:09   ` Brian Foster
  2019-08-07 23:10     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-07 18:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:48PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode reclaim currently issues its own inode IO when it comes
> across dirty inodes. This is used to throttle direct reclaim down to
> the rate at which we can reclaim dirty inodes. Failure to throttle
> in this manner results in the OOM killer being trivial to trigger
> even when there is lots of free memory available.
> 
> However, having direct reclaimers issue IO causes an amount of
> IO thrashing to occur. We can have up to the number of AGs in the
> filesystem concurrently issuing IO, plus the AIL pushing thread as
> well. This means we can have many competing sources of IO and they all
> end up thrashing and competing for the request slots in the block
> device.
> 
> Similar to dirty page throttling and the BDI flusher thread, we can
> use the AIL pushing thread as the sole place we issue inode writeback
> from and everything else waits for it to make progress. To do this,
> reclaim will skip over dirty inodes, but in doing so will record the
> lowest LSN of all the dirty inodes it skips. It will then push the
> AIL to this LSN and wait for it to complete that work.
> 
> In doing so, we block direct reclaim on the completion of at least one IO,
> thereby providing some level of throttling for when we encounter
> dirty inodes. However we gain the ability to scan and reclaim
> clean inodes in a non-blocking fashion. This allows us to
> remove all the per-ag reclaim locking that avoids excessive direct
> reclaim, as repeated concurrent direct reclaim will hit the same
> dirty inodes on block waiting on the same IO to complete.
> 

The last part of the above sentence sounds borked..

> Hence direct reclaim will be throttled directly by the rate at which
> dirty inodes are cleaned by AIL pushing, rather than by delays
> caused by competing IO submissions. This allows us to remove all the
> locking that limits direct reclaim concurrency and greatly
> simplifies the inode reclaim code now that it just skips dirty
> inodes.
> 
> Note: this patch by itself isn't completely able to throttle direct
> reclaim sufficiently to prevent OOM killer madness. We can't do that
> until we change the way we index reclaimable inodes in the next
> patch and can feed back state to the mm core sanely.  However, we
> can't change the way we index reclaimable inodes until we have
> IO-less non-blocking reclaim for both direct reclaim and kswapd
> reclaim.  Catch-22...
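Spelled out as code, the mechanism reads roughly like the sketch below; the helper names come from the diff that follows, but the loop is condensed and the exact call signatures are assumed:

	xfs_lsn_t	lsn, lowest_lsn = NULLCOMMITLSN;
	long		freed = 0;
	int		i;

	for (i = 0; i < nr_found; i++) {
		if (xfs_reclaim_inode(batch[i], pag, &lsn)) {
			freed++;		/* clean, reclaimed */
			continue;
		}
		/* dirty or busy: remember the oldest item we skipped */
		if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
			lowest_lsn = lsn;
	}

	/* throttle on writeback of the oldest skipped inode */
	if (lowest_lsn != NULLCOMMITLSN)
		xfs_ail_push_sync(mp->m_ail, lowest_lsn);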
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c    | 208 +++++++++++++++++------------------------
>  fs/xfs/xfs_mount.c     |   4 -
>  fs/xfs/xfs_mount.h     |   1 -
>  fs/xfs/xfs_trans_ail.c |   4 +-
>  4 files changed, 90 insertions(+), 127 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0bd4420a7e16..4c4c5bc12147 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -22,6 +22,7 @@
>  #include "xfs_dquot_item.h"
>  #include "xfs_dquot.h"
>  #include "xfs_reflink.h"
> +#include "xfs_log.h"
>  
>  #include <linux/iversion.h>
>  
> @@ -967,28 +968,42 @@ xfs_inode_ag_iterator_tag(
>  }
>  
>  /*
> - * Grab the inode for reclaim exclusively.
> - * Return 0 if we grabbed it, non-zero otherwise.
> + * Grab the inode for reclaim.
> + *
> + * Return false if we aren't going to reclaim it, true if it is a reclaim
> + * candidate.
> + *
> + * If the inode is clean or unreclaimable, return NULLCOMMITLSN to tell the
> + * caller it does not require flushing. Otherwise return the log item lsn of the
> + * inode so the caller can determine its inode flush target.  If we get the
> + * clean/dirty state wrong then it will be sorted in xfs_reclaim_inode() once we
> + * have locks held.
>   */
> -STATIC int
> +STATIC bool
>  xfs_reclaim_inode_grab(
>  	struct xfs_inode	*ip,
> -	int			flags)
> +	int			flags,
> +	xfs_lsn_t		*lsn)
>  {
>  	ASSERT(rcu_read_lock_held());
> +	*lsn = 0;

The comment above says we return NULLCOMMITLSN. Given the rest of the
code, I'm assuming we should just fix up the comment.

>  
>  	/* quick check for stale RCU freed inode */
>  	if (!ip->i_ino)
> -		return 1;
> +		return false;
>  
>  	/*
> -	 * If we are asked for non-blocking operation, do unlocked checks to
> -	 * see if the inode already is being flushed or in reclaim to avoid
> -	 * lock traffic.
> +	 * Do unlocked checks to see if the inode already is being flushed or in
> +	 * reclaim to avoid lock traffic. If the inode is not clean, return
> +	 * its position in the AIL for the caller to push to.
>  	 */
> -	if ((flags & SYNC_TRYLOCK) &&
> -	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> -		return 1;
> +	if (!xfs_inode_clean(ip)) {
> +		*lsn = ip->i_itemp->ili_item.li_lsn;
> +		return false;
> +	}
> +
> +	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> +		return false;
>  
>  	/*
>  	 * The radix tree lock here protects a thread in xfs_iget from racing
...
> @@ -1050,92 +1065,67 @@ xfs_reclaim_inode_grab(
>   *	clean		=> reclaim
>   *	dirty, async	=> requeue
>   *	dirty, sync	=> flush, wait and reclaim
> + *
> + * Returns true if the inode was reclaimed, false otherwise.
>   */
> -STATIC int
> +STATIC bool
>  xfs_reclaim_inode(
>  	struct xfs_inode	*ip,
>  	struct xfs_perag	*pag,
> -	int			sync_mode)
> +	xfs_lsn_t		*lsn)
>  {
> -	struct xfs_buf		*bp = NULL;
> -	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
> -	int			error;
> +	xfs_ino_t		ino;
> +
> +	*lsn = 0;
>  
> -restart:
> -	error = 0;
>  	/*
>  	 * Don't try to flush the inode if another inode in this cluster has
>  	 * already flushed it after we did the initial checks in
>  	 * xfs_reclaim_inode_grab().
>  	 */
> -	if (sync_mode & SYNC_TRYLOCK) {
> -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> -			goto out;
> -		if (!xfs_iflock_nowait(ip))
> -			goto out_unlock;
> -	} else {
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		if (!xfs_iflock_nowait(ip)) {
> -			if (!(sync_mode & SYNC_WAIT))
> -				goto out_unlock;
> -			xfs_iflock(ip);
> -		}
> -	}
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> +		goto out;
> +	if (!xfs_iflock_nowait(ip))
> +		goto out_unlock;
>  

Do we even need the flush lock here any more if we're never going to
flush from this context? The shutdown case just below notwithstanding
(which I'm also wondering if we should just drop given we abort from
xfs_iflush() on shutdown), the pin count is an atomic and the dirty
state changes under ilock.

Maybe I'm missing something else, but the reason I ask is that the
increased flush lock contention in codepaths that don't actually flush
once it's acquired gives me a bit of concern that we could reduce
effectiveness of the one task that actually does (xfsaild).

> +	/* If we are in shutdown, we don't care about blocking. */
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  		xfs_iunpin_wait(ip);
>  		/* xfs_iflush_abort() drops the flush lock */
>  		xfs_iflush_abort(ip, false);
>  		goto reclaim;
>  	}
> -	if (xfs_ipincount(ip)) {
> -		if (!(sync_mode & SYNC_WAIT))
> -			goto out_ifunlock;
> -		xfs_iunpin_wait(ip);
> -	}
> -	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
> -		xfs_ifunlock(ip);
> -		goto reclaim;
> -	}
>  
>  	/*
> -	 * Never flush out dirty data during non-blocking reclaim, as it would
> -	 * just contend with AIL pushing trying to do the same job.
> +	 * If it is pinned, we only want to flush this if there's nothing else
> +	 * to be flushed as it requires a log force. Hence we essentially set
> +	 * the LSN to flush the entire AIL which will end up triggering a log
> +	 * force to unpin this inode, but that will only happen if there are no
> +	 * other inodes in the scan that only need writeback.
>  	 */
> -	if (!(sync_mode & SYNC_WAIT))
> +	if (xfs_ipincount(ip)) {
> +		*lsn = ip->i_itemp->ili_last_lsn;

->ili_last_lsn comes from xfs_cil_ctx->sequence, which I don't think is
actually a physical LSN suitable for AIL pushing. The lsn assigned to
the item once it's physically logged and AIL inserted comes from
ctx->start_lsn, which comes from the iclog header and so is a physical
LSN.

That said, this usage of ili_last_lsn seems to disappear by the end of
the series...

>  		goto out_ifunlock;
> +	}
>  
...
> @@ -1205,39 +1189,28 @@ xfs_reclaim_inode(
>   * corrupted, we still want to try to reclaim all the inodes. If we don't,
>   * then a shut down during filesystem unmount reclaim walk will leak all the
>   * unreclaimed inodes.
> + *
> + * Return the number of inodes freed.
>   */
>  STATIC int
>  xfs_reclaim_inodes_ag(
>  	struct xfs_mount	*mp,
>  	int			flags,
> -	int			*nr_to_scan)
> +	int			nr_to_scan)
>  {
>  	struct xfs_perag	*pag;
> -	int			error = 0;
> -	int			last_error = 0;
>  	xfs_agnumber_t		ag;
> -	int			trylock = flags & SYNC_TRYLOCK;
> -	int			skipped;
> +	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
> +	long			freed = 0;
>  
> -restart:
>  	ag = 0;
> -	skipped = 0;
>  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
>  		unsigned long	first_index = 0;
>  		int		done = 0;
>  		int		nr_found = 0;
>  
>  		ag = pag->pag_agno + 1;
> -
> -		if (trylock) {
> -			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
> -				skipped++;
> -				xfs_perag_put(pag);
> -				continue;
> -			}
> -			first_index = pag->pag_ici_reclaim_cursor;
> -		} else
> -			mutex_lock(&pag->pag_ici_reclaim_lock);

I understand that the eliminated blocking drops a dependency on the
perag reclaim exclusion as described by the commit log, but I'm not sure
it's enough to justify removing it entirely. For one, the reclaim cursor
management looks potentially racy. Also, doesn't this exclusion provide
some balance for reclaim across AGs? E.g., if a bunch of reclaim threads
come in at the same time, this allows them to walk across AGs instead of
potentially stumbling over each other in the batching/grabbing code.

I see again that most of this code seems to ultimately go away, replaced
by an LRU mechanism so we no longer operate on a per-ag basis. I can see
how this becomes irrelevant with that mechanism, but I think it might
make more sense to drop this locking along with the broader mechanism in
the last patch or two of the series rather than doing it here. If
nothing else, that eliminates the need for the reviewer to consider this
transient "old mechanism + new locking" state as opposed to reasoning
about the old mechanism vs. new mechanism and why the old locking simply
no longer applies.

> +		first_index = pag->pag_ici_reclaim_cursor;
>  
>  		do {
>  			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
...
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 00d66175f41a..5802139f786b 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -676,8 +676,10 @@ xfs_ail_push_sync(
>  	spin_lock(&ailp->ail_lock);
>  	while ((lip = xfs_ail_min(ailp)) != NULL) {
>  		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> +	trace_printk("lip lsn 0x%llx thres 0x%llx targ 0x%llx",
> +			lip->li_lsn, threshold_lsn, ailp->ail_target);
>  		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> -		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
> +		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
>  			break;

Stale/mislocated changes?

Brian

>  		/* XXX: cmpxchg? */
>  		while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
> -- 
> 2.22.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO
  2019-08-07 18:09   ` Brian Foster
@ 2019-08-07 23:10     ` Dave Chinner
  2019-08-08 16:20       ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-07 23:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Wed, Aug 07, 2019 at 02:09:15PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:48PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Inode reclaim currently issues its own inode IO when it comes
> > across dirty inodes. This is used to throttle direct reclaim down to
> > the rate at which we can reclaim dirty inodes. Failure to throttle
> > in this manner results in the OOM killer being trivial to trigger
> > even when there is lots of free memory available.
> > 
> > However, having direct reclaimers issue IO causes an amount of
> > IO thrashing to occur. We can have up to the number of AGs in the
> > filesystem concurrently issuing IO, plus the AIL pushing thread as
> > well. This means we can have many competing sources of IO and they all
> > end up thrashing and competing for the request slots in the block
> > device.
> > 
> > Similar to dirty page throttling and the BDI flusher thread, we can
> > use the AIL pushing thread as the sole place we issue inode writeback
> > from and everything else waits for it to make progress. To do this,
> > reclaim will skip over dirty inodes, but in doing so will record the
> > lowest LSN of all the dirty inodes it skips. It will then push the
> > AIL to this LSN and wait for it to complete that work.
> > 
> > In doing so, we block direct reclaim on the completion of at least one IO,
> > thereby providing some level of throttling for when we encounter
> > dirty inodes. However we gain the ability to scan and reclaim
> > clean inodes in a non-blocking fashion. This allows us to
> > remove all the per-ag reclaim locking that avoids excessive direct
> > reclaim, as repeated concurrent direct reclaim will hit the same
> > dirty inodes on block waiting on the same IO to complete.
> > 
> 
> The last part of the above sentence sounds borked..

s/on/and/

:)

> >  /*
> > - * Grab the inode for reclaim exclusively.
> > - * Return 0 if we grabbed it, non-zero otherwise.
> > + * Grab the inode for reclaim.
> > + *
> > + * Return false if we aren't going to reclaim it, true if it is a reclaim
> > + * candidate.
> > + *
> > + * If the inode is clean or unreclaimable, return NULLCOMMITLSN to tell the
> > + * caller it does not require flushing. Otherwise return the log item lsn of the
> > + * inode so the caller can determine its inode flush target.  If we get the
> > + * clean/dirty state wrong then it will be sorted in xfs_reclaim_inode() once we
> > + * have locks held.
> >   */
> > -STATIC int
> > +STATIC bool
> >  xfs_reclaim_inode_grab(
> >  	struct xfs_inode	*ip,
> > -	int			flags)
> > +	int			flags,
> > +	xfs_lsn_t		*lsn)
> >  {
> >  	ASSERT(rcu_read_lock_held());
> > +	*lsn = 0;
> 
> The comment above says we return NULLCOMMITLSN. Given the rest of the
> code, I'm assuming we should just fix up the comment.

Yup, I think I've already fixed it.

> > -restart:
> > -	error = 0;
> >  	/*
> >  	 * Don't try to flush the inode if another inode in this cluster has
> >  	 * already flushed it after we did the initial checks in
> >  	 * xfs_reclaim_inode_grab().
> >  	 */
> > -	if (sync_mode & SYNC_TRYLOCK) {
> > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> > -			goto out;
> > -		if (!xfs_iflock_nowait(ip))
> > -			goto out_unlock;
> > -	} else {
> > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -		if (!xfs_iflock_nowait(ip)) {
> > -			if (!(sync_mode & SYNC_WAIT))
> > -				goto out_unlock;
> > -			xfs_iflock(ip);
> > -		}
> > -	}
> > +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> > +		goto out;
> > +	if (!xfs_iflock_nowait(ip))
> > +		goto out_unlock;
> >  
> 
> Do we even need the flush lock here any more if we're never going to
> flush from this context?

Ideally, no. But the inode may currently be being flushed, in which
case the incore inode is clean, but we can't reclaim it yet. Hence
we need the flush lock to serialise against IO completion.

> The shutdown case just below notwithstanding
> (which I'm also wondering if we should just drop given we abort from
> xfs_iflush() on shutdown), the pin count is an atomic and the dirty
> state changes under ilock.

The shutdown case has to handle pinned inodes, not just inodes being
flushed.

> Maybe I'm missing something else, but the reason I ask is that the
> increased flush lock contention in codepaths that don't actually flush
> once it's acquired gives me a bit of concern that we could reduce
> effectiveness of the one task that actually does (xfsaild).

The flush lock isn't a contended lock - it's actually a bit that is
protected by the i_flags_lock, so if we are contending on anything
it will be the flags lock. And, well, see the LRU isolate function
conversion of this code, because it changes how the flags lock is
used for reclaim but I haven't seen any contention as a result of
that change....
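For context, the flush "lock" at this point in the tree is essentially a flag bit, something along the lines of the sketch below; this is paraphrased from memory of the xfs_inode.h of that era rather than quoted, so treat the details as approximate:

static inline int
xfs_iflock_nowait(
	struct xfs_inode	*ip)
{
	/* xfs_iflags_test_and_set() takes ip->i_flags_lock internally */
	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
}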

> 
> > -	 * Never flush out dirty data during non-blocking reclaim, as it would
> > -	 * just contend with AIL pushing trying to do the same job.
> > +	 * If it is pinned, we only want to flush this if there's nothing else
> > +	 * to be flushed as it requires a log force. Hence we essentially set
> > +	 * the LSN to flush the entire AIL which will end up triggering a log
> > +	 * force to unpin this inode, but that will only happen if there are no
> > +	 * other inodes in the scan that only need writeback.
> >  	 */
> > -	if (!(sync_mode & SYNC_WAIT))
> > +	if (xfs_ipincount(ip)) {
> > +		*lsn = ip->i_itemp->ili_last_lsn;
> 
> ->ili_last_lsn comes from xfs_cil_ctx->sequence, which I don't think is
> actually a physical LSN suitable for AIL pushing. The lsn assigned to
> the item once it's physically logged and AIL inserted comes from
> ctx->start_lsn, which comes from the iclog header and so is a physical
> LSN.

Yup, I've already noticed and fixed that bug :)

> >  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
> >  		unsigned long	first_index = 0;
> >  		int		done = 0;
> >  		int		nr_found = 0;
> >  
> >  		ag = pag->pag_agno + 1;
> > -
> > -		if (trylock) {
> > -			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
> > -				skipped++;
> > -				xfs_perag_put(pag);
> > -				continue;
> > -			}
> > -			first_index = pag->pag_ici_reclaim_cursor;
> > -		} else
> > -			mutex_lock(&pag->pag_ici_reclaim_lock);
> 
> I understand that the eliminated blocking drops a dependency on the
> perag reclaim exclusion as described by the commit log, but I'm not sure
> it's enough to justify removing it entirely. For one, the reclaim cursor
> management looks potentially racy.

We really don't care if the cursor updates are racy. All that will
result in is some inode ranges being scanned twice in quick
succession. All this does now is prevent reclaim from starting at
the start of the AG every time it runs, so we end up with most
reclaimers iterating over previously unscanned inodes.

> Also, doesn't this exclusion provide
> some balance for reclaim across AGs? E.g., if a bunch of reclaim threads
> come in at the same time, this allows them to walk across AGs instead of
> > potentially stumbling over each other in the batching/grabbing code.

What they do now is leapfrog each other and work through the same
AGs much faster. The overall pattern of reclaim doesn't actually
change much, just the speed at which individual AGs are scanned.

But that was not what the locking was put in place for. The locking
was put in place to be able to throttle the number of concurrent
reclaimers issuing IO. If the reclaimers leapfrogged like they do
without the locking, then we end up with non-sequential inode
writeback patterns, and writeback performance goes really bad,
really quickly. Hence the locking is there to ensure we get
sequential inode writeback patterns from each AG that is being
reclaimed from. That can be optimised by block layer merging, and so
even though we might have a lot of concurrent reclaimers, we get
lots of large, well-formed IOs from each of them.

IOWs, the locking was all about preventing the IO patterns from
breaking down under memory pressure, not about optimising how
reclaimers interact with each other.

> I see again that most of this code seems to ultimately go away, replaced
> by an LRU mechanism so we no longer operate on a per-ag basis. I can see
> how this becomes irrelevant with that mechanism, but I think it might
> make more sense to drop this locking along with the broader mechanism in
> the last patch or two of the series rather than doing it here.

Fundamentally, this patch is all about shifting the IO and blocking
mechanisms to the AIL. This locking is no longer necessary, and it
actually gets in the way of doing non-blocking reclaim and shifting
the IO to the AIL. i.e. we block where we no longer need to, and
that causes more problems for this change than it solves.

> If
> nothing else, that eliminates the need for the reviewer to consider this
> transient "old mechanism + new locking" state as opposed to reasoning
> about the old mechanism vs. new mechanism and why the old locking simply
> no longer applies.

I think you're putting too much "make every step of the transition
perfect" focus on this. We've got to drop this locking to make
reclaim non-blocking, and we have to make reclaim non-blocking
before we can move to a LRU mechanisms that relies on LRU removal
being completely non-blocking and cannot issue IO. It's a waste of
time trying to break this down further and further into "perfect"
patches - it works well enough and without functional regressions so
it does not create landmines for people doing bisects, and that's
largely all that matters in the middle of a large patchset that is
making large algorithm changes...

> > +		first_index = pag->pag_ici_reclaim_cursor;
> >  
> >  		do {
> >  			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
> ...
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 00d66175f41a..5802139f786b 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -676,8 +676,10 @@ xfs_ail_push_sync(
> >  	spin_lock(&ailp->ail_lock);
> >  	while ((lip = xfs_ail_min(ailp)) != NULL) {
> >  		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> > +	trace_printk("lip lsn 0x%llx thres 0x%llx targ 0x%llx",
> > +			lip->li_lsn, threshold_lsn, ailp->ail_target);
> >  		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> > -		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
> > +		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
> >  			break;
> 
> Stale/mislocated changes?

I've already cleaned that one up, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 18/24] xfs: reduce kswapd blocking on inode locking.
  2019-08-07 11:30       ` Brian Foster
@ 2019-08-07 23:16         ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-07 23:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Wed, Aug 07, 2019 at 07:30:09AM -0400, Brian Foster wrote:
> Second (and not necessarily caused by this patch), the ireclaim flag
> semantics are kind of a mess. As you've already noted, we currently
> block on some locks even with SYNC_TRYLOCK, yet the cluster flushing
> code has no concept of these flags (so we always trylock, never wait on
> unpin, for some reason use the shared ilock vs. the exclusive ilock,
> etc.). Further, with this patch TRYLOCK|WAIT basically means that if we
> happen to get the lock, we flush and wait on I/O so we can free the
> inode(s), but if somebody else has flushed the inode (we don't get the
> flush lock) we decide not to wait on the I/O that might (or might not)
> already be in progress. I find that a bit inconsistent. It also makes me
> slightly concerned that we're (ab)using flag semantics for a bug fix
> (waiting on inodes we've just flushed from the same task), but it looks
> like this is all going to change quite a bit still so I'm not going to
> worry too much about this mostly existing mess until I grok the bigger
> picture changes... :P

Yes, SYNC_TRYLOCK/SYNC_WAIT semantics are a mess, but they all go
away later in the patchset.  Non-blocking reclaim makes SYNC_TRYLOCK
go away because everything becomes try-lock based, and SYNC_WAIT goes
away because only the xfs_reclaim_inodes() function needs to wait
for reclaim completion and so that gets its own LRU walker
implementation and the mode parameter is removed.
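For a sense of where that ends up, here is a sketch of the direction described above; the function name, the exact checks and the return values are assumed rather than taken from the later patches. Reclaim becomes a list_lru walk whose isolate callback only ever trylocks:

static enum lru_status
xfs_inode_reclaim_isolate(
	struct list_head	*item,
	struct list_lru_one	*lru,
	spinlock_t		*lru_lock,
	void			*arg)
{
	struct inode		*inode = container_of(item, struct inode,
						      i_lru);
	struct xfs_inode	*ip = XFS_I(inode);

	if (!spin_trylock(&ip->i_flags_lock))
		return LRU_SKIP;

	/*
	 * Non-blocking checks (clean? pinned? mid-flush?) would go here and
	 * pick between LRU_REMOVED for disposal and LRU_ROTATE to retry the
	 * inode on a later pass.
	 */
	spin_unlock(&ip->i_flags_lock);
	return LRU_ROTATE;
}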

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO
  2019-08-07 23:10     ` Dave Chinner
@ 2019-08-08 16:20       ` Brian Foster
  0 siblings, 0 replies; 87+ messages in thread
From: Brian Foster @ 2019-08-08 16:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 08, 2019 at 09:10:44AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 02:09:15PM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:48PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Inode reclaim currently issues its own inode IO when it comes
> > > across dirty inodes. This is used to throttle direct reclaim down to
> > > the rate at which we can reclaim dirty inodes. Failure to throttle
> > > in this manner results in the OOM killer being trivial to trigger
> > > even when there is lots of free memory available.
> > > 
> > > However, having direct reclaimers issue IO causes an amount of
> > > IO thrashing to occur. We can have up to the number of AGs in the
> > > filesystem concurrently issuing IO, plus the AIL pushing thread as
> > > well. This means we can have many competing sources of IO and they all
> > > end up thrashing and competing for the request slots in the block
> > > device.
> > > 
> > > Similar to dirty page throttling and the BDI flusher thread, we can
> > > use the AIL pushing thread as the sole place we issue inode writeback
> > > from and everything else waits for it to make progress. To do this,
> > > reclaim will skip over dirty inodes, but in doing so will record the
> > > lowest LSN of all the dirty inodes it skips. It will then push the
> > > AIL to this LSN and wait for it to complete that work.
> > > 
> > > In doing so, we block direct reclaim on the completion of at least one IO,
> > > thereby providing some level of throttling for when we encounter
> > > dirty inodes. However we gain the ability to scan and reclaim
> > > clean inodes in a non-blocking fashion. This allows us to
> > > remove all the per-ag reclaim locking that avoids excessive direct
> > > reclaim, as repeated concurrent direct reclaim will hit the same
> > > dirty inodes on block waiting on the same IO to complete.
> > > 
...
> > > -restart:
> > > -	error = 0;
> > >  	/*
> > >  	 * Don't try to flush the inode if another inode in this cluster has
> > >  	 * already flushed it after we did the initial checks in
> > >  	 * xfs_reclaim_inode_grab().
> > >  	 */
> > > -	if (sync_mode & SYNC_TRYLOCK) {
> > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> > > -			goto out;
> > > -		if (!xfs_iflock_nowait(ip))
> > > -			goto out_unlock;
> > > -	} else {
> > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > -		if (!xfs_iflock_nowait(ip)) {
> > > -			if (!(sync_mode & SYNC_WAIT))
> > > -				goto out_unlock;
> > > -			xfs_iflock(ip);
> > > -		}
> > > -	}
> > > +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> > > +		goto out;
> > > +	if (!xfs_iflock_nowait(ip))
> > > +		goto out_unlock;
> > >  
> > 
> > Do we even need the flush lock here any more if we're never going to
> > flush from this context?
> 
> Ideally, no. But the inode may currently be being flushed, in which
> case the incore inode is clean, but we can't reclaim it yet. Hence
> we need the flush lock to serialise against IO completion.
> 

Ok, so xfs_inode_clean() checks ili_fields and that state is cleared at
flush time (under the flush lock). Hmm, so xfs_inode_clean() basically
determines whether the inode needs flushing or not, not necessarily
whether the in-core inode is clean with respect to the on-disk inode. I
wonder if we could enhance that or provide a new helper variant to cover
the latter as well, perhaps by also looking at ->ili_last_fields (with a
separate lock). Anyways, that's a matter for a separate patch..
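For what it's worth, the helper being mooted could look something like the sketch below. It is entirely hypothetical: the name is made up and the locking needed to read ->ili_last_fields stably is hand-waved, as noted above.

/*
 * Hypothetical: clean with respect to the on-disk inode, i.e. nothing
 * logged that still needs flushing and no flush currently in flight.
 */
static inline bool
xfs_inode_fully_clean(
	struct xfs_inode	*ip)
{
	struct xfs_inode_log_item *iip = ip->i_itemp;

	if (!iip)
		return true;
	/* ili_fields: awaiting a flush; ili_last_fields: flush in flight */
	return !iip->ili_fields && !iip->ili_last_fields;
}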

> > The shutdown case just below notwithstanding
> > (which I'm also wondering if we should just drop given we abort from
> > xfs_iflush() on shutdown), the pin count is an atomic and the dirty
> > state changes under ilock.
> 
> The shutdown case has to handle pinned inodes, not just inodes being
> flushed.
> 
> > Maybe I'm missing something else, but the reason I ask is that the
> > increased flush lock contention in codepaths that don't actually flush
> > once it's acquired gives me a bit of concern that we could reduce
> > effectiveness of the one task that actually does (xfsaild).
> 
> The flush lock isn't a contended lock - it's actually a bit that is
> protected by the i_flags_lock, so if we are contending on anything
> it will be the flags lock. And, well, see the LRU isolate function
> conversion of this code, because it changes how the flags lock is
> used for reclaim but I haven't seen any contention as a result of
> that change....
> 

True, but I guess it's not so much the lock contention I'm concerned
about as opposed to the resulting impact of increased nonblocking
reclaim scanning on xfsaild. As of right now reclaim activity is highly
throttled and if a direct reclaimer does ultimately acquire the flush
lock, it will perform the flush.

With reduced blocking/throttling and no reclaim flushing, I'm wondering
how possible it is for a bunch of direct reclaimers to come in and
effectively cycle over the same batch of dirty inodes repeatedly and
faster than xfsaild to the point where xfsaild can't make progress
actually flushing some of these inodes. Looking again, the combination
of the unlocked dirty check before grabbing the inode and sync AIL push
after each pass might be enough to prevent this from being a problem.
The former allows a direct reclaimer to skip the lock cycle if the inode
is obviously dirty and if the lock cycle was necessary to detect dirty
state, the associated task will wait for a while before coming around
again.

...
> > >  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
> > >  		unsigned long	first_index = 0;
> > >  		int		done = 0;
> > >  		int		nr_found = 0;
> > >  
> > >  		ag = pag->pag_agno + 1;
> > > -
> > > -		if (trylock) {
> > > -			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
> > > -				skipped++;
> > > -				xfs_perag_put(pag);
> > > -				continue;
> > > -			}
> > > -			first_index = pag->pag_ici_reclaim_cursor;
> > > -		} else
> > > -			mutex_lock(&pag->pag_ici_reclaim_lock);
> > 
> > I understand that the eliminated blocking drops a dependency on the
> > perag reclaim exclusion as described by the commit log, but I'm not sure
> > it's enough to justify removing it entirely. For one, the reclaim cursor
> > management looks potentially racy.
> 
> We really don't care if the cursor updates are racy. All that will
> result in is some inode ranges being scanned twice in quick
> succession. All this does now is prevent reclaim from starting at
> the start of the AG every time it runs, so we end up with most
> reclaimers iterating over previously unscanned inodes.
> 

That might happen, sure, but a racing update may also cause a reclaimer
in progress to jump backwards and rescan a set of inodes it just
finished with (assuming some were dirty and left around). If that
happens once or twice it's probably not a big deal, but if you have a
bunch of reclaimer threads repeatedly stumbling over the same ranges
over and over again, ISTM the whole thing could slow down to a crawl and
do quite a bit less real scanning work than requested by the shrinker
core.

> > Also, doesn't this exclusion provide
> > some balance for reclaim across AGs? E.g., if a bunch of reclaim threads
> > come in at the same time, this allows them to walk across AGs instead of
> > potentially stumbling over eachother in the batching/grabbing code.
> 
> What they do now is leapfrog each other and work through the same
> AGs much faster. The overall pattern of reclaim doesn't actually
> change much, just the speed at which individual AGs are scanned.
> 
> But that was not what the locking was put in place for. The locking
> was put in place to be able to throttle the number of concurrent
> reclaimers issuing IO. If the reclaimers leapfrogged like they do
> without the locking, then we end up with non-sequential inode
> writeback patterns, and writeback performance goes really bad,
> really quickly. Hence the locking is there to ensure we get
> sequential inode writeback patterns from each AG that is being
> reclaimed from. That can be optimised by block layer merging, and so
> even though we might have a lot of concurrent reclaimers, we get
> lots of large, well-formed IOs from each of them.
> 
> IOWs, the locking was all about preventing the IO patterns from
> breaking down under memory pressure, not about optimising how
> reclaimers interact with each other.
> 

I don't dispute any of that, but that doesn't necessarily mean that code
added since the locking was added doesn't depend on the serialization
that was already in place at the time to function sanely. I'm not asking
for performance/optimization here.

> > I see again that most of this code seems to ultimately go away, replaced
> > by an LRU mechanism so we no longer operate on a per-ag basis. I can see
> > how this becomes irrelevant with that mechanism, but I think it might
> > make more sense to drop this locking along with the broader mechanism in
> > the last patch or two of the series rather than doing it here.
> 
> Fundamentally, this patch is all about shifting the IO and blocking
> mechanisms to the AIL. This locking is no longer necessary, and it
> actually gets in the way of doing non-blocking reclaim and shifting
> the IO to the AIL. i.e. we block where we no longer need to, and
> that causes more problems for this change than it solves.
> 

I could see the issue of blocking where we no longer need to, but this
patch could also change the contextual blocking behavior without
necessarily removing the lock entirely. Or better yet, drop the trylock
and reduce the scope of the lock to the inode grabbing and cursor update
(and re-sample the cursor on each iteration). The cursor update index
doesn't change after we grab the last inode we're going to try and
reclaim, but we currently hold the lock across the entire batch reclaim.
Doing the lookup and grab under the lock and then releasing it once we
start to reclaim allows multiple threads to drive reclaim of the same AG
as you want, but prevents those tasks from disrupting each other. We
still would have a blocking lock in this path, but it's purely
structural and we'd only block for what would otherwise be time spent
scanning over already handled inodes (i.e.  wasted CPU time). IOW, if
there's contention on the lock, then by definition some other task has
the same lookup index.

> > If
> > nothing else, that eliminates the need for the reviewer to consider this
> > transient "old mechanism + new locking" state as opposed to reasoning
> > about the old mechanism vs. new mechanism and why the old locking simply
> > no longer applies.
> 
> I think you're putting too much "make every step of the transition
> perfect" focus on this. We've got to drop this locking to make
> reclaim non-blocking, and we have to make reclaim non-blocking
> before we can move to a LRU mechanisms that relies on LRU removal
> being completely non-blocking and cannot issue IO. It's a waste of
> time trying to break this down further and further into "perfect"
> patches - it works well enough and without functional regressions so
> it does not create landmines for people doing bisects, and that's
> largely all that matters in the middle of a large patchset that is
> making large algorithm changes...
> 

I'm not really worried about maintaining perfection throughout every
step of this series. I just think this patch is a bit too fast and
loose.

In logistical terms, this patch and the next few continue to modify the
perag based reclaim mechanism, the second to last patch adds a new
reclaim mechanism to reclaim via the newly added LRU, and the final
patch removes the old/unused perag bits. The locking we're talking about
here is part of that old/unused mechanism that is ripped out in the
final patch, not common code shared between the two, so I don't see a
problem with either leaving it or changing it around as described above.
An untested patch (based on top of this one) for the latter is appended
for reference..

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4c4c5bc12147..87f3ca86dfd1 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1210,12 +1210,14 @@ xfs_reclaim_inodes_ag(
 		int		nr_found = 0;
 
 		ag = pag->pag_agno + 1;
-		first_index = pag->pag_ici_reclaim_cursor;
 
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
 
+			mutex_lock(&pag->pag_ici_reclaim_lock);
+			first_index = pag->pag_ici_reclaim_cursor;
+
 			rcu_read_lock();
 			nr_found = radix_tree_gang_lookup_tag(
 					&pag->pag_ici_root,
@@ -1225,6 +1227,7 @@ xfs_reclaim_inodes_ag(
 			if (!nr_found) {
 				done = 1;
 				rcu_read_unlock();
+				mutex_unlock(&pag->pag_ici_reclaim_lock);
 				break;
 			}
 
@@ -1266,6 +1269,11 @@ xfs_reclaim_inodes_ag(
 
 			/* unlock now we've grabbed the inodes. */
 			rcu_read_unlock();
+			if (!done)
+				pag->pag_ici_reclaim_cursor = first_index;
+			else
+				pag->pag_ici_reclaim_cursor = 0;
+			mutex_unlock(&pag->pag_ici_reclaim_lock);
 
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
@@ -1281,10 +1289,6 @@ xfs_reclaim_inodes_ag(
 
 		} while (nr_found && !done && nr_to_scan > 0);
 
-		if (!done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
 		xfs_perag_put(pag);
 	}
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bcf8f64d1b1f..a1805021c92f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -148,6 +148,7 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
+		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -199,6 +200,7 @@ xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
+		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
@@ -240,6 +242,7 @@ xfs_initialize_perag(
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
 out_free_pag:
+	mutex_destroy(&pag->pag_ici_reclaim_lock);
 	kmem_free(pag);
 out_unwind_new_pags:
 	/* unwind any prior newly initialized pags */
@@ -249,6 +252,7 @@ xfs_initialize_perag(
 			break;
 		xfs_buf_hash_destroy(pag);
 		xfs_iunlink_destroy(pag);
+		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3ed6d942240f..a585860eaa94 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -390,6 +390,7 @@ typedef struct xfs_perag {
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
+	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH 22/24] xfs: track reclaimable inodes using a LRU list
  2019-08-01  2:17 ` [PATCH 22/24] xfs: track reclaimable inodes using a LRU list Dave Chinner
@ 2019-08-08 16:36   ` Brian Foster
  2019-08-09  0:10     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-08 16:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:50PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we don't do IO from the inode reclaim code, there is no
> need to optimise inode scanning order for optimal IO
> characteristics. The AIL takes care of that for us, so now reclaim
> can focus on selecting the best inodes to reclaim.
> 
> Hence we can change the inode reclaim algorithm to a real LRU and
> remove the need to use the radix tree to track and walk inodes under
> reclaim. This frees up a radix tree bit and simplifies the code that
> marks inodes as reclaim candidates. It also simplifies the reclaim
> code - we don't need batching anymore and all the reclaim logic
> can be added to the LRU isolation callback.
> 
> Further, we get node aware reclaim at the xfs_inode level, which
> should help the per-node reclaim code free relevant inodes faster.
> 
> We can re-use the VFS inode lru pointers - once the inode has been
> reclaimed from the VFS, we can use these pointers ourselves. Hence
> we don't need to grow the inode to change the way we index
> reclaimable inodes.
> 
> Start by adding the list_lru tracking in parallel with the existing
> reclaim code. This makes it easier to see the LRU infrastructure
> separate to the reclaim algorithm changes. Especially the locking
> order, which is ip->i_flags_lock -> list_lru lock.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 31 +++++++------------------------
>  fs/xfs/xfs_icache.h |  1 -
>  fs/xfs/xfs_mount.h  |  1 +
>  fs/xfs/xfs_super.c  | 31 ++++++++++++++++++++++++-------
>  4 files changed, 32 insertions(+), 32 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index a59d3a21be5c..b5c4c1b6fd19 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
...
> @@ -1801,7 +1817,8 @@ xfs_fs_nr_cached_objects(
>  	/* Paranoia: catch incorrect calls during mount setup or teardown */
>  	if (WARN_ON_ONCE(!sb->s_fs_info))
>  		return 0;
> -	return xfs_reclaim_inodes_count(XFS_M(sb));
> +
> +	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);

Do we not need locking here, or are we just skipping it because this
apparently maintains a count field and accuracy isn't critical? If the
latter, a one liner comment would be useful.

Brian

>  }
>  
>  static long
> -- 
> 2.22.0
> 
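
As an aside, the parallel tracking described in the commit message reduces
to roughly the sketch below when an inode becomes reclaimable. The hook
point and helper name here are made up for illustration (this is not the
patch itself), but it shows the stated lock order of ip->i_flags_lock ->
list_lru internal lock, and why the VFS i_lru pointers can be reused:

static void
xfs_inode_mark_reclaimable_sketch(
        struct xfs_inode        *ip)
{
        struct xfs_mount        *mp = ip->i_mount;

        spin_lock(&ip->i_flags_lock);
        /* the VFS has already evicted this inode, so its i_lru head is free */
        __xfs_iflags_set(ip, XFS_IRECLAIMABLE);
        /* list_lru_add() takes the internal lru lock nested inside i_flags_lock */
        list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
        spin_unlock(&ip->i_flags_lock);
}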

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 23/24] xfs: reclaim inodes from the LRU
  2019-08-01  2:17 ` [PATCH 23/24] xfs: reclaim inodes from the LRU Dave Chinner
@ 2019-08-08 16:39   ` Brian Foster
  2019-08-09  1:20     ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-08 16:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 01, 2019 at 12:17:51PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Replace the AG radix tree walking reclaim code with a list_lru
> walker, giving us both node-aware and memcg-aware inode reclaim
> at the XFS level. This requires adding an inode isolation function to
> determine if the inode can be reclaimed, and a list walker to
> dispose of the inodes that were isolated.
> 
> We want the isolation function to be non-blocking. If we can't
> grab an inode then we either skip it or rotate it. If it's clean
> then we skip it, if it's dirty then we rotate to give it time to be

Do you mean we remove it if it's clean?

> cleaned before it is scanned again.
> 
> This congregates the dirty inodes at the tail of the LRU, which
> means that if we start hitting a majority of dirty inodes either
> there are lots of unlinked inodes in the reclaim list or we've
> reclaimed all the clean inodes and we've looped back on the dirty
> inodes. Either way, this is an indication we should tell kswapd to
> back off.
> 
> The non-blocking isolation function introduces a complexity for the
> filesystem shutdown case. When the filesystem is shut down, we want
> to free the inode even if it is dirty, and this may require
> blocking. We already hold the locks needed to do this blocking, so
> what we do is that we leave inodes locked - both the ILOCK and the
> flush lock - while they are sitting on the dispose list to be freed
> after the LRU walk completes.  This allows us to process the
> shutdown state outside the LRU walk where we can block safely.
> 
> Keep in mind we don't have to care about inode lock order or
> blocking with inode locks held here because a) we are using
> trylocks, and b) once marked with XFS_IRECLAIM they can't be found
> via the LRU and inode cache lookups will abort and retry. Hence
> nobody will try to lock them in any other context that might also be
> holding other inode locks.
> 
> Also convert xfs_reclaim_inodes() to use a LRU walk to free all
> the reclaimable inodes in the filesystem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 199 ++++++++++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_icache.h |  10 ++-
>  fs/xfs/xfs_inode.h  |   8 ++
>  fs/xfs/xfs_super.c  |  50 +++++++++--
>  4 files changed, 232 insertions(+), 35 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index b5c4c1b6fd19..e3e898a2896c 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
...
> @@ -1810,23 +1811,58 @@ xfs_fs_mount(
>  }
>  
>  static long
> -xfs_fs_nr_cached_objects(
> +xfs_fs_free_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> -	/* Paranoia: catch incorrect calls during mount setup or teardown */
> -	if (WARN_ON_ONCE(!sb->s_fs_info))
> -		return 0;
> +	struct xfs_mount	*mp = XFS_M(sb);
> +        struct xfs_ireclaim_args ra;

^ whitespace damage

> +	long freed;
>  
> -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> +	INIT_LIST_HEAD(&ra.freeable);
> +	ra.lowest_lsn = NULLCOMMITLSN;
> +	ra.dirty_skipped = 0;
> +
> +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> +					xfs_inode_reclaim_isolate, &ra);

This is more related to the locking discussion on the earlier patch, but
this looks like it has more similar serialization to the example patch I
posted than the one without locking at all. IIUC, this walk has an
internal lock per node lru that is held across the walk and passed into
the callback. We never cycle it, so for any given node we only allow one
reclaimer through here at a time.

That seems to be Ok given we don't do much in the isolation handler, the
lock isn't held across the dispose sequence and we're still batching in
the shrinker core on top of that. We're still serialized over the lru
fixups such that concurrent reclaimers aren't processing the same
inodes, however.

BTW I got a lockdep splat[1] for some reason on a straight mount/unmount
cycle with this patch.

Brian

[1] dmesg output:

[   39.011867] ============================================
[   39.012643] WARNING: possible recursive locking detected
[   39.013422] 5.3.0-rc2+ #205 Not tainted
[   39.014623] --------------------------------------------
[   39.015636] umount/871 is trying to acquire lock:
[   39.016122] 00000000ea09de26 (&xfs_nondir_ilock_class){+.+.}, at: xfs_ilock+0xd2/0x280 [xfs]
[   39.017072] 
[   39.017072] but task is already holding lock:
[   39.017832] 000000001a5b5707 (&xfs_nondir_ilock_class){+.+.}, at: xfs_ilock_nowait+0xcb/0x320 [xfs]
[   39.018909] 
[   39.018909] other info that might help us debug this:
[   39.019570]  Possible unsafe locking scenario:
[   39.019570] 
[   39.020248]        CPU0
[   39.020512]        ----
[   39.020778]   lock(&xfs_nondir_ilock_class);
[   39.021246]   lock(&xfs_nondir_ilock_class);
[   39.021705] 
[   39.021705]  *** DEADLOCK ***
[   39.021705] 
[   39.022338]  May be due to missing lock nesting notation
[   39.022338] 
[   39.023070] 3 locks held by umount/871:
[   39.023481]  #0: 000000004d39d244 (&type->s_umount_key#61){+.+.}, at: deactivate_super+0x43/0x50
[   39.024462]  #1: 0000000011270366 (&xfs_dir_ilock_class){++++}, at: xfs_ilock_nowait+0xcb/0x320 [xfs]
[   39.025488]  #2: 000000001a5b5707 (&xfs_nondir_ilock_class){+.+.}, at: xfs_ilock_nowait+0xcb/0x320 [xfs]
[   39.027163] 
[   39.027163] stack backtrace:
[   39.027681] CPU: 3 PID: 871 Comm: umount Not tainted 5.3.0-rc2+ #205
[   39.028534] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   39.029152] Call Trace:
[   39.029428]  dump_stack+0x67/0x90
[   39.029889]  __lock_acquire.cold.67+0x121/0x1f9
[   39.030519]  lock_acquire+0x90/0x170
[   39.031170]  ? xfs_ilock+0xd2/0x280 [xfs]
[   39.031603]  down_write_nested+0x4f/0xb0
[   39.032064]  ? xfs_ilock+0xd2/0x280 [xfs]
[   39.032684]  ? xfs_dispose_inodes+0x124/0x320 [xfs]
[   39.033575]  xfs_ilock+0xd2/0x280 [xfs]
[   39.034058]  xfs_dispose_inodes+0x124/0x320 [xfs]
[   39.034656]  xfs_reclaim_inodes+0x149/0x190 [xfs]
[   39.035381]  ? finish_wait+0x80/0x80
[   39.035855]  xfs_unmountfs+0x81/0x190 [xfs]
[   39.036443]  xfs_fs_put_super+0x35/0x90 [xfs]
[   39.037016]  generic_shutdown_super+0x6c/0x100
[   39.037554]  kill_block_super+0x21/0x50
[   39.037986]  deactivate_locked_super+0x34/0x70
[   39.038477]  cleanup_mnt+0xb8/0x140
[   39.038879]  task_work_run+0x9e/0xd0
[   39.039302]  exit_to_usermode_loop+0xb3/0xc0
[   39.039774]  do_syscall_64+0x206/0x210
[   39.040591]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   39.041771] RIP: 0033:0x7fcd4ec0536b
[   39.042627] Code: 0b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed 0a 0c 00 f7 d8 64 89 01 48
[   39.045336] RSP: 002b:00007ffdedf686c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[   39.046119] RAX: 0000000000000000 RBX: 00007fcd4ed2f1e4 RCX: 00007fcd4ec0536b
[   39.047506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055ad23f2ad90
[   39.048295] RBP: 000055ad23f2ab80 R08: 0000000000000000 R09: 00007ffdedf67440
[   39.049062] R10: 000055ad23f2adb0 R11: 0000000000000246 R12: 000055ad23f2ad90
[   39.049869] R13: 0000000000000000 R14: 000055ad23f2ac78 R15: 0000000000000000


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 22/24] xfs: track reclaimable inodes using a LRU list
  2019-08-08 16:36   ` Brian Foster
@ 2019-08-09  0:10     ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2019-08-09  0:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 08, 2019 at 12:36:53PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:50PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we don't do IO from the inode reclaim code, there is no
> > need to optimise inode scanning order for optimal IO
> > characteristics. The AIL takes care of that for us, so now reclaim
> > can focus on selecting the best inodes to reclaim.
> > 
> > Hence we can change the inode reclaim algorithm to a real LRU and
> > remove the need to use the radix tree to track and walk inodes under
> > reclaim. This frees up a radix tree bit and simplifies the code that
> > marks inodes as reclaim candidates. It also simplifies the reclaim
> > code - we don't need batching anymore and all the reclaim logic
> > can be added to the LRU isolation callback.
> > 
> > Further, we get node aware reclaim at the xfs_inode level, which
> > should help the per-node reclaim code free relevant inodes faster.
> > 
> > We can re-use the VFS inode lru pointers - once the inode has been
> > reclaimed from the VFS, we can use these pointers ourselves. Hence
> > we don't need to grow the inode to change the way we index
> > reclaimable inodes.
> > 
> > Start by adding the list_lru tracking in parallel with the existing
> > reclaim code. This makes it easier to see the LRU infrastructure
> > separate to the reclaim algorithm changes. Especially the locking
> > order, which is ip->i_flags_lock -> list_lru lock.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_icache.c | 31 +++++++------------------------
> >  fs/xfs/xfs_icache.h |  1 -
> >  fs/xfs/xfs_mount.h  |  1 +
> >  fs/xfs/xfs_super.c  | 31 ++++++++++++++++++++++++-------
> >  4 files changed, 32 insertions(+), 32 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index a59d3a21be5c..b5c4c1b6fd19 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> ...
> > @@ -1801,7 +1817,8 @@ xfs_fs_nr_cached_objects(
> >  	/* Paranoia: catch incorrect calls during mount setup or teardown */
> >  	if (WARN_ON_ONCE(!sb->s_fs_info))
> >  		return 0;
> > -	return xfs_reclaim_inodes_count(XFS_M(sb));
> > +
> > +	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> 
> Do we not need locking here,

No locking is needed - we have no global lock that protects the
list_lru that we could use to serialise the count - that would
completely destroy the scalability advantages the list_lru provide.
As it is, shrinker counts have always been inherently racy and so we
don't really care for accuracy anywhere in the shrinker code. Hence
there is no need to attempt to be accurate here, just like we didn't
attempt to be accurate for the per-AG reclaimable inode count
aggregation that this replaces.
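
For background, the shrinker core only uses the count to seed a
proportional scan target, roughly like the illustrative helper below
(this paraphrases the mm side from memory, so treat the exact scaling
as an assumption rather than a quote of the real code):

/*
 * Illustrative only: roughly how a ->count_objects() result is turned
 * into a scan target. A racy or stale count just nudges the target a
 * little; it can never cause a correctness problem.
 */
static unsigned long
shrinker_scan_target_sketch(
        unsigned long   freeable,       /* from ->count_objects() */
        int             priority,       /* reclaim priority */
        int             seeks,          /* shrinker->seeks, assumed non-zero */
        unsigned long   deferred)       /* work deferred from earlier calls */
{
        unsigned long   delta = (freeable >> priority) * 4 / seeks;

        return deferred + delta;        /* then scanned in batch-sized chunks */
}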

> or are we just skipping it because this
> apparently maintains a count field and accuracy isn't critical? If the
> latter, a one liner comment would be useful.

I don't think it needs comments as they would be stating the
obvious.  We don't have comments explaining this in any other
shrinker - it's just assumed that anyone working with shrinkers
already knows that the counts are not required to be exactly
accurate...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 23/24] xfs: reclaim inodes from the LRU
  2019-08-08 16:39   ` Brian Foster
@ 2019-08-09  1:20     ` Dave Chinner
  2019-08-09 12:36       ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-09  1:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Thu, Aug 08, 2019 at 12:39:05PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:51PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Replace the AG radix tree walking reclaim code with a list_lru
> > walker, giving us both node-aware and memcg-aware inode reclaim
> > at the XFS level. This requires adding an inode isolation function to
> > determine if the inode can be reclaimed, and a list walker to
> > dispose of the inodes that were isolated.
> > 
> > We want the isolation function to be non-blocking. If we can't
> > grab an inode then we either skip it or rotate it. If it's clean
> > then we skip it, if it's dirty then we rotate to give it time to be
> 
> Do you mean we remove it if it's clean?

No, I mean if we can't grab it and it's clean, then we just skip it,
leaving it at the head of the LRU for the next scanner to
immediately try to reclaim it. If it's dirty, we rotate it so that
time passes before we try to reclaim it again in the hope that it is
already clean by the time we've scanned through the entire LRU...
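
In code terms the decision is roughly the sketch below. This is not the
actual xfs_inode_reclaim_isolate() from the patch - the real one also
takes the flush lock, sets XFS_IRECLAIM, accounts lowest_lsn and
dirty_skipped, and leaves the inode locked for the dispose pass - and
the i_itemp test is only a crude stand-in for a proper cleanliness
check:

static enum lru_status
xfs_inode_reclaim_isolate_sketch(
        struct list_head        *item,
        struct list_lru_one     *lru,
        spinlock_t              *lru_lock,
        void                    *arg)
{
        struct inode            *inode = container_of(item, struct inode, i_lru);
        struct xfs_inode        *ip = XFS_I(inode);
        struct list_head        *dispose = arg;

        if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
                /* can't grab it: clean inodes stay put for the next scan... */
                if (!ip->i_itemp)
                        return LRU_SKIP;
                /* ...dirty inodes rotate so the AIL gets time to clean them */
                return LRU_ROTATE;
        }

        /* isolated: move to the dispose list, freeing happens after the walk */
        list_lru_isolate_move(lru, item, dispose);
        return LRU_REMOVED;
}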

> > +++ b/fs/xfs/xfs_super.c
> ...
> > @@ -1810,23 +1811,58 @@ xfs_fs_mount(
> >  }
> >  
> >  static long
> > -xfs_fs_nr_cached_objects(
> > +xfs_fs_free_cached_objects(
> >  	struct super_block	*sb,
> >  	struct shrink_control	*sc)
> >  {
> > -	/* Paranoia: catch incorrect calls during mount setup or teardown */
> > -	if (WARN_ON_ONCE(!sb->s_fs_info))
> > -		return 0;
> > +	struct xfs_mount	*mp = XFS_M(sb);
> > +        struct xfs_ireclaim_args ra;
> 
> ^ whitespace damage

Already fixed.

> > +	long freed;
> >  
> > -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> > +	INIT_LIST_HEAD(&ra.freeable);
> > +	ra.lowest_lsn = NULLCOMMITLSN;
> > +	ra.dirty_skipped = 0;
> > +
> > +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> > +					xfs_inode_reclaim_isolate, &ra);
> 
> This is more related to the locking discussion on the earlier patch, but
> this looks like it has more similar serialization to the example patch I
> posted than the one without locking at all. IIUC, this walk has an
> internal lock per node lru that is held across the walk and passed into
> the callback. We never cycle it, so for any given node we only allow one
> reclaimer through here at a time.

That's not a guarantee that list_lru gives us. It could drop its
internal lock at any time during that walk and we would be
blissfully unaware that it has done this. And at that point, the
reclaim context is completely unaware that other reclaim contexts
may be scanning the same LRU at the same time and are interleaving
with it.

And, really, that does not matter one iota. If multiple scanners are
interleaving, the reclaim traversal order and the decisions made are
no different from what a single reclaimer does.  i.e. we just don't
have to care if reclaim contexts interleave or not, because they
will not repeat work that has already been done unnecessarily.
That's one of the reasons for moving to IO-less LRU ordered reclaim
- it removes all the gross hacks we've had to implement to guarantee
reclaim scanning progress in one nice neat package of generic
infrastructure.

> That seems to be Ok given we don't do much in the isolation handler, the
> lock isn't held across the dispose sequence and we're still batching in
> the shrinker core on top of that. We're still serialized over the lru
> fixups such that concurrent reclaimers aren't processing the same
> inodes, however.

The only thing that we may need here is need_resched() checks if it
turns out that holding a lock for 1024 items to be scanned proves to
be too long to hold on to a single CPU. If we do that we'd cycle the
LRU lock and return RETRY or RETRY_REMOVE, hence enabling
finer-grained interleaving between reclaimers....
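
i.e. something like this inside the isolate callback (hypothetical - the
argument names follow the list_lru callback signature, and LRU_RETRY
requires the lru lock to be re-taken before returning):

        if (need_resched()) {
                /*
                 * Drop the lru lock so inserts/removals and other
                 * reclaimers can make progress, then tell the walker
                 * to restart its traversal.
                 */
                spin_unlock(lru_lock);
                cond_resched();
                spin_lock(lru_lock);
                return LRU_RETRY;
        }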

> BTW I got a lockdep splat[1] for some reason on a straight mount/unmount
> cycle with this patch.
....
> [   39.030519]  lock_acquire+0x90/0x170
> [   39.031170]  ? xfs_ilock+0xd2/0x280 [xfs]
> [   39.031603]  down_write_nested+0x4f/0xb0
> [   39.032064]  ? xfs_ilock+0xd2/0x280 [xfs]
> [   39.032684]  ? xfs_dispose_inodes+0x124/0x320 [xfs]
> [   39.033575]  xfs_ilock+0xd2/0x280 [xfs]
> [   39.034058]  xfs_dispose_inodes+0x124/0x320 [xfs]

False positive, AFAICT. It's complaining about the final xfs_ilock()
call we do before freeing the inode because we have other inodes
locked. I don't think this can deadlock because the inodes under
reclaim should not be usable by anyone else at this point because
they have the I_RECLAIM flag set.

I did notice this - I added an XXX comment to the case being
complained about to note that I needed to resolve this locking issue.

+        * Here we do an (almost) spurious inode lock in order to coordinate
+        * with inode cache radix tree lookups.  This is because the lookup
+        * can reference the inodes in the cache without taking references.
+        *
+        * We make that OK here by ensuring that we wait until the inode is
+        * unlocked after the lookup before we go ahead and free it. 
+        *
+        * XXX: need to check this is still true. Not sure it is.
         */

I added that last line in this patch. In more detail....

The comment is suggesting that we need to take the ILOCK to
co-ordinate with RCU protected lookups in progress before we RCU
free the inode. That's what RCU is supposed to do, so I'm not at all
sure what this is actually serialising against any more.

i.e. any racing radix tree lookup from this point in time is going
to see the XFS_IRECLAIM flag and ip->i_ino == 0 while under the
rcu_read_lock, and they will go try again after dropping all lock
context and waiting for a bit. The inode may remain visible until
the next rcu grace period expires, but all lookups will abort long
before they get anywhere near the ILOCK. And once the RCU grace
period expires, lookups will be locked out by the rcu_read_lock(),
the radix tree moves to a state where the removal of the inode is
guaranteed visible to all CPUs, and then the object is freed.
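
The lookup side back-off amounts to something like this (a sketch of
the recheck done under rcu_read_lock() and i_flags_lock, cf. the checks
in xfs_iget_cache_hit(); the exact flags tested in the tree may differ):

static int
xfs_iget_recheck_sketch(
        struct xfs_inode        *ip,
        xfs_ino_t               ino)
{
        int                     error = 0;

        /* called with rcu_read_lock() held by the cache lookup */
        spin_lock(&ip->i_flags_lock);
        if (ip->i_ino != ino || __xfs_iflags_test(ip, XFS_IRECLAIM))
                error = -EAGAIN;        /* back off, delay and retry the lookup */
        spin_unlock(&ip->i_flags_lock);
        return error;
}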

So the ILOCK should have no part in lookup serialisation, and I need
to go look at the history of the code to determine where and why
this was added, and whether the condition it protects against is
still a valid concern or not....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 23/24] xfs: reclaim inodes from the LRU
  2019-08-09  1:20     ` Dave Chinner
@ 2019-08-09 12:36       ` Brian Foster
  2019-08-11  2:17         ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Brian Foster @ 2019-08-09 12:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Fri, Aug 09, 2019 at 11:20:22AM +1000, Dave Chinner wrote:
> On Thu, Aug 08, 2019 at 12:39:05PM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:51PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Replace the AG radix tree walking reclaim code with a list_lru
> > > walker, giving us both node-aware and memcg-aware inode reclaim
> > > at the XFS level. This requires adding an inode isolation function to
> > > determine if the inode can be reclaimed, and a list walker to
> > > dispose of the inodes that were isolated.
> > > 
> > > We want the isolation function to be non-blocking. If we can't
> > > grab an inode then we either skip it or rotate it. If it's clean
> > > then we skip it, if it's dirty then we rotate to give it time to be
> > 
> > Do you mean we remove it if it's clean?
> 
> No, I mean if we can't grab it and it's clean, then we just skip it,
> leaving it at the head of the LRU for the next scanner to
> immediately try to reclaim it. If it's dirty, we rotate it so that
> time passes before we try to reclaim it again in the hope that it is
> already clean by the time we've scanned through the entire LRU...
> 

Ah, Ok. That could probably be worded more explicitly. E.g.:

"If we can't grab an inode, we skip it if it is clean or rotate it if
dirty. Dirty inode rotation gives the inode time to be cleaned before
it's scanned again. ..."

> > > +++ b/fs/xfs/xfs_super.c
> > ...
> > > @@ -1810,23 +1811,58 @@ xfs_fs_mount(
...
> > > +	long freed;
> > >  
> > > -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> > > +	INIT_LIST_HEAD(&ra.freeable);
> > > +	ra.lowest_lsn = NULLCOMMITLSN;
> > > +	ra.dirty_skipped = 0;
> > > +
> > > +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> > > +					xfs_inode_reclaim_isolate, &ra);
> > 
> > This is more related to the locking discussion on the earlier patch, but
> > this looks like it has more similar serialization to the example patch I
> > posted than the one without locking at all. IIUC, this walk has an
> > internal lock per node lru that is held across the walk and passed into
> > the callback. We never cycle it, so for any given node we only allow one
> > reclaimer through here at a time.
> 
> That's not a guarantee that list_lru gives us. It could drop its
> internal lock at any time during that walk and we would be
> blissfully unaware that it has done this. And at that point, the
> reclaim context is completely unaware that other reclaim contexts
> may be scanning the same LRU at the same time and are interleaving
> with it.
> 

What is not a guarantee? I'm not following your point here. I suppose it
technically could drop the lock, but then it would have to restart the
iteration and wouldn't exactly provide predictable batching capability
to users.

This internal lock protects the integrity of the list from external
adds/removes, etc., but it's also passed into the callback so of course
it can be cycled at any point. The callback just has to notify the
caller to restart the walk. E.g., from __list_lru_walk_one():

        /*
         * The lru lock has been dropped, our list traversal is
         * now invalid and so we have to restart from scratch.
         */

> And, really, that does not matter one iota. If multiple scanners are
> interleaving, the reclaim traversal order and the decisions made are
> no different from what a single reclaimer does.  i.e. we just don't
> have to care if reclaim contexts interleave or not, because they
> will not repeat work that has already been done unnecessarily.
> That's one of the reasons for moving to IO-less LRU ordered reclaim
> - it removes all the gross hacks we've had to implement to guarantee
> reclaim scanning progress in one nice neat package of generic
> infrastructure.
> 

Yes, exactly. The difference I'm pointing out is that with the earlier
patch that drops locking from the perag based scan mechanism, the
interleaving of multiple reclaim scanners over the same range is not
exactly the same as a single reclaimer. That sort of behavior is the
intent of the example patch I appended to change the locking instead of
removing it.

> > That seems to be Ok given we don't do much in the isolation handler, the
> > lock isn't held across the dispose sequence and we're still batching in
> > the shrinker core on top of that. We're still serialized over the lru
> > fixups such that concurrent reclaimers aren't processing the same
> > inodes, however.
> 
> The only thing that we may need here is need_resched() checks if it
> turns out that holding a lock for 1024 items to be scanned proves to
> be too long to hold on to a single CPU. If we do that we'd cycle the
> LRU lock and return RETRY or RETRY_REMOVE, hence enabling
> finer-grained interleaving between reclaimers....
> 

Sure, with the caveat that we restart the traversal..

> > BTW I got a lockdep splat[1] for some reason on a straight mount/unmount
> > cycle with this patch.
> ....
> > [   39.030519]  lock_acquire+0x90/0x170
> > [   39.031170]  ? xfs_ilock+0xd2/0x280 [xfs]
> > [   39.031603]  down_write_nested+0x4f/0xb0
> > [   39.032064]  ? xfs_ilock+0xd2/0x280 [xfs]
> > [   39.032684]  ? xfs_dispose_inodes+0x124/0x320 [xfs]
> > [   39.033575]  xfs_ilock+0xd2/0x280 [xfs]
> > [   39.034058]  xfs_dispose_inodes+0x124/0x320 [xfs]
> 
> False positive, AFAICT. It's complaining about the final xfs_ilock()
> call we do before freeing the inode because we have other inodes
> locked. I don't think this can deadlock because the inodes under
> reclaim should not be usable by anyone else at this point because
> they have the I_RECLAIM flag set.
> 

Ok. The code looked sane to me at a glance, but lockdep tends to confuse
the hell out of me.

Brian

> I did notice this - I added an XXX comment to the case being
> complained about to note that I needed to resolve this locking issue.
> 
> +        * Here we do an (almost) spurious inode lock in order to coordinate
> +        * with inode cache radix tree lookups.  This is because the lookup
> +        * can reference the inodes in the cache without taking references.
> +        *
> +        * We make that OK here by ensuring that we wait until the inode is
> +        * unlocked after the lookup before we go ahead and free it. 
> +        *
> +        * XXX: need to check this is still true. Not sure it is.
>          */
> 
> I added that last line in this patch. In more detail....
> 
> The comment is suggesting that we need to take the ILOCK to
> co-ordinate with RCU protected lookups in progress before we RCU
> free the inode. That's what RCU is supposed to do, so I'm not at all
> sure what this is actually serialising against any more.
> 
> i.e. any racing radix tree lookup from this point in time is going
> to see the XFS_IRECLAIM flag and ip->i_ino == 0 while under the
> rcu_read_lock, and they will go try again after dropping all lock
> context and waiting for a bit. The inode may remain visible until
> the next rcu grace period expires, but all lookups will abort long
> before they get anywhere near the ILOCK. And once the RCU grace
> period expires, lookups will be locked out by the rcu_read_lock(),
> the radix tree moves to a state where the removal of the inode is
> guaranteed visible to all CPUs, and then the object is freed.
> 
> So the ILOCK should have no part in lookup serialisation, and I need
> to go look at the history of the code to determine where and why
> this was added, and whether the condition it protects against is
> still a valid concern or not....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 23/24] xfs: reclaim inodes from the LRU
  2019-08-09 12:36       ` Brian Foster
@ 2019-08-11  2:17         ` Dave Chinner
  2019-08-11 12:46           ` Brian Foster
  0 siblings, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2019-08-11  2:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Fri, Aug 09, 2019 at 08:36:32AM -0400, Brian Foster wrote:
> On Fri, Aug 09, 2019 at 11:20:22AM +1000, Dave Chinner wrote:
> > On Thu, Aug 08, 2019 at 12:39:05PM -0400, Brian Foster wrote:
> > > On Thu, Aug 01, 2019 at 12:17:51PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Replace the AG radix tree walking reclaim code with a list_lru
> > > > walker, giving us both node-aware and memcg-aware inode reclaim
> > > > at the XFS level. This requires adding an inode isolation function to
> > > > determine if the inode can be reclaimed, and a list walker to
> > > > dispose of the inodes that were isolated.
> > > > 
> > > > We want the isolation function to be non-blocking. If we can't
> > > > grab an inode then we either skip it or rotate it. If it's clean
> > > > then we skip it, if it's dirty then we rotate to give it time to be
> > > 
> > > Do you mean we remove it if it's clean?
> > 
> > No, I mean if we can't grab it and it's clean, then we just skip it,
> > leaving it at the head of the LRU for the next scanner to
> > immediately try to reclaim it. If it's dirty, we rotate it so that
> > time passes before we try to reclaim it again in the hope that it is
> > already clean by the time we've scanned through the entire LRU...
> > 
> 
> Ah, Ok. That could probably be worded more explicitly. E.g.:
> 
> "If we can't grab an inode, we skip it if it is clean or rotate it if
> dirty. Dirty inode rotation gives the inode time to be cleaned before
> it's scanned again. ..."

*nod*

> > > > +++ b/fs/xfs/xfs_super.c
> > > ...
> > > > @@ -1810,23 +1811,58 @@ xfs_fs_mount(
> ...
> > > > +	long freed;
> > > >  
> > > > -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> > > > +	INIT_LIST_HEAD(&ra.freeable);
> > > > +	ra.lowest_lsn = NULLCOMMITLSN;
> > > > +	ra.dirty_skipped = 0;
> > > > +
> > > > +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> > > > +					xfs_inode_reclaim_isolate, &ra);
> > > 
> > > This is more related to the locking discussion on the earlier patch, but
> > > this looks like it has more similar serialization to the example patch I
> > > posted than the one without locking at all. IIUC, this walk has an
> > > internal lock per node lru that is held across the walk and passed into
> > > the callback. We never cycle it, so for any given node we only allow one
> > > reclaimer through here at a time.
> > 
> > That's not a guarantee that list_lru gives us. It could drop its
> > internal lock at any time during that walk and we would be
> > blissfully unaware that it has done this. And at that point, the
> > reclaim context is completely unaware that other reclaim contexts
> > may be scanning the same LRU at the same time and are interleaving
> > with it.
> > 
> 
> What is not a guarantee? I'm not following your point here. I suppose it
> technically could drop the lock, but then it would have to restart the
> iteration and wouldn't exactly provide predictable batching capability
> to users.

There is no guarantee that the list_lru_shrink_walk() provides a
single list walker at a time or that it provides predictable
batching capability to users.

> This internal lock protects the integrity of the list from external
> adds/removes, etc., but it's also passed into the callback so of course
> it can be cycled at any point. The callback just has to notify the
> caller to restart the walk. E.g., from __list_lru_walk_one():
> 
>         /*
>          * The lru lock has been dropped, our list traversal is
>          * now invalid and so we have to restart from scratch.
>          */

As the designer and author of the list_lru code, I do know how it
works. I also know exactly what problem this behaviour was
intended to solve, because I had to solve it to meet the
requirements I had for the infrastructure.

The isolation walk lock batching currently done is an optimisation
to minimise lru lock contention - it amortises the cost of getting
the lock over a substantial batch of work. If we drop the lock on
every item we try to isolate - my initial implementations did this -
then the lru lock thrashes badly against concurrent inserts and
deletes and scalability is not much better than the global lock it
was replacing.

IOWs, the behaviour we have now is a result of lock contention
optimisation to meet scalability requirements, not because of some
"predictable batching" requirement. If we were to rework the
traversal mechanism such that the lru lock was not necessary to
protect the state of the LRU list across the batch of isolate
callbacks, then we'd get the scalability we need but we'd completely
change the concurrency behaviour. The list would still do LRU
reclaim, and the isolate functions still work exactly as they
currently do (i.e. they work on just the item passed to them) but
we'd have concurrent reclaim contexts isolating items on the same
LRU concurrently rather than being serialised. And that's perfectly
fine, because the isolate/dispose architecture just doesn't care
how the items on the LRU are isolated for disposal.....
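
The dispose side is equally indifferent to how isolation was done. As a
sketch (the real xfs_dispose_inodes() also deals with the ILOCK/flush
lock held from isolation and with shutdown; the freeing call here is
just a stand-in):

static void
xfs_dispose_inodes_sketch(
        struct list_head        *freeable)
{
        struct inode            *inode, *next;

        list_for_each_entry_safe(inode, next, freeable, i_lru) {
                list_del_init(&inode->i_lru);
                /* no lru lock is held here, so blocking is fine */
                xfs_inode_free(XFS_I(inode));
        }
}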

What I'm trying to say is that the "isolation batching" we have is
not desirable but it is necessary, and because that's internal to
the list_lru implementation, we can change that behaviour however
we want and it won't affect the subsystems that own the objects
being reclaimed. They still just get handed a list of items to
dispose, and they all come from the reclaim end of the LRU list...

Indeed, the new XFS inode shrinker is not dependent on any specific
batching order, it's not dependent on isolation being serialised,
and it's not dependent on the lru_lock being held across the
isolation function. IOWs, it's set up just right to take advantage
of any increases in isolation concurrency that the list_lru
infrastructure could provide...

> > > That seems to be Ok given we don't do much in the isolation handler, the
> > > lock isn't held across the dispose sequence and we're still batching in
> > > the shrinker core on top of that. We're still serialized over the lru
> > > fixups such that concurrent reclaimers aren't processing the same
> > > inodes, however.
> > 
> > The only thing that we may need here is need_resched() checks if it
> > turns out that holding a lock for 1024 items to be scanned proves to
> > be too long to hold on to a single CPU. If we do that we'd cycle the
> > LRU lock and return RETRY or RETRY_REMOVE, hence enabling
> > finer-grained interleaving between reclaimers....
> > 
> 
> Sure, with the caveat that we restart the traversal..

Which only re-traverses the inodes we skipped because they were
locked at the time. IOWs, Skipping inodes is rare because if it is
in reclaim then the only things that can be contending is a radix
tree lookup in progress or an inode clustering operation
(write/free) in progress. Either way, they will be relatively rare
and very short term lock holds, so if we have to restart the scan
after dropping the lru lock then it's likely we'll restart at next
inode in line for reclaim, anyway....

Hence I don't think having to restart a traversal would really
matter all that much....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 23/24] xfs: reclaim inodes from the LRU
  2019-08-11  2:17         ` Dave Chinner
@ 2019-08-11 12:46           ` Brian Foster
  0 siblings, 0 replies; 87+ messages in thread
From: Brian Foster @ 2019-08-11 12:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm, linux-fsdevel

On Sun, Aug 11, 2019 at 12:17:47PM +1000, Dave Chinner wrote:
> On Fri, Aug 09, 2019 at 08:36:32AM -0400, Brian Foster wrote:
> > On Fri, Aug 09, 2019 at 11:20:22AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 08, 2019 at 12:39:05PM -0400, Brian Foster wrote:
> > > > On Thu, Aug 01, 2019 at 12:17:51PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
...
> > > > > +	long freed;
> > > > >  
> > > > > -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> > > > > +	INIT_LIST_HEAD(&ra.freeable);
> > > > > +	ra.lowest_lsn = NULLCOMMITLSN;
> > > > > +	ra.dirty_skipped = 0;
> > > > > +
> > > > > +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> > > > > +					xfs_inode_reclaim_isolate, &ra);
> > > > 
> > > > This is more related to the locking discussion on the earlier patch, but
> > > > this looks like it has more similar serialization to the example patch I
> > > > posted than the one without locking at all. IIUC, this walk has an
> > > > internal lock per node lru that is held across the walk and passed into
> > > > the callback. We never cycle it, so for any given node we only allow one
> > > > reclaimer through here at a time.
> > > 
> > > That's not a guarantee that list_lru gives us. It could drop its
> > > internal lock at any time during that walk and we would be
> > > blissfully unaware that it has done this. And at that point, the
> > > reclaim context is completely unaware that other reclaim contexts
> > > may be scanning the same LRU at the same time and are interleaving
> > > with it.
> > > 
> > 
> > What is not a guarantee? I'm not following your point here. I suppose it
> > technically could drop the lock, but then it would have to restart the
> > iteration and wouldn't exactly provide predictable batching capability
> > to users.
> 
> There is no guarantee that the list_lru_shrink_walk() provides a
> single list walker at a time or that it provides predictable
> batching capability to users.
> 

Hm, Ok. I suppose the code could change if that's your point. But how is
that relevant to how XFS reclaim behavior as of this patch compares to
the intermediate patch 20? Note again that this the only reason I bring
it up in this patch (which seems mostly sane to me). Perhaps I should
have just replied again to patch 20 to avoid confusion. Apologies for
that, but that ship has sailed...

To reiterate, I suggested not dropping the lock in patch 20 for
$reasons. You replied generally that doing so might cause more problems
than it solves. I replied with a compromise patch that leaves the lock,
but only acquires it across inode grabbing instead of the entire reclaim
operation. This essentially implements "serialized isolation batching"
as we've termed it here (assuming the patch is correct).

I eventually made it to this patch and observe that you end up with an
implementation that does serialized isolation batching. Of course the
locking and implementation is quite different with this being a whole
new mechanism, so performance and things could be quite different too.
But my point is that if serialized isolation batching is acceptable
behavior for XFS inode reclaim right now, then it seems reasonable to me
that it be acceptable in patch 20 for what is ultimately just an
intermediate state.

I appreciate all of the discussion and background information that
follows, but it gets way far off from the original feedback. I've still
seen no direct response to the thoughts above either here or in patch
20. Are you planning to incorporate that example patch or something
similar? If so, then I think we're on the same page.

> > This internal lock protects the integrity of the list from external
> > adds/removes, etc., but it's also passed into the callback so of course
> > it can be cycled at any point. The callback just has to notify the
> > caller to restart the walk. E.g., from __list_lru_walk_one():
> > 
> >         /*
> >          * The lru lock has been dropped, our list traversal is
> >          * now invalid and so we have to restart from scratch.
> >          */
> 
> As the designer and author of the list_lru code, I do know how it
> works. I also know exactly what problem this behaviour was
> intended to solve, because I had to solve it to meet the
> requirements I had for the infrastructure.
> 
> The isolation walk lock batching currently done is an optimisation
> to minimise lru lock contention - it amortises the cost of getting
> the lock over a substantial batch of work. If we drop the lock on
> every item we try to isolate - my initial implementations did this -
> then the lru lock thrashes badly against concurrent inserts and
> deletes and scalability is not much better than the global lock it
> was replacing.
> 

Ok.

> IOWs, the behaviour we have now is a result of lock contention
> optimisation to meet scalability requirements, not because of some
> "predictable batching" requirement. If we were to rework the
> traversal mechanism such that the lru lock was not necessary to
> protect the state of the LRU list across the batch of isolate
> callbacks, then we'd get the scalability we need but we'd completely
> change the concurrency behaviour. The list would still do LRU
> reclaim, and the isolate functions still work exactly as they
> currently do (i.e. they work on just the item passed to them) but
> we'd have concurrent reclaim contexts isolating items on the same
> LRU concurrently rather than being serialised. And that's perfectly
> fine, because the isolate/dispose architecture just doesn't care
> how the items on the LRU are isolated for disposal.....
> 

Sure, that mostly makes sense. We're free to similarly rework the old
mechanism to not require locking just as well. My point is that the
patch to move I/O submission to the AIL doesn't do that sufficiently,
despite that being the initial purpose of the lock.

BTW, the commit[1] that introduces the pag reclaim lock back in 2010
introduces the cursor at the same time and actually does mention some of
the things I'm pointing out as potential problems with patch 20. It
refers to potential for "massive contention" and prevention of
reclaimers "continually scanning the same inodes in each AG." Again, I'm
not worried about that for this series because we ultimately rework the
shrinker past those issues, but it appears that it wasn't purely a
matter of I/O submission ordering that motivated the lock (though I'm
sure that was part of it).

Given that, I think another reasonable option for patch 20 is to remove
the cursor and locking together. That at least matches some form of
historical reclaim behavior in XFS. If we took that approach, it might
make sense to split patch 20 into one patch that shifts I/O to xfsaild
(new behavior) and a subsequent that undoes [1].

[1] 69b491c214d7 ("xfs: serialise inode reclaim within an AG")

> What I'm trying to say is that the "isolation batching" we have is
> not desirable but it is necessary, and because that's internal to
> the list_lru implementation, we can change that behaviour however
> we want and it won't affect the subsystems that own the objects
> being reclaimed. They still just get handed a list of items to
> dispose, and they all come from the reclaim end of the LRU list...
> 
> Indeed, the new XFS inode shrinker is not dependent on any specific
> batching order, it's not dependent on isolation being serialised,
> and it's not dependent on the lru_lock being held across the
> isolation function. IOWs, it's set up just right to take advantage
> of any increases in isolation concurrency that the list_lru
> infrastructure could provide...
>

Yep, it's a nice abstraction. This has no bearing on the old
implementation which does have XFS specific locking, however.
 
> > > > That seems to be Ok given we don't do much in the isolation handler, the
> > > > lock isn't held across the dispose sequence and we're still batching in
> > > > the shrinker core on top of that. We're still serialized over the lru
> > > > fixups such that concurrent reclaimers aren't processing the same
> > > > inodes, however.
> > > 
> > > The only thing that we may need here is need_resched() checks if it
> > > turns out that holding a lock for 1024 items to be scanned proves to
> > > be too long to hold on to a single CPU. If we do that we'd cycle the
> > > LRU lock and return RETRY or RETRY_REMOVE, hence enabling
> > > finer-grained interleaving between reclaimers....
> > > 
> > 
> > Sure, with the caveat that we restart the traversal..
> 
> Which only re-traverses the inodes we skipped because they were
> locked at the time. IOWs, skipping inodes is rare because if an inode
> is in reclaim then the only things that can be contending are a radix
> tree lookup in progress or an inode clustering operation
> (write/free) in progress. Either way, they will be relatively rare
> and very short term lock holds, so if we have to restart the scan
> after dropping the lru lock then it's likely we'll restart at next
> inode in line for reclaim, anyway....
> 
> Hence I don't think having to restart a traversal would really
> matter all that much....
> 

Perhaps, that seems fairly reasonable given that we rotate dirty inodes.
I'm not totally convinced we might not thrash on an LRU with a high
population of dirty inodes or that a restart is even the simplest
approach to dealing with large batch sizes, but I'd reserve judgement on
that until there's code to review.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2019-08-11 12:46 UTC | newest]

Thread overview: 87+ messages
2019-08-01  2:17 [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Dave Chinner
2019-08-01  2:17 ` [PATCH 01/24] mm: directed shrinker work deferral Dave Chinner
2019-08-02 15:27   ` Brian Foster
2019-08-04  1:49     ` Dave Chinner
2019-08-05 17:42       ` Brian Foster
2019-08-05 23:43         ` Dave Chinner
2019-08-06 12:27           ` Brian Foster
2019-08-06 22:22             ` Dave Chinner
2019-08-07 11:13               ` Brian Foster
2019-08-01  2:17 ` [PATCH 02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers Dave Chinner
2019-08-02 15:27   ` Brian Foster
2019-08-04  1:50     ` Dave Chinner
2019-08-01  2:17 ` [PATCH 03/24] mm: factor shrinker work calculations Dave Chinner
2019-08-02 15:08   ` Nikolay Borisov
2019-08-04  2:05     ` Dave Chinner
2019-08-02 15:31   ` Brian Foster
2019-08-01  2:17 ` [PATCH 04/24] shrinker: defer work only to kswapd Dave Chinner
2019-08-02 15:34   ` Brian Foster
2019-08-04 16:48   ` Nikolay Borisov
2019-08-04 21:37     ` Dave Chinner
2019-08-07 16:12   ` kbuild test robot
2019-08-07 18:00   ` kbuild test robot
2019-08-01  2:17 ` [PATCH 05/24] shrinker: clean up variable types and tracepoints Dave Chinner
2019-08-01  2:17 ` [PATCH 06/24] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
2019-08-01  2:17 ` [PATCH 07/24] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
2019-08-01  2:17 ` [PATCH 08/24] mm: kswapd backoff for shrinkers Dave Chinner
2019-08-01  2:17 ` [PATCH 09/24] xfs: don't allow log IO to be throttled Dave Chinner
2019-08-01 13:39   ` Chris Mason
2019-08-01 23:58     ` Dave Chinner
2019-08-02  8:12       ` Christoph Hellwig
2019-08-02 14:11       ` Chris Mason
2019-08-02 18:34         ` Matthew Wilcox
2019-08-02 23:28         ` Dave Chinner
2019-08-05 18:32           ` Chris Mason
2019-08-05 23:09             ` Dave Chinner
2019-08-01  2:17 ` [PATCH 10/24] xfs: fix missed wakeup on l_flush_wait Dave Chinner
2019-08-01  2:17 ` [PATCH 11/24] xfs:: account for memory freed from metadata buffers Dave Chinner
2019-08-01  8:16   ` Christoph Hellwig
2019-08-01  9:21     ` Dave Chinner
2019-08-06  5:51       ` Christoph Hellwig
2019-08-01  2:17 ` [PATCH 12/24] xfs: correctly acount for reclaimable slabs Dave Chinner
2019-08-06  5:52   ` Christoph Hellwig
2019-08-06 21:05     ` Dave Chinner
2019-08-01  2:17 ` [PATCH 13/24] xfs: synchronous AIL pushing Dave Chinner
2019-08-05 17:51   ` Brian Foster
2019-08-05 23:21     ` Dave Chinner
2019-08-06 12:29       ` Brian Foster
2019-08-01  2:17 ` [PATCH 14/24] xfs: tail updates only need to occur when LSN changes Dave Chinner
2019-08-05 17:53   ` Brian Foster
2019-08-05 23:28     ` Dave Chinner
2019-08-06  5:33       ` Dave Chinner
2019-08-06 12:53         ` Brian Foster
2019-08-06 21:11           ` Dave Chinner
2019-08-01  2:17 ` [PATCH 15/24] xfs: eagerly free shadow buffers to reduce CIL footprint Dave Chinner
2019-08-05 18:03   ` Brian Foster
2019-08-05 23:33     ` Dave Chinner
2019-08-06 12:57       ` Brian Foster
2019-08-06 21:21         ` Dave Chinner
2019-08-01  2:17 ` [PATCH 16/24] xfs: Lower CIL flush limit for large logs Dave Chinner
2019-08-04 17:12   ` Nikolay Borisov
2019-08-01  2:17 ` [PATCH 17/24] xfs: don't block kswapd in inode reclaim Dave Chinner
2019-08-06 18:21   ` Brian Foster
2019-08-06 21:27     ` Dave Chinner
2019-08-07 11:14       ` Brian Foster
2019-08-01  2:17 ` [PATCH 18/24] xfs: reduce kswapd blocking on inode locking Dave Chinner
2019-08-06 18:22   ` Brian Foster
2019-08-06 21:33     ` Dave Chinner
2019-08-07 11:30       ` Brian Foster
2019-08-07 23:16         ` Dave Chinner
2019-08-01  2:17 ` [PATCH 19/24] xfs: kill background reclaim work Dave Chinner
2019-08-01  2:17 ` [PATCH 20/24] xfs: use AIL pushing for inode reclaim IO Dave Chinner
2019-08-07 18:09   ` Brian Foster
2019-08-07 23:10     ` Dave Chinner
2019-08-08 16:20       ` Brian Foster
2019-08-01  2:17 ` [PATCH 21/24] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
2019-08-01  2:17 ` [PATCH 22/24] xfs: track reclaimable inodes using a LRU list Dave Chinner
2019-08-08 16:36   ` Brian Foster
2019-08-09  0:10     ` Dave Chinner
2019-08-01  2:17 ` [PATCH 23/24] xfs: reclaim inodes from the LRU Dave Chinner
2019-08-08 16:39   ` Brian Foster
2019-08-09  1:20     ` Dave Chinner
2019-08-09 12:36       ` Brian Foster
2019-08-11  2:17         ` Dave Chinner
2019-08-11 12:46           ` Brian Foster
2019-08-01  2:17 ` [PATCH 24/24] xfs: remove unusued old inode reclaim code Dave Chinner
2019-08-06  5:57 ` [RFC] [PATCH 00/24] mm, xfs: non-blocking inode reclaim Christoph Hellwig
2019-08-06 21:37   ` Dave Chinner
