linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/28] mm, xfs: non-blocking inode reclaim
@ 2019-10-31 23:45 Dave Chinner
  2019-10-31 23:45 ` [PATCH 01/28] xfs: Lower CIL flush limit for large logs Dave Chinner
                   ` (27 more replies)
  0 siblings, 28 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

Hi folks,

This is an updated version of the non-blocking inode reclaim
patchset for XFS. Original description of the problem and patchset
is below the changelog.

Thoughts, comments and improvements welcome.

-Dave.

Changelog:

V3:

- rebased on 5.4-rc5 + linux-xfs/for-next
- fixed various typos and whitespace damage
- new patch to factor out inode lookup from xfs_ifree_cluster() to
  simplify the loop walking all the in-core inodes in the chunk
  being freed.
- fixed up issues with temporary xfs_ail_push_sync() scaffolding
- fixed up per-ag locking description in commit message of patch
  "xfs: use AIL pushing for inode reclaim IO"
- fixed missing per-ag reclaim cursor unlock
- made sync LSN calculation consistent between inode grab and
  reclaim for pinned inodes
- check against NULLCOMMITLSN rather than implicitly relying on
  XFS_LSN_CMP() to do the right thing.
- renamed xfs_reclaim_inodes -> xfs_reclaim_all_inodes().
- moved new LRU list init to xfs_fs_fill_super()
- fixed incorrect locking around xfs_ifunlock() in
  xfs_inode_reclaim_isolate().
- added lockdep_assert_held() to __xfs_iflock_nowait()
- use list_first_entry_or_null() in xfs_dispose_inodes()
- moved removal of ag reclaim walk into patch that changes reclaim
  to use the lru list walk.
- added patch to introduce down_write_non_owner() and friends
- added mrupdate_*_non_owner() for inode reclaim to use.
- fixed incorrect flag manipulation in xfs_iget_cache_hit()
- removed spurious extra xfs_ail_push_all() calls


V2:

https://lore.kernel.org/linux-xfs/20191009032124.10541-1-david@fromorbit.com/

- added current_reclaim_account_pages() wrapper for reclaim_state
  updates
- moved xfs_buf page free accounting to the page freeing code rather
  than the reclaim loop that drops the LRU buffer reference.
- increased log CIL flush limit by 2x
- moved xfs_buf GFP_NOFS reclaim changes to correct patch
- renamed sc->will_defer to sc->defer_work
- used min() in do_shrink_slab() appropriately
- converted xfs_trans_ail_update_bulk() to use new
  xfs_ail_update_finish() function (fixes missing wakeups w/
  xfs_ail_push_sync())
- fixed CIL size limit calculation units - was calculating min size
  in sectors rather than bytes. Fixes performance degradation w/
  patch series.
- dropped shadow buffer freeing - Brian pointed out it wasn't doing
  exactly what I thought it was, and the memory savings were from the
  massively undersized CIL limits, not early freeing of the shadow
  buffers. Needs rethinking.
- fixed stray NULLCOMMITLSN in comments
- pinned items don't track the commit LSN, just the CIL sequence
  number so we can't use that to push the AIL.
- removed stale tracing debug from AIL push code.
- fixes to memory reclaim shrinker accounting in 5.3-rc3 result in
  direct reclaim backoff working a whole lot better, such that it's
  no longer necessary for the XFS inode shrinker to wait for IO to
  complete. Changed the LRU reclaim logic to simply push on the AIL
  if dirty inodes are hit, but never wait on them. Relevant commits
  are:

  0308f7cf19c9 ("mm/vmscan.c: calculate reclaimed slab caches in all reclaim paths")
  e5ca8071fe65 ("mm/vmscan.c: add a new member reclaim_state in struct shrink_control")

- Added a patch to convert xfs_reclaim_inodes() to use
  xfs_ail_push_all() which gets rid of the last user of
  xfs_ail_push_sync(). This allows it to be removed as "temporary
  infrastructure for the series" rather than having to be fixed up
  and made robust. The optimisations and factoring it drove are
  retained, as they are still a net improvement overall.
- fixed atomic_long vs atomic64 issues with shrinker deferral
  rework.
- don't drop ag reclaim cursor locking any more, it gets removed
  when all the old reclaim code is removed.
- added patch to change inode reclaim vs unreferenced XFS inode
  lookup done by inode write clustering and inode cluster freeing.
  This gets rid of the need to cycle the ILOCK before running
  call_rcu() to queue the inode to be freed when the current RCU
  grace period expires. In doing so, the last major blocking point
  in XFS inode reclaim is removed.

V1 (original RFC):

https://lore.kernel.org/linux-xfs/20190801021752.4986-1-david@fromorbit.com/

----
Original text:

We've had a problem with inode reclaim for a long time - XFS is
capable of caching tens of millions of inodes with ease and dirtying
hundreds of thousands of those cached inodes every second. It is
also capable of reclaiming more than half a million clean inodes per
second per reclaim thread.

The result of this is that when there is a significant change in
sustained memory pressure on a system with a large inode cache,
memory reclaim rapidly frees all the clean inodes, but cannot make
progress on reclaiming dirty inodes because they are rate limited
by IO.

However, the shrinker infrastructure in the kernel has no way to
feed back rate limiting to the core memory reclaim algorithms.  In
fact there are no feedback mechanisms at all, and so when reclaim
has freed all the clean inodes and starts hitting dirty inodes, the
filesystem has no way of telling reclaim that the inode reclaim rate
has dropped from 500k/s to 500/s.

The result is that reclaim continues to try to free memory, and
because it makes no progress freeing inodes, it puts much more
pressure on the page LRUs and frees pages. When it runs out of
pages, it starts swapping, and when it runs out of swap or can't get
a page for swap-in it starts going on an OOM kill rampage.

That does nothing to "fix" the shortage of memory caused by the
slowness of dirty inode reclaim - if memory demand continues we just
keep hitting the OOM killer until either something critical is
killed or memory demand eases.

For a long time, XFS has avoided the insane spiral of shouty
OOM-killer rage death by cleaning inodes directly in the shrinker.
This has the effect of throttling memory reclaim to the rate at
which dirty inodes can be cleaned, and so when we get into the state
when memory reclaim is dependent on inode reclaim making progress
we don't ever allow LRU reclaim to run so far ahead of inode reclaim
that it winds up reclaim priority and runs out of LRU pages to
reclaim and/or swap.

This has a downside, though. When there is a large amount of clean
page cache and a small amount of inode cache that is dirty (e.g.
lots of file data pressure and/or application memory demand) the
inode reclaim shrinkers can run out of clean inodes to reclaim and
start blocking on inode writeback. This can result in long reclaim
latencies even though there is lots of memory that can be
immediately reclaimed from the page cache.

There are other issues, too. We also have to block kswapd, because
it will continue running until watermarks are satisfied, and that
is largely the vector for shouty swappy death if it doesn't back
off before priority windup from lack of progress occurs. Blocking
kswapd then affects direct reclaim function, which often backs off
expecting kswapd to make progress in the mean time. But if kswapd
is not making progress, direct reclaim ends up in priority windup
from lack of progress, too. This is especially prevalent in
workloads that have a high percentage of GFP_NOFS allocations (e.g.
filesystem modification workloads).

The shrinkers have another problem w/ GFP_NOFS reclaim: the work
that is deferred because the shrinker cannot make progress gets
lumped on the first reclaim context that can do that work. That
means a direct reclaimer might get lumped with scanning millions of
objects during low priority scanning when it should only be scanning
a couple of thousand objects. This can result in highly
unpredictable and extremely long direct reclaim delays.

This is most definitely sub-optimal, but it's better than random
and/or premature OOM killer invocation under trivial workloads and
lots of reclaimable memory still being available.

This patch set aims to fix all these problems. The memory reclaim
and shrinker changes involve:

- a substantial rework of how the shrinker defers work, moving all
  the deferred work to kswapd to remove all the unpredictability
  from direct reclaim.  Direct reclaim will only do the work the
  direct reclaim context determines is necessary.

- deferred work is capped, and the amount of deferred work kswapd
  will do in each scan is increased linearly w.r.t. increasing
  reclaim priority. Hence when we are desperate for memory, kswapd
  will be running all the deferred work as quickly as possible.

- The amount of deferred work and the amount of scanning that is
  done by the shrinkers is now tracked in the struct reclaim_state.
  This allows shrink_node() to see how much work is being done in
  comparison to both the LRU scanning and how much is being deferred
  to kswapd. This allows direct reclaim to back off when too much
  work is being deferred and hence allow kswapd to make progress on
  the deferred work while it waits.

- A "need backoff" flag has been added to the struct reclaim_state.
  This allows individual shrinkers to indicate to kswapd that they
  need some time to finish work before being scanned again. This is
  basically the same case as when kswapd backs off from LRU scanning.

  i.e. the LRU scanning has run into the tail of the LRU and is
  finding dirty objects that require IO to complete before reclaim
  can make further progress. This is exactly the same problem we
  have with inode reclaim in XFS, and it is this mechanism that
  enables us to move to IO-less inode reclaim.
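
To make the "capped, then scaled linearly with priority" idea in the
list above concrete, here is a minimal sketch. It is NOT the calculation
used in the patches: the helper name, the cap parameter and the exact
scaling are hypothetical, and it only assumes the usual convention that
reclaim priority runs from 12 (least urgent) down to 0 (most urgent).

```c
#include <assert.h>
#include <stdint.h>

#define SK_DEF_PRIORITY	12	/* assumed least-urgent reclaim priority */

/*
 * Hypothetical illustration: cap the deferred backlog, then hand kswapd a
 * share of it that grows linearly as the priority value drops towards 0.
 */
static uint64_t sk_deferred_scan(uint64_t deferred, uint64_t cap, int priority)
{
	uint64_t backlog = deferred < cap ? deferred : cap;
	uint64_t urgency;

	assert(priority >= 0 && priority <= SK_DEF_PRIORITY);
	urgency = (uint64_t)(SK_DEF_PRIORITY - priority) + 1;	/* 1..13 */
	return backlog * urgency / (SK_DEF_PRIORITY + 1);
}
```

Under these assumptions a capped backlog of 1300 objects yields 100
objects of deferred work at priority 12 and the full 1300 at priority 0.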

The XFS changes are all over the place, and address both the reclaim
blocking problems and all the other related issues I found while
working on this patchset. These involve:

- fixing IO priority inversion problems between metadata
  writeback (inodes!) and log IO caused by the block layer write
  throttling (more on this later).

- some slab caches weren't marked as reclaimable, so were
  incorrectly accounted. Also account for the pages xfs_buf reclaim
  releases.

- reduced the delayed logging dirty item aggregation size (the CIL).
  This defines the minimum amount of memory XFS can operate in when
  there are heavy modifications in progress.

- reduced the memory footprint of the CIL when repeated
  modifications to objects occur.

- Added a mechanism to push the AIL to a specific LSN (metadata
  modification epoch) and wait for it. This forms the basis for
  direct inode reclaim deferring IO and waiting for some progress
  without issuing IO itself.

- reworked inode reclaim to use a list_lru to track inodes in
  reclaim rather than a radix tree tag in the inode cache. We
  iterated the radix tree for reclaim because it resulted in optimal
  IO patterns from multiple concurrent reclaimers, but we don't have
  to care about that any more because all IO comes from the AIL now.

  This gives us true LRU reclaim, and it allows us to effectively
  determine when we've run out of clean inodes to easily reclaim and
  provide that feedback to the higher levels via the "need backoff"
  flag.

- direct reclaim is non-blocking while scanning, but at the end of a
  scan it will still block waiting for IO, but only for /some/
  progress to be made and not specific individual IOs.

- kswapd based reclaim is fully non-blocking.

The result is that there is now enough feedback from the shrinkers
into the main memory reclaim loop for it to back off in the
situations where back-off is required to avoid OOM killer
invocation, despite XFS now largely doing non-blocking reclaim.

Testing involves a 16p/16GB machine running a fsmark workload that
creates sustained heavy dirty inode cache pressure, then
progressively locking 2GB of memory at a time to squeeze the workload
into less and less memory. A vanilla kernel runs well up to 12GB
squeezed, but at 14GB squeezed performance goes to hell. With just
the hacky "don't block kswapd by removing SYNC_WAIT" patch that
people seem to like, OOM kills start when squeezed to 12GB. With
that extended to direct reclaim, OOM kills start when squeezed to
just 8GB. With the full patchset, it runs similar to a vanilla
kernel up to 12GB squeezed, and vastly out-performs the vanilla
kernel with 14GB squeezed. Performance only drops ~20% with a 14GB
squeeze, whereas the vanilla kernel sees up to a 90% drop in
performance.

I also ran testing with simoop, a simulated workload that Chris
Mason put together to demonstrate the long tail latency and
allocation stall problems the blocking in inode reclaim was causing.
The vanilla kernel averaged ~5 stalls/s over a test period of 10
hours, this patch series resulted in:

alloc stall rate = 0.00/sec (avg: 0.04) (p50: 0.04) (p95: 0.16) (p99: 0.32)

stalls almost going away entirely.

So the signs are there that this is a workable solution to the
problems caused by blocking inode reclaim without re-introducing the
Death-by-OOM-killer issues the blocking avoids.

Please note that I haven't fully gone non-blocking on direct reclaim
for a couple of reasons:

1. congestion_wait() and wait_iff_congested() are completely broken.
The blkmq change-over ripped out all the block layer congestion
reporting in 5.0 and didn't replace it with anything, so unless you
are operating on an NFS client, Ceph, FUSE or a DVD, congestion
checks and backoff aren't actually doing what they are supposed to.
i.e. wait_iff_congested() never blocks, and congestion_wait() always
sleeps for its full timeout.

IOWs, the whole bdi-based IO congestion feedback mechanism no longer
functions as intended, and so I'm betting a lot of the memory
reclaim heuristics no longer function as they were intended to...

2. The block layer write throttle is full of priority inversions.
Apart from the log IO one I fixed in this series, I noticed that
swap in/out has a major problem. I lost count of the number of OOM
kills that occurred from the swap in path when there were several
processes blocked in wbt_wait() in the block layer in the swap out
path. i.e. if swap out had been making progress, swap in would not
have oom killed. Hence I found it still necessary to throttle direct
reclaim back in the shrinker as there wasn't a reliable way to get
the core reclaim code to throttle effectively.

FWIW, from the swap in/out perspective, this whole inversion problem
is made worse by #1: the congestion_wait/wait_iff_congested
interfaces being broken. Direct reclaim uses wait_iff_congested() to
back off if kswapd has indicated that the node is congested
(PGDAT_CONGESTED) and reclaim is struggling to make progress.
However, this backoff never actually happens now and hence direct
reclaim barrels into the swap code as hard as it can and blocks in
wbt_wait() waiting behind other swap IO instead of backing off and
waiting for some IO to complete and then retrying its allocation....

So maybe if we fix the bdi congestion interfaces so they work again
we can get rid of the waiting in direct reclaim, but right now I
don't see any other choice....



Dave Chinner (28):
  xfs: Lower CIL flush limit for large logs
  xfs: Throttle commits on delayed background CIL push
  xfs: don't allow log IO to be throttled
  xfs: Improve metadata buffer reclaim accountability
  xfs: correctly acount for reclaimable slabs
  xfs: factor common AIL item deletion code
  xfs: tail updates only need to occur when LSN changes
  xfs: factor inode lookup from xfs_ifree_cluster
  mm: directed shrinker work deferral
  shrinkers: use defer_work for GFP_NOFS sensitive shrinkers
  mm: factor shrinker work calculations
  shrinker: defer work only to kswapd
  shrinker: clean up variable types and tracepoints
  mm: reclaim_state records pages reclaimed, not slabs
  mm: back off direct reclaim on excessive shrinker deferral
  mm: kswapd backoff for shrinkers
  xfs: synchronous AIL pushing
  xfs: don't block kswapd in inode reclaim
  xfs: reduce kswapd blocking on inode locking.
  xfs: kill background reclaim work
  xfs: use AIL pushing for inode reclaim IO
  xfs: remove mode from xfs_reclaim_inodes()
  xfs: track reclaimable inodes using a LRU list
  xfs: reclaim inodes from the LRU
  xfs: remove unusued old inode reclaim code
  xfs: use xfs_ail_push_all in xfs_reclaim_inodes
  rwsem: introduce down/up_write_non_owner
  xfs: rework unreferenced inode lookups

 drivers/staging/android/ashmem.c |   8 +-
 fs/gfs2/glock.c                  |   5 +-
 fs/gfs2/quota.c                  |   6 +-
 fs/inode.c                       |   3 +-
 fs/nfs/dir.c                     |   6 +-
 fs/super.c                       |   6 +-
 fs/xfs/mrlock.h                  |  27 ++
 fs/xfs/xfs_buf.c                 |   4 +-
 fs/xfs/xfs_icache.c              | 632 +++++++++----------------------
 fs/xfs/xfs_icache.h              |  26 +-
 fs/xfs/xfs_inode.c               | 219 +++++------
 fs/xfs/xfs_inode.h               |  18 +
 fs/xfs/xfs_inode_item.c          |  28 +-
 fs/xfs/xfs_log.c                 |  10 +-
 fs/xfs/xfs_log_cil.c             |  37 +-
 fs/xfs/xfs_log_priv.h            |  53 ++-
 fs/xfs/xfs_mount.c               |  10 +-
 fs/xfs/xfs_mount.h               |   6 +-
 fs/xfs/xfs_qm.c                  |  11 +-
 fs/xfs/xfs_super.c               |  93 +++--
 fs/xfs/xfs_trace.h               |   1 +
 fs/xfs/xfs_trans_ail.c           |  88 +++--
 fs/xfs/xfs_trans_priv.h          |   6 +-
 include/linux/rwsem.h            |   6 +
 include/linux/shrinker.h         |   9 +-
 include/linux/swap.h             |  23 +-
 include/trace/events/vmscan.h    |  69 ++--
 kernel/locking/rwsem.c           |  23 ++
 mm/slab.c                        |   3 +-
 mm/slob.c                        |   4 +-
 mm/slub.c                        |   3 +-
 mm/vmscan.c                      | 231 +++++++----
 net/sunrpc/auth.c                |   5 +-
 33 files changed, 863 insertions(+), 816 deletions(-)

-- 
2.24.0.rc0



* [PATCH 01/28] xfs: Lower CIL flush limit for large logs
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-10-31 23:45 ` [PATCH 02/28] xfs: Throttle commits on delayed background CIL push Dave Chinner
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The current CIL size aggregation limit is 1/8th the log size. This
means for large logs we might be aggregating at least 250MB of dirty objects
in memory before the CIL is flushed to the journal. With CIL shadow
buffers sitting around, this means the CIL is often consuming >500MB
of temporary memory that is all allocated under GFP_NOFS conditions.

Flushing the CIL can take some time to do if there is other IO
ongoing, and can introduce substantial log force latency by itself.
It also pins the memory until the objects are in the AIL and can be
written back and reclaimed by shrinkers. Hence this threshold also
tends to determine the minimum amount of memory XFS can operate in
under heavy modification without triggering the OOM killer.

Modify the CIL space limit to prevent such huge amounts of pinned
metadata from aggregating. We can have 2MB of log IO in flight at
once, so limit aggregation to 16x this size. This threshold was
chosen as it has little impact on performance (on 16-way fsmark) or log
traffic but pins a lot less memory on large logs especially under
heavy memory pressure.  An aggregation limit of 8x had 5-10%
performance degradation and a 50% increase in log throughput for
the same workload, so clearly that was too small for highly
concurrent workloads on large logs.
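
As a rough illustration of the arithmetic above, the new limit is the
smaller of one eighth of the log size and 16 times the 2MB of log IO
that can be in flight. The sketch below is standalone and is not the
kernel macro: cil_space_limit() and ICLOG_WINDOW_BYTES are hypothetical
names, and the 2MB window is taken from the description above rather
than derived from the iclog geometry.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption from the text above: up to 2MB of log IO can be in flight. */
#define ICLOG_WINDOW_BYTES	(2ULL << 20)

/* Hypothetical helper: smaller of 1/8th of the log and 16x the window. */
static uint64_t cil_space_limit(uint64_t log_size_bytes)
{
	uint64_t eighth = log_size_bytes >> 3;
	uint64_t window_cap = ICLOG_WINDOW_BYTES << 4;	/* 16 * 2MB = 32MB */

	assert(log_size_bytes > 0);
	return eighth < window_cap ? eighth : window_cap;
}
```

With these assumptions a 64MB log keeps its existing 8MB limit, while a
2GB log drops from 256MB to 32MB.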

This was found via trace analysis of AIL behaviour. e.g. insertion
from a single CIL flush:

xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL

$ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
1721823
$

So there were 1.7 million objects inserted into the AIL from this
CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which
was the end of the trace (i.e. it hadn't finished). Clearly a major
problem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log_priv.h | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 4f19375f6592..abd382cfffe3 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -318,13 +318,30 @@ struct xfs_cil {
  * tries to keep 25% of the log free, so we need to keep below that limit or we
  * risk running out of free log space to start any new transactions.
  *
- * In order to keep background CIL push efficient, we will set a lower
- * threshold at which background pushing is attempted without blocking current
- * transaction commits.  A separate, higher bound defines when CIL pushes are
- * enforced to ensure we stay within our maximum checkpoint size bounds.
- * threshold, yet give us plenty of space for aggregation on large logs.
+ * In order to keep background CIL push efficient, we only need to ensure the
+ * CIL is large enough to maintain sufficient in-memory relogging to avoid
+ * repeated physical writes of frequently modified metadata. If we allow the CIL
+ * to grow to a substantial fraction of the log, then we may be pinning hundreds
+ * of megabytes of metadata in memory until the CIL flushes. This can cause
+ * issues when we are running low on memory - pinned memory cannot be reclaimed,
+ * and the CIL consumes a lot of memory. Hence we need to set an upper physical
+ * size limit for the CIL that limits the maximum amount of memory pinned by the
+ * CIL but does not limit performance by reducing relogging efficiency
+ * significantly.
+ *
+ * As such, the CIL push threshold ends up being the smaller of two thresholds:
+ * - a threshold large enough that it allows CIL to be pushed and progress to be
+ *   made without excessive blocking of incoming transaction commits. This is
+ *   defined to be 12.5% of the log space - half the 25% push threshold of the
+ *   AIL.
+ * - small enough that it doesn't pin excessive amounts of memory but maintains
+ *   close to peak relogging efficiency. This is defined to be 16x the iclog
+ *   buffer window (32MB) as measurements have shown this to be roughly the
+ *   point of diminishing performance increases under highly concurrent
+ *   modification workloads.
  */
-#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
+#define XLOG_CIL_SPACE_LIMIT(log)	\
+	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
 
 /*
  * ticket grant locks, queues and accounting have their own cachlines
-- 
2.24.0.rc0



* [PATCH 02/28] xfs: Throttle commits on delayed background CIL push
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
  2019-10-31 23:45 ` [PATCH 01/28] xfs: Lower CIL flush limit for large logs Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-11-01 12:04   ` Brian Foster
  2019-10-31 23:45 ` [PATCH 03/28] xfs: don't allow log IO to be throttled Dave Chinner
                   ` (25 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

In certain situations the background CIL push can be indefinitely
delayed. While we have workarounds for the obvious cases now, they
don't solve the underlying issue. This issue is that there is no
upper limit on the CIL where we will either force or wait for
a background push to start, hence allowing the CIL to grow without
bound until it consumes all log space.

To fix this, add a new wait queue to the CIL which allows background
pushes to wait for the CIL context to be switched out. This happens
when the push starts, so it will allow us to block incoming
transaction commit completion until the push has started. This will
only affect processes that are running modifications, and only when
the CIL threshold has been significantly overrun.

This has no apparent impact on performance, and doesn't even trigger
until over 45 million inodes had been created in a 16-way fsmark
test on a 2GB log. That was limiting at 64MB of log space used, so
the active CIL size is only about 3% of the total log in that case.
The concurrent removal of those files did not trigger the background
sleep at all.
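
The resulting commit-side behaviour can be summarised as a three-way
decision. This is only a simplified model of the description above: the
enum, the function name and its parameters are hypothetical, and it
ignores the locking and the push sequence check done by the real code.
It assumes the blocking threshold is twice the background push
threshold, as this patch defines it.

```c
#include <assert.h>
#include <stdint.h>

enum cil_commit_action {
	CIL_NO_PUSH,		/* below the background push threshold */
	CIL_QUEUE_PUSH,		/* queue the push work, do not wait */
	CIL_QUEUE_PUSH_AND_WAIT	/* queue the push, sleep until it starts */
};

/* Hypothetical model of what a committer does for a given CIL size. */
static enum cil_commit_action
cil_commit_decide(uint64_t space_used, uint64_t space_limit)
{
	uint64_t blocking_limit = space_limit * 2;

	assert(space_limit > 0);
	if (space_used < space_limit)
		return CIL_NO_PUSH;
	if (space_used < blocking_limit)
		return CIL_QUEUE_PUSH;
	return CIL_QUEUE_PUSH_AND_WAIT;
}
```

With a 32MB push threshold the committer only sleeps once 64MB is in
use, which on a 2GB log is the roughly 3% figure quoted above.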

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
 fs/xfs/xfs_trace.h    |  1 +
 3 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index a1204424a938..fc3f8e849dec 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -670,6 +670,11 @@ xlog_cil_push(
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
 
+	/*
+	 * Wake up any background push waiters now this context is being pushed.
+	 */
+	wake_up_all(&ctx->push_wait);
+
 	/*
 	 * Check if we've anything to push. If there is nothing, then we don't
 	 * move on to a new sequence number and so we have to be able to push
@@ -746,6 +751,7 @@ xlog_cil_push(
 	 */
 	INIT_LIST_HEAD(&new_ctx->committing);
 	INIT_LIST_HEAD(&new_ctx->busy_extents);
+	init_waitqueue_head(&new_ctx->push_wait);
 	new_ctx->sequence = ctx->sequence + 1;
 	new_ctx->cil = cil;
 	cil->xc_ctx = new_ctx;
@@ -900,7 +906,7 @@ xlog_cil_push_work(
  */
 static void
 xlog_cil_push_background(
-	struct xlog	*log)
+	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -914,14 +920,36 @@ xlog_cil_push_background(
 	 * don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
+	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+		up_read(&cil->xc_ctx_lock);
 		return;
+	}
 
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
 		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	}
+
+	/*
+	 * Drop the context lock now, we can't hold that if we need to sleep
+	 * because we are over the blocking threshold. The push_lock is still
+	 * held, so blocking threshold sleep/wakeup is still correctly
+	 * serialised here.
+	 */
+	up_read(&cil->xc_ctx_lock);
+
+	/*
+	 * If we are well over the space limit, throttle the work that is being
+	 * done until the push work on this context has begun.
+	 */
+	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
+		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
+		return;
+	}
+
 	spin_unlock(&cil->xc_push_lock);
 
 }
@@ -1038,9 +1066,9 @@ xfs_log_commit_cil(
 		if (lip->li_ops->iop_committing)
 			lip->li_ops->iop_committing(lip, xc_commit_lsn);
 	}
-	xlog_cil_push_background(log);
 
-	up_read(&cil->xc_ctx_lock);
+	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
+	xlog_cil_push_background(log);
 }
 
 /*
@@ -1199,6 +1227,7 @@ xlog_cil_init(
 
 	INIT_LIST_HEAD(&ctx->committing);
 	INIT_LIST_HEAD(&ctx->busy_extents);
+	init_waitqueue_head(&ctx->push_wait);
 	ctx->sequence = 1;
 	ctx->cil = cil;
 	cil->xc_ctx = ctx;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index abd382cfffe3..1cc5333a3f6a 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -242,6 +242,7 @@ struct xfs_cil_ctx {
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
+	wait_queue_head_t	push_wait;	/* background push throttle */
 	struct work_struct	discard_endio_work;
 };
 
@@ -339,10 +340,33 @@ struct xfs_cil {
  *   buffer window (32MB) as measurements have shown this to be roughly the
  *   point of diminishing performance increases under highly concurrent
  *   modification workloads.
+ *
+ * To prevent the CIL from overflowing upper commit size bounds, we introduce a
+ * new threshold at which we block committing transactions until the background
+ * CIL commit commences and switches to a new context. While this is not a hard
+ * limit, it forces the process committing a transaction to the CIL to block and
+ * yield the CPU, giving the CIL push work a chance to be scheduled and start
+ * work. This prevents a process running lots of transactions from overfilling
+ * the CIL because it is not yielding the CPU. We set the blocking limit at
+ * twice the background push space threshold so we keep in line with the AIL
+ * push thresholds.
+ *
+ * Note: this is not a -hard- limit as blocking is applied after the transaction
+ * is inserted into the CIL and the push has been triggered. It is largely a
+ * throttling mechanism that allows the CIL push to be scheduled and run. A hard
+ * limit will be difficult to implement without introducing global serialisation
+ * in the CIL commit fast path, and it's not at all clear that we actually need
+ * such hard limits given the ~7 years we've run without a hard limit before
+ * finding the first situation where a checkpoint size overflow actually
+ * occurred. Hence the simple throttle, and an ASSERT check to tell us that
+ * we've overrun the max size.
  */
 #define XLOG_CIL_SPACE_LIMIT(log)	\
 	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
 
+#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
+	(XLOG_CIL_SPACE_LIMIT(log) * 2)
+
 /*
  * ticket grant locks, queues and accounting have their own cachlines
  * as these are quite hot and can be operated on concurrently.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c13bb3655e48..d3635d1e3de6 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub);
 DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter);
 DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit);
 DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub);
+DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
 
 DECLARE_EVENT_CLASS(xfs_log_item_class,
 	TP_PROTO(struct xfs_log_item *lip),
-- 
2.24.0.rc0



* [PATCH 03/28] xfs: don't allow log IO to be throttled
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
  2019-10-31 23:45 ` [PATCH 01/28] xfs: Lower CIL flush limit for large logs Dave Chinner
  2019-10-31 23:45 ` [PATCH 02/28] xfs: Throttle commits on delayed background CIL push Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-10-31 23:45 ` [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability Dave Chinner
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Running metadata intensive workloads, I've been seeing the AIL
pushing getting stuck on pinned buffers and triggering log forces.
The log force is taking a long time to run because the log IO is
getting throttled by wbt_wait() - the block layer writeback
throttle. It's being throttled because there is a huge amount of
metadata writeback going on which is filling the request queue.

IOWs, we have a priority inversion problem here.

Mark the log IO bios with REQ_IDLE so they don't get throttled
by the block layer writeback throttle. When we are forcing the CIL,
we are likely to need to do tens of log IOs, and they are issued as
fast as they can be built and IO completed. Hence REQ_IDLE is
appropriate - it's an indication that more IO will follow shortly.

And because we also set REQ_SYNC, the writeback throttle will now
treat log IO the same way it treats direct IO writes - it will not
throttle them at all. Hence we solve the priority inversion problem
caused by the writeback throttle being unable to distinguish between
high priority log IO and background metadata writeback.
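
For readers unfamiliar with the throttle, the decision described above
can be modelled in user space roughly as follows. This is only a sketch:
the MODEL_* flag values and model_write_is_throttled() are made up for
illustration and are not the block layer's definitions. The only thing
taken from the description is that a write carrying both REQ_SYNC and
REQ_IDLE is not throttled.

```c
#include <stdbool.h>

/* Illustrative flag bits only (assumption); the kernel's values differ. */
enum {
	MODEL_REQ_SYNC = 1u << 0,
	MODEL_REQ_IDLE = 1u << 1,
	MODEL_REQ_META = 1u << 2,
	MODEL_REQ_FUA  = 1u << 3,
};

/*
 * Hypothetical model of the throttle decision for a write: writes that
 * carry both SYNC and IDLE pass straight through, all other writes are
 * candidates for throttling.
 */
static bool model_write_is_throttled(unsigned int opf)
{
	unsigned int exempt = MODEL_REQ_SYNC | MODEL_REQ_IDLE;

	return (opf & exempt) != exempt;
}
```

With the old log IO flags (META | SYNC | FUA) the model reports the
write as throttled; adding IDLE makes it exempt, which is the whole
effect of the one-line change below.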

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 33fb38864207..e7441a983b3b 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1792,7 +1792,15 @@ xlog_write_iclog(
 	iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart + bno;
 	iclog->ic_bio.bi_end_io = xlog_bio_end_io;
 	iclog->ic_bio.bi_private = iclog;
-	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_FUA;
+
+	/*
+	 * We use REQ_SYNC | REQ_IDLE here to tell the block layer there are
+	 * more IOs coming immediately after this one. This prevents the block
+	 * layer writeback throttle from throttling log writes behind
+	 * background metadata writeback and causing priority inversions.
+	 */
+	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
+				REQ_IDLE | REQ_FUA;
 	if (need_flush)
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
 
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (2 preceding siblings ...)
  2019-10-31 23:45 ` [PATCH 03/28] xfs: don't allow log IO to be throttled Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-11-01 12:05   ` Brian Foster
  2019-11-04 23:21   ` Darrick J. Wong
  2019-10-31 23:45 ` [PATCH 05/28] xfs: correctly acount for reclaimable slabs Dave Chinner
                   ` (23 subsequent siblings)
  27 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The buffer cache shrinker frees more than just the xfs_buf slab
objects - it also frees the pages attached to the buffers. Make sure
the memory reclaim code accounts for this memory being freed
correctly, similar to how the inode shrinker accounts for pages
freed from the page cache due to mapping invalidation.

We also need to make sure that the mm subsystem knows these are
reclaimable objects. We provide the memory reclaim subsystem with a
shrinker to reclaim xfs_bufs, so we should really mark the slab
that way.

We also have a lot of xfs_bufs in a busy system, so spread them
around like we do inodes.
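
The accounting idea can be sketched in user space like this. The
model_* types and names are hypothetical stand-ins, not the kernel
structures; the point is only that the pages attached to a buffer are
credited to the reclaim context, and only when there is one.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-task reclaim accounting state. */
struct model_reclaim_state {
	unsigned long	reclaimed_slab;	/* pages credited to reclaim */
};

/* Hypothetical stand-in for a buffer with pages attached. */
struct model_buf {
	unsigned int	page_count;
};

/*
 * Free the buffer's pages. If we are running in reclaim context (@rs is
 * not NULL), credit the freed pages so reclaim sees the real progress.
 */
static void model_buf_free(struct model_buf *bp,
			   struct model_reclaim_state *rs)
{
	unsigned int freed = bp->page_count;

	bp->page_count = 0;
	if (rs)
		rs->reclaimed_slab += freed;
}
```

Outside of reclaim the pointer is NULL and nothing is accounted, which
matches the current->reclaim_state check in the hunk below.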

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1e63dd3d1257..d34e5d2edacd 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -324,6 +324,9 @@ xfs_buf_free(
 
 			__free_page(page);
 		}
+		if (current->reclaim_state)
+			current->reclaim_state->reclaimed_slab +=
+							bp->b_page_count;
 	} else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 	_xfs_buf_free_pages(bp);
@@ -2061,7 +2064,8 @@ int __init
 xfs_buf_init(void)
 {
 	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
-						KM_ZONE_HWALIGN, NULL);
+			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
+			NULL);
 	if (!xfs_buf_zone)
 		goto out;
 
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 05/28] xfs: correctly acount for reclaimable slabs
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (3 preceding siblings ...)
  2019-10-31 23:45 ` [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-10-31 23:45 ` [PATCH 06/28] xfs: factor common AIL item deletion code Dave Chinner
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The XFS inode item slab is actually reclaimed by inode shrinker
callbacks from the memory reclaim subsystem. These should be marked
as reclaimable so the mm subsystem has the full picture of how much
memory it can actually reclaim from the XFS slab caches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index bcb1575a5652..ebe2ccd36127 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1876,7 +1876,7 @@ xfs_init_zones(void)
 
 	xfs_ili_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_log_item_t), "xfs_ili",
-					KM_ZONE_SPREAD, NULL);
+					KM_ZONE_SPREAD | KM_ZONE_RECLAIM, NULL);
 	if (!xfs_ili_zone)
 		goto out_destroy_inode_zone;
 	xfs_icreate_zone = kmem_zone_init(sizeof(struct xfs_icreate_item),
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 06/28] xfs: factor common AIL item deletion code
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (4 preceding siblings ...)
  2019-10-31 23:45 ` [PATCH 05/28] xfs: correctly acount for reclaimable slabs Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-11-04 23:16   ` Darrick J. Wong
  2019-10-31 23:45 ` [PATCH 07/28] xfs: tail updates only need to occur when LSN changes Dave Chinner
                   ` (21 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Factor the common AIL deletion code that does all the wakeups into a
helper so we only have one copy of this somewhat tricky code to
interface with all the wakeups necessary when the LSN of the log
tail changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_inode_item.c | 12 +----------
 fs/xfs/xfs_trans_ail.c  | 48 ++++++++++++++++++++++-------------------
 fs/xfs/xfs_trans_priv.h |  4 +++-
 3 files changed, 30 insertions(+), 34 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index bb8f076805b9..ab12e526540a 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -743,17 +743,7 @@ xfs_iflush_done(
 				xfs_clear_li_failed(blip);
 			}
 		}
-
-		if (mlip_changed) {
-			if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
-				xlog_assign_tail_lsn_locked(ailp->ail_mount);
-			if (list_empty(&ailp->ail_head))
-				wake_up_all(&ailp->ail_empty);
-		}
-		spin_unlock(&ailp->ail_lock);
-
-		if (mlip_changed)
-			xfs_log_space_wake(ailp->ail_mount);
+		xfs_ail_update_finish(ailp, mlip_changed);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 6ccfd75d3c24..656819523bbd 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -678,6 +678,27 @@ xfs_ail_push_all_sync(
 	finish_wait(&ailp->ail_empty, &wait);
 }
 
+void
+xfs_ail_update_finish(
+	struct xfs_ail		*ailp,
+	bool			do_tail_update) __releases(ailp->ail_lock)
+{
+	struct xfs_mount	*mp = ailp->ail_mount;
+
+	if (!do_tail_update) {
+		spin_unlock(&ailp->ail_lock);
+		return;
+	}
+
+	if (!XFS_FORCED_SHUTDOWN(mp))
+		xlog_assign_tail_lsn_locked(mp);
+
+	if (list_empty(&ailp->ail_head))
+		wake_up_all(&ailp->ail_empty);
+	spin_unlock(&ailp->ail_lock);
+	xfs_log_space_wake(mp);
+}
+
 /*
  * xfs_trans_ail_update - bulk AIL insertion operation.
  *
@@ -737,15 +758,7 @@ xfs_trans_ail_update_bulk(
 	if (!list_empty(&tmp))
 		xfs_ail_splice(ailp, cur, &tmp, lsn);
 
-	if (mlip_changed) {
-		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
-			xlog_assign_tail_lsn_locked(ailp->ail_mount);
-		spin_unlock(&ailp->ail_lock);
-
-		xfs_log_space_wake(ailp->ail_mount);
-	} else {
-		spin_unlock(&ailp->ail_lock);
-	}
+	xfs_ail_update_finish(ailp, mlip_changed);
 }
 
 bool
@@ -789,10 +802,10 @@ void
 xfs_trans_ail_delete(
 	struct xfs_ail		*ailp,
 	struct xfs_log_item	*lip,
-	int			shutdown_type) __releases(ailp->ail_lock)
+	int			shutdown_type)
 {
 	struct xfs_mount	*mp = ailp->ail_mount;
-	bool			mlip_changed;
+	bool			need_update;
 
 	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
 		spin_unlock(&ailp->ail_lock);
@@ -805,17 +818,8 @@ xfs_trans_ail_delete(
 		return;
 	}
 
-	mlip_changed = xfs_ail_delete_one(ailp, lip);
-	if (mlip_changed) {
-		if (!XFS_FORCED_SHUTDOWN(mp))
-			xlog_assign_tail_lsn_locked(mp);
-		if (list_empty(&ailp->ail_head))
-			wake_up_all(&ailp->ail_empty);
-	}
-
-	spin_unlock(&ailp->ail_lock);
-	if (mlip_changed)
-		xfs_log_space_wake(ailp->ail_mount);
+	need_update = xfs_ail_delete_one(ailp, lip);
+	xfs_ail_update_finish(ailp, need_update);
 }
 
 int
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 2e073c1c4614..64ffa746730e 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -92,8 +92,10 @@ xfs_trans_ail_update(
 }
 
 bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
+void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
+			__releases(ailp->ail_lock);
 void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
-		int shutdown_type) __releases(ailp->ail_lock);
+		int shutdown_type);
 
 static inline void
 xfs_trans_ail_remove(
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 07/28] xfs: tail updates only need to occur when LSN changes
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (5 preceding siblings ...)
  2019-10-31 23:45 ` [PATCH 06/28] xfs: factor common AIL item deletion code Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-11-04 23:18   ` Darrick J. Wong
  2019-10-31 23:45 ` [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
                   ` (20 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

We currently wake anything waiting on the log tail to move whenever
the log item at the tail of the log is removed. Historically this
was fine behaviour because there were very few items at any given
LSN. But with delayed logging, there may be thousands of items at
any given LSN, and we can't move the tail until they are all gone.

Hence if we are removing them in near tail-first order, we might be
waking up processes waiting on the tail LSN to change (e.g. log
space waiters) repeatedly without them being able to make progress.
This also occurs with the new sync push waiters, and can result in
thousands of spurious wakeups every second when under heavy direct
reclaim pressure.

To fix this, check that the tail LSN has actually changed on the
AIL before triggering wakeups. This will reduce the number of
spurious wakeups when doing bulk AIL removal and make this code much
more efficient.
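
A small user-space model may help show why comparing LSNs cuts down the
wakeups. Everything named model_* here is hypothetical and the AIL is
reduced to a sorted array; only the return convention (tail item's LSN,
or 0) and the "did the minimum LSN change" check are taken from this
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the AIL: LSNs kept in ascending order. */
struct model_ail {
	uint64_t	lsn[16];
	size_t		count;
	unsigned int	wakeups;	/* times waiters were woken */
};

/* LSN at the tail of the model AIL, or 0 if it is empty. */
static uint64_t model_ail_min_lsn(const struct model_ail *ail)
{
	return ail->count ? ail->lsn[0] : 0;
}

/*
 * Remove the item at @idx. Returns the item's LSN if it was the tail
 * item, otherwise 0.
 */
static uint64_t model_ail_delete_one(struct model_ail *ail, size_t idx)
{
	uint64_t lsn;
	size_t i;

	assert(idx < ail->count);
	lsn = ail->lsn[idx];
	for (i = idx + 1; i < ail->count; i++)
		ail->lsn[i - 1] = ail->lsn[i];
	ail->count--;
	return idx == 0 ? lsn : 0;
}

/* Wake waiters only if a tail item was removed and the tail LSN moved. */
static void model_ail_update_finish(struct model_ail *ail, uint64_t old_lsn)
{
	if (!old_lsn || old_lsn == model_ail_min_lsn(ail))
		return;
	ail->wakeups++;
}
```

With three items at LSN 10 followed by one at LSN 20, removing the tail
item three times results in a single wakeup (when the last LSN 10 item
goes), where the old "tail item changed" check would have woken waiters
three times.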

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode_item.c | 18 ++++++++++----
 fs/xfs/xfs_trans_ail.c  | 52 ++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_trans_priv.h |  4 ++--
 3 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index ab12e526540a..79ffe6dff115 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -731,19 +731,27 @@ xfs_iflush_done(
 	 * holding the lock before removing the inode from the AIL.
 	 */
 	if (need_ail) {
-		bool			mlip_changed = false;
+		xfs_lsn_t	tail_lsn = 0;
 
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(blip, &tmp, li_bio_list) {
 			if (INODE_ITEM(blip)->ili_logged &&
-			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
-				mlip_changed |= xfs_ail_delete_one(ailp, blip);
-			else {
+			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
+				/*
+				 * xfs_ail_update_finish() only cares about the
+				 * lsn of the first tail item removed, any
+				 * others will be at the same or higher lsn so
+				 * we just ignore them.
+				 */
+				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
+				if (!tail_lsn && lsn)
+					tail_lsn = lsn;
+			} else {
 				xfs_clear_li_failed(blip);
 			}
 		}
-		xfs_ail_update_finish(ailp, mlip_changed);
+		xfs_ail_update_finish(ailp, tail_lsn);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 656819523bbd..685a21cd24c0 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -108,17 +108,25 @@ xfs_ail_next(
  * We need the AIL lock in order to get a coherent read of the lsn of the last
  * item in the AIL.
  */
+static xfs_lsn_t
+__xfs_ail_min_lsn(
+	struct xfs_ail		*ailp)
+{
+	struct xfs_log_item	*lip = xfs_ail_min(ailp);
+
+	if (lip)
+		return lip->li_lsn;
+	return 0;
+}
+
 xfs_lsn_t
 xfs_ail_min_lsn(
 	struct xfs_ail		*ailp)
 {
-	xfs_lsn_t		lsn = 0;
-	struct xfs_log_item	*lip;
+	xfs_lsn_t		lsn;
 
 	spin_lock(&ailp->ail_lock);
-	lip = xfs_ail_min(ailp);
-	if (lip)
-		lsn = lip->li_lsn;
+	lsn = __xfs_ail_min_lsn(ailp);
 	spin_unlock(&ailp->ail_lock);
 
 	return lsn;
@@ -681,11 +689,12 @@ xfs_ail_push_all_sync(
 void
 xfs_ail_update_finish(
 	struct xfs_ail		*ailp,
-	bool			do_tail_update) __releases(ailp->ail_lock)
+	xfs_lsn_t		old_lsn) __releases(ailp->ail_lock)
 {
 	struct xfs_mount	*mp = ailp->ail_mount;
 
-	if (!do_tail_update) {
+	/* if the tail lsn hasn't changed, don't do updates or wakeups. */
+	if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) {
 		spin_unlock(&ailp->ail_lock);
 		return;
 	}
@@ -730,7 +739,7 @@ xfs_trans_ail_update_bulk(
 	xfs_lsn_t		lsn) __releases(ailp->ail_lock)
 {
 	struct xfs_log_item	*mlip;
-	int			mlip_changed = 0;
+	xfs_lsn_t		tail_lsn = 0;
 	int			i;
 	LIST_HEAD(tmp);
 
@@ -745,9 +754,10 @@ xfs_trans_ail_update_bulk(
 				continue;
 
 			trace_xfs_ail_move(lip, lip->li_lsn, lsn);
+			if (mlip == lip && !tail_lsn)
+				tail_lsn = lip->li_lsn;
+
 			xfs_ail_delete(ailp, lip);
-			if (mlip == lip)
-				mlip_changed = 1;
 		} else {
 			trace_xfs_ail_insert(lip, 0, lsn);
 		}
@@ -758,15 +768,23 @@ xfs_trans_ail_update_bulk(
 	if (!list_empty(&tmp))
 		xfs_ail_splice(ailp, cur, &tmp, lsn);
 
-	xfs_ail_update_finish(ailp, mlip_changed);
+	xfs_ail_update_finish(ailp, tail_lsn);
 }
 
-bool
+/*
+ * Delete one log item from the AIL.
+ *
+ * If this item was at the tail of the AIL, return the LSN of the log item so
+ * that we can use it to check if the LSN of the tail of the log has moved
+ * when finishing up the AIL delete process in xfs_ail_update_finish().
+ */
+xfs_lsn_t
 xfs_ail_delete_one(
 	struct xfs_ail		*ailp,
 	struct xfs_log_item	*lip)
 {
 	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
+	xfs_lsn_t		lsn = lip->li_lsn;
 
 	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
 	xfs_ail_delete(ailp, lip);
@@ -774,7 +792,9 @@ xfs_ail_delete_one(
 	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
 	lip->li_lsn = 0;
 
-	return mlip == lip;
+	if (mlip == lip)
+		return lsn;
+	return 0;
 }
 
 /**
@@ -805,7 +825,7 @@ xfs_trans_ail_delete(
 	int			shutdown_type)
 {
 	struct xfs_mount	*mp = ailp->ail_mount;
-	bool			need_update;
+	xfs_lsn_t		tail_lsn;
 
 	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
 		spin_unlock(&ailp->ail_lock);
@@ -818,8 +838,8 @@ xfs_trans_ail_delete(
 		return;
 	}
 
-	need_update = xfs_ail_delete_one(ailp, lip);
-	xfs_ail_update_finish(ailp, need_update);
+	tail_lsn = xfs_ail_delete_one(ailp, lip);
+	xfs_ail_update_finish(ailp, tail_lsn);
 }
 
 int
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 64ffa746730e..35655eac01a6 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -91,8 +91,8 @@ xfs_trans_ail_update(
 	xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn);
 }
 
-bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
-void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
+xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
+void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
 			__releases(ailp->ail_lock);
 void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
 		int shutdown_type);
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (6 preceding siblings ...)
  2019-10-31 23:45 ` [PATCH 07/28] xfs: tail updates only need to occur when LSN changes Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-11-01 12:05   ` Brian Foster
  2019-11-04 23:20   ` Darrick J. Wong
  2019-10-31 23:45 ` [PATCH 09/28] mm: directed shrinker work deferral Dave Chinner
                   ` (19 subsequent siblings)
  27 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

There's a lot of indentation in this code, which makes it a bit hard to
follow. We are also going to completely rework the inode lookup code
as part of the inode reclaim rework, so factor out the inode lookup
code from the inode cluster freeing code.

Based on prototype code from Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 152 +++++++++++++++++++++++++--------------------
 1 file changed, 84 insertions(+), 68 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e9e4f444f8ce..33edb18098ca 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2516,6 +2516,88 @@ xfs_iunlink_remove(
 	return error;
 }
 
+/*
+ * Look up the inode number specified and mark it stale if it is found. If it is
+ * dirty, return the inode so it can be attached to the cluster buffer so it can
+ * be processed appropriately when the cluster free transaction completes.
+ */
+static struct xfs_inode *
+xfs_ifree_get_one_inode(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*free_ip,
+	xfs_ino_t		inum)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_inode	*ip;
+
+retry:
+	rcu_read_lock();
+	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
+
+	/* Inode not in memory, nothing to do */
+	if (!ip)
+		goto out_rcu_unlock;
+
+	/*
+	 * because this is an RCU protected lookup, we could find a recently
+	 * freed or even reallocated inode during the lookup. We need to check
+	 * under the i_flags_lock for a valid inode here. Skip it if it is not
+	 * valid, the wrong inode or stale.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
+		spin_unlock(&ip->i_flags_lock);
+		goto out_rcu_unlock;
+	}
+	spin_unlock(&ip->i_flags_lock);
+
+	/*
+	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
+	 * other inodes that we did not find in the list attached to the buffer
+	 * and are not already marked stale. If we can't lock it, back off and
+	 * retry.
+	 */
+	if (ip != free_ip) {
+		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
+			rcu_read_unlock();
+			delay(1);
+			goto retry;
+		}
+
+		/*
+		 * Check the inode number again in case we're racing with
+		 * freeing in xfs_reclaim_inode().  See the comments in that
+		 * function for more information as to why the initial check is
+		 * not sufficient.
+		 */
+		if (ip->i_ino != inum) {
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+			goto out_rcu_unlock;
+		}
+	}
+	rcu_read_unlock();
+
+	xfs_iflock(ip);
+	xfs_iflags_set(ip, XFS_ISTALE);
+
+	/*
+	 * We don't need to attach clean inodes or those only with unlogged
+	 * changes (which we throw away, anyway).
+	 */
+	if (!ip->i_itemp || xfs_inode_clean(ip)) {
+		ASSERT(ip != free_ip);
+		xfs_ifunlock(ip);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		goto out_no_inode;
+	}
+	return ip;
+
+out_rcu_unlock:
+	rcu_read_unlock();
+out_no_inode:
+	return NULL;
+}
+
 /*
  * A big issue when freeing the inode cluster is that we _cannot_ skip any
  * inodes that are in memory - they all must be marked stale and attached to
@@ -2616,77 +2698,11 @@ xfs_ifree_cluster(
 		 * even trying to lock them.
 		 */
 		for (i = 0; i < igeo->inodes_per_cluster; i++) {
-retry:
-			rcu_read_lock();
-			ip = radix_tree_lookup(&pag->pag_ici_root,
-					XFS_INO_TO_AGINO(mp, (inum + i)));
-
-			/* Inode not in memory, nothing to do */
-			if (!ip) {
-				rcu_read_unlock();
+			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
+			if (!ip)
 				continue;
-			}
-
-			/*
-			 * because this is an RCU protected lookup, we could
-			 * find a recently freed or even reallocated inode
-			 * during the lookup. We need to check under the
-			 * i_flags_lock for a valid inode here. Skip it if it
-			 * is not valid, the wrong inode or stale.
-			 */
-			spin_lock(&ip->i_flags_lock);
-			if (ip->i_ino != inum + i ||
-			    __xfs_iflags_test(ip, XFS_ISTALE)) {
-				spin_unlock(&ip->i_flags_lock);
-				rcu_read_unlock();
-				continue;
-			}
-			spin_unlock(&ip->i_flags_lock);
-
-			/*
-			 * Don't try to lock/unlock the current inode, but we
-			 * _cannot_ skip the other inodes that we did not find
-			 * in the list attached to the buffer and are not
-			 * already marked stale. If we can't lock it, back off
-			 * and retry.
-			 */
-			if (ip != free_ip) {
-				if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
-					rcu_read_unlock();
-					delay(1);
-					goto retry;
-				}
-
-				/*
-				 * Check the inode number again in case we're
-				 * racing with freeing in xfs_reclaim_inode().
-				 * See the comments in that function for more
-				 * information as to why the initial check is
-				 * not sufficient.
-				 */
-				if (ip->i_ino != inum + i) {
-					xfs_iunlock(ip, XFS_ILOCK_EXCL);
-					rcu_read_unlock();
-					continue;
-				}
-			}
-			rcu_read_unlock();
 
-			xfs_iflock(ip);
-			xfs_iflags_set(ip, XFS_ISTALE);
-
-			/*
-			 * we don't need to attach clean inodes or those only
-			 * with unlogged changes (which we throw away, anyway).
-			 */
 			iip = ip->i_itemp;
-			if (!iip || xfs_inode_clean(ip)) {
-				ASSERT(ip != free_ip);
-				xfs_ifunlock(ip);
-				xfs_iunlock(ip, XFS_ILOCK_EXCL);
-				continue;
-			}
-
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 09/28] mm: directed shrinker work deferral
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (7 preceding siblings ...)
  2019-10-31 23:45 ` [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
@ 2019-10-31 23:45 ` Dave Chinner
  2019-11-04 15:25   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 10/28] shrinkers: use defer_work for GFP_NOFS sensitive shrinkers Dave Chinner
                   ` (18 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:45 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Introduce a mechanism for ->count_objects() to indicate to the
shrinker infrastructure that the reclaim context will not allow
scanning work to be done and so the work it decides is necessary
needs to be deferred.

This simplifies the code by separating out the accounting of
deferred work from the actual doing of the work, and allows better
decisions to be made by the shrinker control logic on what action it
can take.
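
As a rough illustration of the intended flow, here is a user-space
sketch. It is a deliberately simplified model, not the vmscan code: the
model_* names, the can_scan field (standing in for a gfp_mask check)
and the "free everything we scan" behaviour are assumptions made for
the example.

```c
#include <stdbool.h>

struct model_shrink_control {
	bool		can_scan;	/* stands in for a gfp_mask check */
	bool		defer_work;	/* set by the count callback */
};

struct model_shrinker {
	unsigned long	cached;		/* objects currently in the cache */
	unsigned long	nr_deferred;	/* work carried over from earlier */
};

/* Count callback: flags that this context must not scan the cache. */
static unsigned long model_count(struct model_shrinker *s,
				 struct model_shrink_control *sc)
{
	if (!sc->can_scan)
		sc->defer_work = true;
	return s->cached;
}

/*
 * Core loop: the work to do is the new work (@delta) plus whatever was
 * deferred earlier. If the count callback asked for deferral, scan
 * nothing and carry all of it forward. Returns the number freed.
 */
static unsigned long model_do_shrink(struct model_shrinker *s,
				     struct model_shrink_control *sc,
				     unsigned long delta)
{
	unsigned long freeable = model_count(s, sc);
	unsigned long work = s->nr_deferred + delta;
	unsigned long scanned = 0;

	if (!sc->defer_work) {
		scanned = work < freeable ? work : freeable;
		s->cached -= scanned;
	}
	s->nr_deferred = work - scanned;
	return scanned;
}
```

A call from a context that cannot scan frees nothing and leaves its
share of the work in nr_deferred; the next call from a context that can
scan picks up both its own work and the deferred amount.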

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h | 7 +++++++
 mm/vmscan.c              | 8 ++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..3405c39ab92c 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -31,6 +31,13 @@ struct shrink_control {
 
 	/* current memcg being shrunk (for memcg aware shrinkers) */
 	struct mem_cgroup *memcg;
+
+	/*
+	 * set by ->count_objects if reclaim context prevents reclaim from
+	 * occurring. This allows the shrinker to immediately defer all the
+	 * work and not even attempt to scan the cache.
+	 */
+	bool defer_work;
 };
 
 #define SHRINK_STOP (~0UL)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee4eecc7e1c2..a215d71d9d4b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -536,6 +536,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 				   freeable, delta, total_scan, priority);
 
+	/*
+	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
+	 * defer the work to a context that can scan the cache.
+	 */
+	if (shrinkctl->defer_work)
+		goto done;
+
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
 	 * pass to avoid too frequent shrinker calls, but if the slab has less
@@ -570,6 +577,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 
+done:
 	if (next_deferred >= scanned)
 		next_deferred -= scanned;
 	else
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 10/28] shrinkers: use defer_work for GFP_NOFS sensitive shrinkers
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (8 preceding siblings ...)
  2019-10-31 23:45 ` [PATCH 09/28] mm: directed shrinker work deferral Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-10-31 23:46 ` [PATCH 11/28] mm: factor shrinker work calculations Dave Chinner
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

For shrinkers that currently avoid scanning when called under
GFP_NOFS contexts, convert them to use the new ->defer_work flag
rather than checking and returning errors during scans.

This makes it very clear that these shrinkers are not doing any work
because of the context limitations, not because there is no work
that can be done.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 drivers/staging/android/ashmem.c |  8 ++++----
 fs/gfs2/glock.c                  |  5 +++--
 fs/gfs2/quota.c                  |  6 +++---
 fs/nfs/dir.c                     |  6 +++---
 fs/super.c                       |  6 +++---
 fs/xfs/xfs_qm.c                  | 11 ++++++++---
 net/sunrpc/auth.c                |  5 ++---
 7 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 74d497d39c5a..0b80149f0ac5 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -438,10 +438,6 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	unsigned long freed = 0;
 
-	/* We might recurse into filesystem code, so bail out if necessary */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	if (!mutex_trylock(&ashmem_mutex))
 		return -1;
 
@@ -478,6 +474,10 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 static unsigned long
 ashmem_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
+	/* We might recurse into filesystem code, so bail out if necessary */
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->defer_work = true;
+
 	/*
 	 * note that lru_count is count of pages on the lru, not a count of
 	 * objects on the list. This means the scan function needs to return the
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 0290a22ebccf..a25161b93f96 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1614,14 +1614,15 @@ static long gfs2_scan_glock_lru(int nr)
 static unsigned long gfs2_glock_shrink_scan(struct shrinker *shrink,
 					    struct shrink_control *sc)
 {
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
 	return gfs2_scan_glock_lru(sc->nr_to_scan);
 }
 
 static unsigned long gfs2_glock_shrink_count(struct shrinker *shrink,
 					     struct shrink_control *sc)
 {
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->defer_work = true;
+
 	return vfs_pressure_ratio(atomic_read(&lru_count));
 }
 
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 7c016a082aa6..661189b42c31 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -166,9 +166,6 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 	LIST_HEAD(dispose);
 	unsigned long freed;
 
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	freed = list_lru_shrink_walk(&gfs2_qd_lru, sc,
 				     gfs2_qd_isolate, &dispose);
 
@@ -180,6 +177,9 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink,
 					  struct shrink_control *sc)
 {
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->defer_work = true;
+
 	return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc));
 }
 
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e180033e35cf..fd4a70479790 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2211,10 +2211,7 @@ unsigned long
 nfs_access_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	int nr_to_scan = sc->nr_to_scan;
-	gfp_t gfp_mask = sc->gfp_mask;
 
-	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return SHRINK_STOP;
 	return nfs_do_access_cache_scan(nr_to_scan);
 }
 
@@ -2222,6 +2219,9 @@ nfs_access_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 unsigned long
 nfs_access_cache_count(struct shrinker *shrink, struct shrink_control *sc)
 {
+	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		sc->defer_work = true;
+
 	return vfs_pressure_ratio(atomic_long_read(&nfs_access_nr_entries));
 }
 
diff --git a/fs/super.c b/fs/super.c
index cfadab2cbf35..6dcab2a92454 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -74,9 +74,6 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
 	 * to recurse into the FS that called us in clear_inode() and friends..
 	 */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	if (!trylock_super(sb))
 		return SHRINK_STOP;
 
@@ -141,6 +138,9 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		return 0;
 	smp_rmb();
 
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->defer_work = true;
+
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index ecd8ce152ab1..aa03f2448145 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -502,9 +502,6 @@ xfs_qm_shrink_scan(
 	unsigned long		freed;
 	int			error;
 
-	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
-		return 0;
-
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
 
@@ -534,6 +531,14 @@ xfs_qm_shrink_count(
 	struct xfs_quotainfo	*qi = container_of(shrink,
 					struct xfs_quotainfo, qi_shrinker);
 
+	/*
+	 * __GFP_DIRECT_RECLAIM is used here to avoid blocking kswapd
+	 */
+	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) !=
+					(__GFP_FS|__GFP_DIRECT_RECLAIM)) {
+		sc->defer_work = true;
+	}
+
 	return list_lru_shrink_count(&qi->qi_lru, sc);
 }
 
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index cdb05b48de44..7d11a7034fee 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -527,9 +527,6 @@ static unsigned long
 rpcauth_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 
 {
-	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-		return SHRINK_STOP;
-
 	/* nothing left, don't come back */
 	if (list_empty(&cred_unused))
 		return SHRINK_STOP;
@@ -541,6 +538,8 @@ static unsigned long
 rpcauth_cache_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 
 {
+	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		sc->defer_work = true;
 	return number_cred_unused * sysctl_vfs_cache_pressure / 100;
 }
 
-- 
2.24.0.rc0



* [PATCH 11/28] mm: factor shrinker work calculations
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (9 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 10/28] shrinkers: use defer_work for GFP_NOFS sensitive shrinkers Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-02 10:55   ` kbuild test robot
  2019-11-04 15:29   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 12/28] shrinker: defer work only to kswapd Dave Chinner
                   ` (16 subsequent siblings)
  27 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Start to clean up the shrinker code by factoring out the calculation
that determines how much work to do. This separates the calculation
from clamping and other adjustments that are done before the
shrinker work is run. Document the scan batch size calculation
better while we are there.

Also convert the calculation for the amount of work to be done to
use 64 bit logic so we don't have to keep jumping through hoops to
keep calculations within 32 bits on 32 bit systems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c | 97 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 70 insertions(+), 27 deletions(-)
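
As a rough illustration of the calculation described above, here is a
userspace sketch that compares the old and new forms of the scan delta.
The names model_old_delta() and model_new_delta() are hypothetical and
exist only for this note; they are not kernel functions. The sketch
assumes non-negative inputs and a non-zero ->seeks, and it asserts
priority >= 2 because a negative shift count is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the pre-patch arithmetic: shift by the full
 * priority first, then scale the result up by 4 and divide by seeks.
 */
static int64_t model_old_delta(int64_t freeable, int priority, int seeks)
{
	assert(freeable >= 0 && seeks > 0);
	assert(priority >= 0 && priority < 62);
	return ((freeable >> priority) * 4) / seeks;
}

/*
 * Hypothetical model of the post-patch arithmetic: the x4 bias is folded
 * into the shift, so two more low-order bits survive before the divide.
 */
static int64_t model_new_delta(int64_t freeable, int priority, int seeks)
{
	assert(freeable >= 0 && seeks > 0);
	/* Shifting by a negative count is undefined, so require priority >= 2. */
	assert(priority >= 2 && priority - 2 < 63);
	return (freeable >> (priority - 2)) / seeks;
}
```

With freeable = 1 << 20 and priority = 12 the unbiased batch is 256
objects, and model_new_delta() gives 1024, 512, 341, 256 and 128 for
->seeks of 1, 2, 3, 4 and 8, which matches the multiplier table in the
new comment. The two forms agree when freeable is a multiple of
1 << priority, but can differ for small caches: with freeable = 4095 the
old form yields 0 and the new form yields 1.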

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a215d71d9d4b..2d39ec37c04d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -459,13 +459,68 @@ EXPORT_SYMBOL(unregister_shrinker);
 
 #define SHRINK_BATCH 128
 
+/*
+ * Calculate the number of new objects to scan this time around. Return
+ * the work to be done. If there are freeable objects, return that number in
+ * @freeable_objects.
+ */
+static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
+			    struct shrinker *shrinker, int priority,
+			    int64_t *freeable_objects)
+{
+	int64_t delta;
+	int64_t freeable;
+
+	freeable = shrinker->count_objects(shrinker, shrinkctl);
+	if (freeable == 0 || freeable == SHRINK_EMPTY)
+		return freeable;
+
+	if (shrinker->seeks) {
+		/*
+		 * shrinker->seeks is a measure of how much IO is required to
+		 * reinstantiate the object in memory. The default value is 2
+		 * which is typical for a cold inode requiring a directory read
+		 * and an inode read to re-instantiate.
+		 *
+		 * The scan batch size is defined by the shrinker priority, but
+		 * to be able to bias the reclaim we increase the default batch
+		 * size by 4. Hence we end up with a scan batch multiplier that
+		 * scales like so:
+		 *
+		 * ->seeks	scan batch multiplier
+		 *    1		      4.00x
+		 *    2               2.00x
+		 *    3               1.33x
+		 *    4               1.00x
+		 *    8               0.50x
+		 *
+		 * IOWs, the more seeks it takes to pull the item into cache,
+		 * the smaller the reclaim scan batch. Hence we put more reclaim
+		 * pressure on caches that are fast to repopulate and to keep a
+		 * rough balance between caches that have different costs.
+		 */
+		delta = freeable >> (priority - 2);
+		do_div(delta, shrinker->seeks);
+	} else {
+		/*
+		 * These objects don't require any IO to create. Trim them
+		 * aggressively under memory pressure to keep them from causing
+		 * refetches in the IO caches.
+		 */
+		delta = freeable / 2;
+	}
+
+	*freeable_objects = freeable;
+	return delta > 0 ? delta : 0;
+}
+
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 				    struct shrinker *shrinker, int priority)
 {
 	unsigned long freed = 0;
-	unsigned long long delta;
 	long total_scan;
-	long freeable;
+	int64_t freeable_objects = 0;
+	int64_t scan_count;
 	long nr;
 	long new_nr;
 	int nid = shrinkctl->nid;
@@ -476,9 +531,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 		nid = 0;
 
-	freeable = shrinker->count_objects(shrinker, shrinkctl);
-	if (freeable == 0 || freeable == SHRINK_EMPTY)
-		return freeable;
+	scan_count = shrink_scan_count(shrinkctl, shrinker, priority,
+					&freeable_objects);
+	if (scan_count == 0 || scan_count == SHRINK_EMPTY)
+		return scan_count;
 
 	/*
 	 * copy the current shrinker scan count into a local variable
@@ -487,25 +543,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 */
 	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
 
-	total_scan = nr;
-	if (shrinker->seeks) {
-		delta = freeable >> priority;
-		delta *= 4;
-		do_div(delta, shrinker->seeks);
-	} else {
-		/*
-		 * These objects don't require any IO to create. Trim
-		 * them aggressively under memory pressure to keep
-		 * them from causing refetches in the IO caches.
-		 */
-		delta = freeable / 2;
-	}
-
-	total_scan += delta;
+	total_scan = nr + scan_count;
 	if (total_scan < 0) {
 		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
 		       shrinker->scan_objects, total_scan);
-		total_scan = freeable;
+		total_scan = scan_count;
 		next_deferred = nr;
 	} else
 		next_deferred = total_scan;
@@ -522,19 +564,20 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * Hence only allow the shrinker to scan the entire cache when
 	 * a large delta change is calculated directly.
 	 */
-	if (delta < freeable / 4)
-		total_scan = min(total_scan, freeable / 2);
+	if (scan_count < freeable_objects / 4)
+		total_scan = min_t(long, total_scan, freeable_objects / 2);
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
 	 * never try to free more than twice the estimate number of
 	 * freeable entries.
 	 */
-	if (total_scan > freeable * 2)
-		total_scan = freeable * 2;
+	if (total_scan > freeable_objects * 2)
+		total_scan = freeable_objects * 2;
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
-				   freeable, delta, total_scan, priority);
+				   freeable_objects, scan_count,
+				   total_scan, priority);
 
 	/*
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
@@ -559,7 +602,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * possible.
 	 */
 	while (total_scan >= batch_size ||
-	       total_scan >= freeable) {
+	       total_scan >= freeable_objects) {
 		unsigned long ret;
 		unsigned long nr_to_scan = min(batch_size, total_scan);
 
-- 
2.24.0.rc0



* [PATCH 12/28] shrinker: defer work only to kswapd
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (10 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 11/28] mm: factor shrinker work calculations Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-04 15:29   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 13/28] shrinker: clean up variable types and tracepoints Dave Chinner
                   ` (15 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Right now deferred work is picked up by whichever GFP_KERNEL context
reclaimer wins the race to empty the node's deferred work
counter. However, if there are lots of direct reclaimers, that
work might be continually picked up by contexts that can't do any
work and so the opportunities to do the work are missed by contexts
that could do it.

A further problem with the current code is that the deferred work
can be picked up by a random direct reclaimer. That specific process
then has to do all the deferred reclaim work and hence can suffer
extremely long latencies if the reclaim work blocks
regularly. This is not good for direct reclaim fairness or for
minimising long tail latency events.

To avoid these problems, simply limit deferred work to kswapd
contexts. We know kswapd is a context that can always do reclaim
work, and hence deferring work to kswapd allows the deferred work to
be done in the background and not adversely affect any specific
process context doing direct reclaim.

The advantage of this is that the amount of work to be done in direct
reclaim is now bounded and predictable - it is entirely based on
the cache's freeable objects and the reclaim priority. Hence all
direct reclaimers running at the same time should be doing
relatively equal amounts of work, thereby reducing the incidence of
long tail latencies due to uneven reclaim workloads.

Note that we use signed integers for everything except the freed
count as the returns from the shrinker callouts cannot be guaranteed
untainted. Indeed, the shrinkers can return scan counts larger than
were fed in, so we need scan counts to underflow in a detectable
manner to terminate loops. This is necessary to prevent a misbehaving
shrinker from triggering endless scanning loops.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h |   2 +-
 mm/vmscan.c              | 100 ++++++++++++++++++++-------------------
 2 files changed, 53 insertions(+), 49 deletions(-)
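
To make the kswapd-only deferral easier to follow, here is a minimal
userspace sketch of how the scan count and the carried-over deferred
count are derived. struct deferred_model and model_take_deferred() are
hypothetical names used only in this note, and the is_kswapd flag stands
in for current_is_kswapd(); the arithmetic is meant to mirror the hunk
below, but it is a model under those assumptions and not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result pair: work to do now, and deferred work taken. */
struct deferred_model {
	int64_t scan_count;
	int64_t deferred_count;
};

static int64_t model_min64(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

static struct deferred_model
model_take_deferred(bool is_kswapd, int64_t nr_deferred, int64_t scan_count,
		    int64_t freeable, int priority)
{
	struct deferred_model m = { .scan_count = scan_count,
				    .deferred_count = 0 };

	assert(nr_deferred >= 0 && scan_count >= 0 && freeable >= 0);
	assert(priority >= 0);

	if (is_kswapd) {
		int64_t deferred_scan = nr_deferred;

		/* kswapd takes the whole deferred counter ... */
		m.deferred_count = nr_deferred;

		/* ... but only scans a priority-scaled share of it now. */
		if (priority)
			deferred_scan /= priority;
		m.scan_count += deferred_scan;

		/* Limit what is carried over to avoid deferred work windup. */
		m.deferred_count = model_min64(m.deferred_count, freeable * 2);
	}

	/* Never scan more than twice the freeable estimate. */
	m.scan_count = model_min64(m.scan_count, freeable * 2);
	return m;
}
```

For example, with 12000 deferred objects, a base scan count of 100, 1000
freeable objects and priority 12, the kswapd case scans 1100 objects and
carries 2000 forward, while a direct reclaimer with the same inputs scans
only its own 100 and takes no deferred work.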

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 3405c39ab92c..30c10f42109f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -81,7 +81,7 @@ struct shrinker {
 	int id;
 #endif
 	/* objs pending delete, per node */
-	atomic_long_t *nr_deferred;
+	atomic64_t *nr_deferred;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d39ec37c04d..c0e2bf656e3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -517,16 +517,16 @@ static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 				    struct shrinker *shrinker, int priority)
 {
-	unsigned long freed = 0;
-	long total_scan;
+	uint64_t freed = 0;
 	int64_t freeable_objects = 0;
 	int64_t scan_count;
-	long nr;
-	long new_nr;
+	int64_t scanned_objects = 0;
+	int64_t next_deferred = 0;
+	int64_t deferred_count = 0;
+	int64_t new_nr;
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
-	long scanned = 0, next_deferred;
 
 	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 		nid = 0;
@@ -537,47 +537,51 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		return scan_count;
 
 	/*
-	 * copy the current shrinker scan count into a local variable
-	 * and zero it so that other concurrent shrinker invocations
-	 * don't also do this scanning work.
-	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
-
-	total_scan = nr + scan_count;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = scan_count;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
-
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
+	 * If kswapd, we take all the deferred work and do it here. We don't let
+	 * direct reclaim do this, because then it means some poor sod is going
+	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
+	 * amount of reclaim work from concurrent kswapd operations. Hence we do
+	 * the work in the wrong place, at the wrong time, and it's largely
+	 * unpredictable.
 	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
+	 * By doing the deferred work only in kswapd, we can schedule the work
+	 * according to the reclaim priority - low priority reclaim will do
+	 * less deferred work, hence we'll do more of the deferred work the more
+	 * desperate we become for free memory. This avoids the need to
+	 * specifically avoid deferred work windup, as low amounts of memory
+	 * pressure won't excessively trim caches anymore.
 	 */
-	if (scan_count < freeable_objects / 4)
-		total_scan = min_t(long, total_scan, freeable_objects / 2);
+	if (current_is_kswapd()) {
+		int64_t	deferred_scan;
+
+		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
+
+		/* we want to scan 5-10% of the deferred work here at minimum */
+		deferred_scan = deferred_count;
+		if (priority)
+			do_div(deferred_scan, priority);
+		scan_count += deferred_scan;
+
+		/*
+		 * If there is more deferred work than the number of freeable
+		 * items in the cache, limit the amount of work we will carry
+		 * over to the next kswapd run on this cache. This prevents
+		 * deferred work windup.
+		 */
+		deferred_count = min(deferred_count, freeable_objects * 2);
+
+	}
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
 	 * never try to free more than twice the estimate number of
 	 * freeable entries.
 	 */
-	if (total_scan > freeable_objects * 2)
-		total_scan = freeable_objects * 2;
+	scan_count = min(scan_count, freeable_objects * 2);
 
-	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
+	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
 				   freeable_objects, scan_count,
-				   total_scan, priority);
+				   scan_count, priority);
 
 	/*
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
@@ -601,10 +605,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * scanning at high prio and therefore should try to reclaim as much as
 	 * possible.
 	 */
-	while (total_scan >= batch_size ||
-	       total_scan >= freeable_objects) {
+	while (scan_count >= batch_size ||
+	       scan_count >= freeable_objects) {
 		unsigned long ret;
-		unsigned long nr_to_scan = min(batch_size, total_scan);
+		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
 
 		shrinkctl->nr_to_scan = nr_to_scan;
 		shrinkctl->nr_scanned = nr_to_scan;
@@ -614,29 +618,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		freed += ret;
 
 		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
-		total_scan -= shrinkctl->nr_scanned;
-		scanned += shrinkctl->nr_scanned;
+		scan_count -= shrinkctl->nr_scanned;
+		scanned_objects += shrinkctl->nr_scanned;
 
 		cond_resched();
 	}
-
 done:
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
+	if (deferred_count)
+		next_deferred = deferred_count - scanned_objects;
 	else
-		next_deferred = 0;
+		next_deferred = scan_count;
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates. If we exhausted the
 	 * scan, there is no need to do an update.
 	 */
 	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
+		new_nr = atomic64_add_return(next_deferred,
 						&shrinker->nr_deferred[nid]);
 	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+		new_nr = atomic64_read(&shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
+					scan_count);
 	return freed;
 }
 
-- 
2.24.0.rc0



* [PATCH 13/28] shrinker: clean up variable types and tracepoints
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (11 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 12/28] shrinker: defer work only to kswapd Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-04 15:30   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 14/28] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
                   ` (14 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The tracepoint information in the shrinker code doesn't make a lot of
sense anymore and contains redundant information as a result of the
changes in the patchset. Refine the information passed to the
tracepoints so they expose the operation of the shrinkers more
precisely and clean up the remaining code and variables in the
shrinker code so it all makes sense.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/trace/events/vmscan.h | 69 ++++++++++++++++-------------------
 mm/vmscan.c                   | 24 +++++-------
 2 files changed, 41 insertions(+), 52 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a5ab2973e8dc..110637d9efa5 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -184,84 +184,77 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 
 TRACE_EVENT(mm_shrink_slab_start,
 	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
-		long nr_objects_to_shrink, unsigned long cache_items,
-		unsigned long long delta, unsigned long total_scan,
-		int priority),
+		int64_t deferred_count, int64_t freeable_objects,
+		int64_t scan_count, int priority),
 
-	TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan,
+	TP_ARGS(shr, sc, deferred_count, freeable_objects, scan_count,
 		priority),
 
 	TP_STRUCT__entry(
 		__field(struct shrinker *, shr)
 		__field(void *, shrink)
 		__field(int, nid)
-		__field(long, nr_objects_to_shrink)
-		__field(gfp_t, gfp_flags)
-		__field(unsigned long, cache_items)
-		__field(unsigned long long, delta)
-		__field(unsigned long, total_scan)
+		__field(int64_t, deferred_count)
+		__field(int64_t, freeable_objects)
+		__field(int64_t, scan_count)
 		__field(int, priority)
+		__field(gfp_t, gfp_flags)
 	),
 
 	TP_fast_assign(
 		__entry->shr = shr;
 		__entry->shrink = shr->scan_objects;
 		__entry->nid = sc->nid;
-		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
-		__entry->gfp_flags = sc->gfp_mask;
-		__entry->cache_items = cache_items;
-		__entry->delta = delta;
-		__entry->total_scan = total_scan;
+		__entry->deferred_count = deferred_count;
+		__entry->freeable_objects = freeable_objects;
+		__entry->scan_count = scan_count;
 		__entry->priority = priority;
+		__entry->gfp_flags = sc->gfp_mask;
 	),
 
-	TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+	TP_printk("%pS %p: nid: %d scan count %lld freeable items %lld deferred count %lld priority %d gfp_flags %s",
 		__entry->shrink,
 		__entry->shr,
 		__entry->nid,
-		__entry->nr_objects_to_shrink,
-		show_gfp_flags(__entry->gfp_flags),
-		__entry->cache_items,
-		__entry->delta,
-		__entry->total_scan,
-		__entry->priority)
+		__entry->scan_count,
+		__entry->freeable_objects,
+		__entry->deferred_count,
+		__entry->priority,
+		show_gfp_flags(__entry->gfp_flags))
 );
 
 TRACE_EVENT(mm_shrink_slab_end,
-	TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval,
-		long unused_scan_cnt, long new_scan_cnt, long total_scan),
+	TP_PROTO(struct shrinker *shr, int nid, int64_t freed_objects,
+		int64_t scanned_objects, int64_t deferred_scan),
 
-	TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt,
-		total_scan),
+	TP_ARGS(shr, nid, freed_objects, scanned_objects,
+		deferred_scan),
 
 	TP_STRUCT__entry(
 		__field(struct shrinker *, shr)
 		__field(int, nid)
 		__field(void *, shrink)
-		__field(long, unused_scan)
-		__field(long, new_scan)
-		__field(int, retval)
-		__field(long, total_scan)
+		__field(long long, freed_objects)
+		__field(long long, scanned_objects)
+		__field(long long, deferred_scan)
 	),
 
 	TP_fast_assign(
 		__entry->shr = shr;
 		__entry->nid = nid;
 		__entry->shrink = shr->scan_objects;
-		__entry->unused_scan = unused_scan_cnt;
-		__entry->new_scan = new_scan_cnt;
-		__entry->retval = shrinker_retval;
-		__entry->total_scan = total_scan;
+		__entry->freed_objects = freed_objects;
+		__entry->scanned_objects = scanned_objects;
+		__entry->deferred_scan = deferred_scan;
 	),
 
-	TP_printk("%pS %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+	TP_printk("%pS %p: nid: %d freed objects %lld scanned objects %lld, deferred scan %lld",
 		__entry->shrink,
 		__entry->shr,
 		__entry->nid,
-		__entry->unused_scan,
-		__entry->new_scan,
-		__entry->total_scan,
-		__entry->retval)
+		__entry->freed_objects,
+		__entry->scanned_objects,
+		__entry->deferred_scan)
 );
 
 TRACE_EVENT(mm_vmscan_lru_isolate,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c0e2bf656e3f..7a8256322150 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -523,7 +523,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	int64_t scanned_objects = 0;
 	int64_t next_deferred = 0;
 	int64_t deferred_count = 0;
-	int64_t new_nr;
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
@@ -580,8 +579,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	scan_count = min(scan_count, freeable_objects * 2);
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
-				   freeable_objects, scan_count,
-				   scan_count, priority);
+				   freeable_objects, scan_count, priority);
 
 	/*
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
@@ -624,23 +622,21 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 done:
+	/*
+	 * Calculate the remaining work that we need to defer to kswapd, and
+	 * store it in a manner that handles concurrent updates. If we exhausted
+	 * the scan, there is no need to do an update.
+	 */
 	if (deferred_count)
 		next_deferred = deferred_count - scanned_objects;
 	else
 		next_deferred = scan_count;
-	/*
-	 * move the unused scan count back into the shrinker in a
-	 * manner that handles concurrent updates. If we exhausted the
-	 * scan, there is no need to do an update.
-	 */
+
 	if (next_deferred > 0)
-		new_nr = atomic64_add_return(next_deferred,
-						&shrinker->nr_deferred[nid]);
-	else
-		new_nr = atomic64_read(&shrinker->nr_deferred[nid]);
+		atomic64_add(next_deferred, &shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
-					scan_count);
+	trace_mm_shrink_slab_end(shrinker, nid, freed, scanned_objects,
+				 next_deferred);
 	return freed;
 }
 
-- 
2.24.0.rc0



* [PATCH 14/28] mm: reclaim_state records pages reclaimed, not slabs
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (12 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 13/28] shrinker: clean up variable types and tracepoints Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-04 19:58   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 15/28] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
                   ` (13 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Add a wrapper to account for page freeing in shrinker reclaim so
that the high level scanning accounts for all the memory freed
during a shrinker scan.

No logic changes, just replacing open coded checks with a simple
wrapper.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c           |  3 +--
 fs/xfs/xfs_buf.c     |  4 +---
 include/linux/swap.h | 20 ++++++++++++++++++--
 mm/slab.c            |  3 +--
 mm/slob.c            |  4 +---
 mm/slub.c            |  3 +--
 mm/vmscan.c          |  4 ++--
 7 files changed, 25 insertions(+), 16 deletions(-)
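
The wrapper pattern can be sketched in userspace as follows. The names
struct reclaim_state_model, model_current_reclaim_state and
model_reclaim_account_pages() are hypothetical stand-ins for
struct reclaim_state, current->reclaim_state and
current_reclaim_account_pages(); the point is only to show that callers
no longer need to open code the NULL check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct reclaim_state. */
struct reclaim_state_model {
	unsigned long reclaimed_pages;	/* pages freed by shrinkers */
};

/*
 * Hypothetical stand-in for current->reclaim_state: NULL when the task
 * is not running memory reclaim.
 */
static struct reclaim_state_model *model_current_reclaim_state;

/* Account freed pages only when a reclaim context is active. */
static void model_reclaim_account_pages(int nr_pages)
{
	if (model_current_reclaim_state)
		model_current_reclaim_state->reclaimed_pages += nr_pages;
}
```

A caller that frees an order-2 allocation would simply call
model_reclaim_account_pages(1 << 2); outside of reclaim the call does
nothing.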

diff --git a/fs/inode.c b/fs/inode.c
index fef457a42882..a77caf216659 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -764,8 +764,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 				__count_vm_events(KSWAPD_INODESTEAL, reap);
 			else
 				__count_vm_events(PGINODESTEAL, reap);
-			if (current->reclaim_state)
-				current->reclaim_state->reclaimed_slab += reap;
+			current_reclaim_account_pages(reap);
 		}
 		iput(inode);
 		spin_lock(lru_lock);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index d34e5d2edacd..55b082bc53b3 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -324,9 +324,7 @@ xfs_buf_free(
 
 			__free_page(page);
 		}
-		if (current->reclaim_state)
-			current->reclaim_state->reclaimed_slab +=
-							bp->b_page_count;
+		current_reclaim_account_pages(bp->b_page_count);
 	} else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 	_xfs_buf_free_pages(bp);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1e112b..72b855fe20b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -126,12 +126,28 @@ union swap_header {
 
 /*
  * current->reclaim_state points to one of these when a task is running
- * memory reclaim
+ * memory reclaim. It is typically used by shrinkers to return reclaim
+ * information back to the main vmscan loop.
  */
 struct reclaim_state {
-	unsigned long reclaimed_slab;
+	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
 };
 
+/*
+ * When code frees a page that may be run from a memory reclaim context, it
+ * needs to account for the pages it frees so memory reclaim can track them.
+ * Slab memory that is freed is accounted via this mechanism, so this is not
+ * necessary for slab or heap memory being freed. However, if the object being
+ * freed frees pages directly, then those pages should be accounted as well when
+ * in memory reclaim. This helper function takes care of accounting for the
+ * pages being reclaimed when it is required.
+ */
+static inline void current_reclaim_account_pages(int nr_pages)
+{
+	if (current->reclaim_state)
+		current->reclaim_state->reclaimed_pages += nr_pages;
+}
+
 #ifdef __KERNEL__
 
 struct address_space;
diff --git a/mm/slab.c b/mm/slab.c
index 66e5d8032bae..419be005f41a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1395,8 +1395,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 	page_mapcount_reset(page);
 	page->mapping = NULL;
 
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += 1 << order;
+	current_reclaim_account_pages(1 << order);
 	uncharge_slab_page(page, order, cachep);
 	__free_pages(page, order);
 }
diff --git a/mm/slob.c b/mm/slob.c
index fa53e9f73893..c54a7eeee86d 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -211,9 +211,7 @@ static void slob_free_pages(void *b, int order)
 {
 	struct page *sp = virt_to_page(b);
 
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += 1 << order;
-
+	current_reclaim_account_pages(1 << order);
 	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
 			    -(1 << order));
 	__free_pages(sp, order);
diff --git a/mm/slub.c b/mm/slub.c
index b25c807a111f..478554082079 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1746,8 +1746,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 	__ClearPageSlab(page);
 
 	page->mapping = NULL;
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += pages;
+	current_reclaim_account_pages(pages);
 	uncharge_slab_page(page, order, s);
 	__free_pages(page, order);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7a8256322150..967e3d3c7748 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2870,8 +2870,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
 
 		if (reclaim_state) {
-			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
+			sc->nr_reclaimed += reclaim_state->reclaimed_pages;
+			reclaim_state->reclaimed_pages = 0;
 		}
 
 		/* Record the subtree's reclaim efficiency */
-- 
2.24.0.rc0



* [PATCH 15/28] mm: back off direct reclaim on excessive shrinker deferral
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (13 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 14/28] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-04 19:58   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 16/28] mm: kswapd backoff for shrinkers Dave Chinner
                   ` (12 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

When the majority of possible shrinker reclaim work is deferred by
the shrinkers (e.g. due to GFP_NOFS context), and there is more work
deferred than LRU pages were scanned, back off reclaim if there are
large amounts of IO in progress.

This tends to occur when there are inode cache heavy workloads that
have little page cache or application memory pressure on filesystems
like XFS. Inode cache heavy workloads involve lots of IO, so if we
are getting device congestion it is indicative of memory reclaim
running up against an IO throughput limitation. In this situation
we need to throttle direct reclaim as we need to wait for kswapd to
get some of the deferred work done.

However, if there is no device congestion, then the system is
keeping up with both the workload and memory reclaim and so there's
no need to throttle.

Hence we should only back off scanning for a bit if we see this
condition and there is block device congestion present.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/swap.h |  2 ++
 mm/vmscan.c          | 30 +++++++++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 1 deletion(-)
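
The back-off condition described above can be expressed as a small
predicate. This is a userspace sketch only: struct reclaim_model and
model_should_back_off() are hypothetical names for this note, and the
lru_pages_scanned parameter stands in for the number of LRU pages
scanned in this pass. The sketch decides only whether the throttling
call would be attempted; the congestion check itself is left to
wait_iff_congested() in the patch and is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the new reclaim_state counters. */
struct reclaim_model {
	unsigned long scanned_objects;	/* quantity of work done */
	unsigned long deferred_objects;	/* work that wasn't done */
};

/*
 * Back off only when more shrinker work was deferred than LRU pages
 * were scanned, and more was deferred than the shrinkers actually did.
 */
static bool model_should_back_off(const struct reclaim_model *rs,
				  unsigned long lru_pages_scanned)
{
	return rs->deferred_objects > lru_pages_scanned &&
	       rs->deferred_objects > rs->scanned_objects;
}
```

For instance, 500 deferred objects against 10 scanned objects and 100
scanned LRU pages would trigger the back-off attempt, whereas the same
deferral against 1000 scanned LRU pages would not.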

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 72b855fe20b0..da0913e14bb9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -131,6 +131,8 @@ union swap_header {
  */
 struct reclaim_state {
 	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
+	unsigned long	scanned_objects;	/* quantity of work done */ 
+	unsigned long	deferred_objects;	/* work that wasn't done */
 };
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 967e3d3c7748..13c11e10c9c5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -570,6 +570,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		deferred_count = min(deferred_count, freeable_objects * 2);
 
 	}
+	if (current->reclaim_state)
+		current->reclaim_state->scanned_objects += scanned_objects;
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
@@ -585,8 +587,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
 	 * defer the work to a context that can scan the cache.
 	 */
-	if (shrinkctl->defer_work)
+	if (shrinkctl->defer_work) {
+		if (current->reclaim_state)
+			current->reclaim_state->deferred_objects += scan_count;
 		goto done;
+	}
 
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
@@ -2871,7 +2876,30 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_pages;
+
+			/*
+			 * If we are deferring more work than we are actually
+			 * doing in the shrinkers, and we are scanning more
+			 * objects than we are pages, then we have a large amount
+			 * of slab caches we are deferring work to kswapd for.
+			 * We better back off here for a while, otherwise
+			 * we risk priority windup, swap storms and OOM kills
+			 * once we empty the page lists but still can't make
+			 * progress on the shrinker memory.
+			 *
+			 * kswapd won't ever defer work as it's run under a
+			 * GFP_KERNEL context and can always do work.
+			 */
+			if ((reclaim_state->deferred_objects >
+					sc->nr_scanned - nr_scanned) &&
+			    (reclaim_state->deferred_objects >
+					reclaim_state->scanned_objects)) {
+				wait_iff_congested(BLK_RW_ASYNC, HZ/50);
+			}
+
 			reclaim_state->reclaimed_pages = 0;
+			reclaim_state->deferred_objects = 0;
+			reclaim_state->scanned_objects = 0;
 		}
 
 		/* Record the subtree's reclaim efficiency */
-- 
2.24.0.rc0



* [PATCH 16/28] mm: kswapd backoff for shrinkers
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (14 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 15/28] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-04 19:58   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 17/28] xfs: synchronous AIL pushing Dave Chinner
                   ` (11 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

When kswapd reaches the end of the page LRU and starts hitting dirty
pages, the logic in shrink_node() allows it to back off and wait for
IO to complete, thereby preventing kswapd from scanning excessively
and driving the system into swap thrashing and OOM conditions.

When we have inode cache heavy workloads on XFS, we have exactly the
same problem with reclaim inodes. The non-blocking kswapd reclaim
will keep putting pressure onto the inode cache which is unable to
make progress. When the system gets to the point where there are no
pages in the LRU to free, there is no swap left and there are no
clean inodes that can be freed, it will OOM. This has a specific
signature in OOM:

[  110.841987] Mem-Info:
[  110.842816] active_anon:241 inactive_anon:82 isolated_anon:1
                active_file:168 inactive_file:143 isolated_file:0
                unevictable:2621523 dirty:1 writeback:8 unstable:0
                slab_reclaimable:564445 slab_unreclaimable:420046
                mapped:1042 shmem:11 pagetables:6509 bounce:0
                free:77626 free_pcp:2 free_cma:0

In this case, we have about 500-600 pages left in the LRUs, but we
have ~565000 reclaimable slab pages still available for reclaim.
Unfortunately, they are mostly dirty inodes, and so we really need
to be able to throttle kswapd when shrinker progress is limited due
to reaching the dirty end of the LRU...

So, add a flag into the reclaim_state so if the shrinker decides it
needs kswapd to back off and wait for a while (for whatever reason)
it can do so.
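
A small userspace model of how the flag is meant to be consumed may
help. This is only a sketch: the struct and function names are
hypothetical, and the boolean return stands in for the
congestion_wait() call in shrink_node().

```c
#include <stdbool.h>

/* Userspace model of the flag added to struct reclaim_state. */
struct reclaim_flags {
	bool need_backoff;	/* set by a shrinker that wants kswapd to wait */
};

/*
 * Decide whether kswapd should stall. A stall already requested by the
 * page LRU (nr_immediate) takes precedence and leaves the flag alone;
 * otherwise a pending shrinker request is honoured once and cleared.
 */
static bool kswapd_should_wait(struct reclaim_flags *rs, bool nr_immediate)
{
	if (nr_immediate)
		return true;
	if (rs && rs->need_backoff) {
		rs->need_backoff = false;
		return true;
	}
	return false;
}
```

Clearing the flag on use means each shrinker request results in at
most one stall, so kswapd resumes scanning unless the shrinker asks
again.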

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/swap.h |  1 +
 mm/vmscan.c          | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index da0913e14bb9..76fc28f0e483 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -133,6 +133,7 @@ struct reclaim_state {
 	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
 	unsigned long	scanned_objects;	/* quantity of work done */ 
 	unsigned long	deferred_objects;	/* work that wasn't done */
+	bool		need_backoff;		/* tell kswapd to slow down */
 };
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13c11e10c9c5..0f7d35820057 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,8 +2949,16 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			 * implies that pages are cycling through the LRU
 			 * faster than they are written so also forcibly stall.
 			 */
-			if (sc->nr.immediate)
+			if (sc->nr.immediate) {
 				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			} else if (reclaim_state && reclaim_state->need_backoff) {
+				/*
+				 * Ditto, but it's a slab cache that is cycling
+				 * through the LRU faster than they are written
+				 */
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				reclaim_state->need_backoff = false;
+			}
 		}
 
 		/*
-- 
2.24.0.rc0



* [PATCH 17/28] xfs: synchronous AIL pushing
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (15 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 16/28] mm: kswapd backoff for shrinkers Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-05 17:05   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 18/28] xfs: don't block kswapd in inode reclaim Dave Chinner
                   ` (10 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Provide an interface to push the AIL to a target LSN and wait for
the tail of the log to move past that LSN. This is used to wait for
all items older than a specific LSN to either be cleaned (written
back) or relogged to a higher LSN in the AIL. The primary use for
this is to allow IO free inode reclaim throttling.
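
The exit condition of the wait loop can be sketched in userspace as
below. This is a simplified model, not the kernel code: LSNs are
treated as plain 64-bit integers rather than going through
XFS_LSN_CMP(), and the function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true once a synchronous push to @threshold can stop waiting:
 * the AIL is empty, the filesystem has shut down, or the oldest item
 * still in the AIL (@min_lsn) is newer than the push target.
 */
static bool push_sync_done(bool ail_empty, bool shutdown,
			   uint64_t min_lsn, uint64_t threshold)
{
	if (ail_empty || shutdown)
		return true;
	return threshold < min_lsn;
}
```

Note that an item sitting exactly at the threshold LSN still has to
be cleaned or relogged before the waiter is released.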

Factor the common AIL deletion code that does all the wakeups into a
helper so we only have one copy of this somewhat tricky code to
interface with all the wakeups necessary when the LSN of the log
tail changes.

xfs_ail_push_sync() is temporary infrastructure to facilitate
non-blocking, IO-less inode reclaim throttling that allows further
structural changes to be made. Once those structural changes are
made, the need for this function goes away and it is removed. In
essence, it is only provided to ensure git bisects don't break while
the changes to the reclaim algorithms are in progress.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans_ail.c  | 32 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans_priv.h |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 685a21cd24c0..3e1d0e1439e2 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -662,6 +662,36 @@ xfs_ail_push_all(
 		xfs_ail_push(ailp, threshold_lsn);
 }
 
+/*
+ * Push the AIL to a specific lsn and wait for it to complete.
+ */
+void
+xfs_ail_push_sync(
+	struct xfs_ail		*ailp,
+	xfs_lsn_t		threshold_lsn)
+{
+	struct xfs_log_item	*lip;
+	DEFINE_WAIT(wait);
+
+	spin_lock(&ailp->ail_lock);
+	while ((lip = xfs_ail_min(ailp)) != NULL) {
+		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
+		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
+		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
+			break;
+		if (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
+			ailp->ail_target = threshold_lsn;
+		wake_up_process(ailp->ail_task);
+		spin_unlock(&ailp->ail_lock);
+		schedule();
+		spin_lock(&ailp->ail_lock);
+	}
+	spin_unlock(&ailp->ail_lock);
+
+	finish_wait(&ailp->ail_push, &wait);
+}
+
+
 /*
  * Push out all items in the AIL immediately and wait until the AIL is empty.
  */
@@ -702,6 +732,7 @@ xfs_ail_update_finish(
 	if (!XFS_FORCED_SHUTDOWN(mp))
 		xlog_assign_tail_lsn_locked(mp);
 
+	wake_up_all(&ailp->ail_push);
 	if (list_empty(&ailp->ail_head))
 		wake_up_all(&ailp->ail_empty);
 	spin_unlock(&ailp->ail_lock);
@@ -858,6 +889,7 @@ xfs_trans_ail_init(
 	spin_lock_init(&ailp->ail_lock);
 	INIT_LIST_HEAD(&ailp->ail_buf_list);
 	init_waitqueue_head(&ailp->ail_empty);
+	init_waitqueue_head(&ailp->ail_push);
 
 	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->ail_mount->m_fsname);
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 35655eac01a6..1b6f4bbd47c0 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -61,6 +61,7 @@ struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
+	wait_queue_head_t	ail_push;
 };
 
 /*
@@ -113,6 +114,7 @@ xfs_trans_ail_remove(
 }
 
 void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
+void			xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t);
 void			xfs_ail_push_all(struct xfs_ail *);
 void			xfs_ail_push_all_sync(struct xfs_ail *);
 struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
-- 
2.24.0.rc0



* [PATCH 18/28] xfs: don't block kswapd in inode reclaim
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (16 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 17/28] xfs: synchronous AIL pushing Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-10-31 23:46 ` [PATCH 19/28] xfs: reduce kswapd blocking on inode locking Dave Chinner
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

We have a number of reasons for blocking kswapd in XFS inode
reclaim, mostly to do with the fact that memory reclaim has no
feedback mechanisms to throttle on dirty slab objects that need IO
to reclaim.

As a result, we currently throttle inode reclaim by issuing IO in
the reclaim context. The unfortunate side effect of this is that it
can cause long tail latencies in reclaim and for some workloads this
can be a problem.

Now that the shrinkers finally have a method of telling kswapd to
back off, we can start the process of making inode reclaim in XFS
non-blocking. The first step is to stop blocking kswapd. So that this
doesn't cause immediate serious problems, we also make sure inode
writeback is always underway when kswapd is running.

As we don't block kswapd now, we don't have to worry about reclaim
scans taking long delays due to IO being issued and waited for.
Hence while direct reclaim gets delayed by IO, kswapd will not and
so it will keep pushing the AIL to clean inodes. Hence direct
reclaim doesn't need to push the AIL anymore as kswapd will do it
reliably now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 944add5ff8e0..edcc3f6bb3bf 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1378,11 +1378,22 @@ xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
-	/* kick background reclaimer and push the AIL */
+	int			sync_mode = SYNC_TRYLOCK;
+
+	/* kick background reclaimer */
 	xfs_reclaim_work_queue(mp);
-	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	/*
+	 * For kswapd, we kick background inode writeback. For direct
+	 * reclaim, we issue and wait on inode writeback to throttle
+	 * reclaim rates and avoid shouty OOM-death.
+	 */
+	if (current_is_kswapd())
+		xfs_ail_push_all(mp->m_ail);
+	else
+		sync_mode |= SYNC_WAIT;
+
+	return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
 }
 
 /*
-- 
2.24.0.rc0



* [PATCH 19/28] xfs: reduce kswapd blocking on inode locking.
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (17 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 18/28] xfs: don't block kswapd in inode reclaim Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-05 17:05   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 20/28] xfs: kill background reclaim work Dave Chinner
                   ` (8 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

When doing async inode reclaim, we grab a batch of inodes that we
are likely able to reclaim and ignore those that are already
flushing. However, when we actually go to reclaim them, the first
thing we do is lock the inode. If we are racing with something
else reclaiming the inode or flushing it because it is dirty,
we block on the inode lock. Hence we can still block kswapd here.

Further, if we flush an inode, we also cluster all the other dirty
inodes in that cluster into the same IO, flush locking them all.
However, if the workload is operating on sequential inodes (e.g.
created by a tarball extraction) most of these inodes will be
sequential in the cache and so in the same batch
we've already grabbed for reclaim scanning.

As a result, it is common for all the inodes in the batch to be
dirty and it is common for the first inode flushed to also flush all
the inodes in the reclaim batch. In which case, they are now all
going to be flush locked and we do not want to block on them.

Hence, for async reclaim (SYNC_TRYLOCK) make sure we always use
trylock semantics and abort reclaim of an inode as quickly as we can
without blocking kswapd. This will be necessary for the upcoming
conversion to LRU lists for inode reclaim tracking.
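
The trylock-and-unwind pattern can be illustrated in userspace with
C11 atomic flags standing in for the inode lock and the flush lock.
This is a toy model under that assumption; none of the names below
are XFS functions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy inode with two non-blocking locks, taken in a fixed order. */
struct toy_inode {
	atomic_flag ilock;	/* stands in for XFS_ILOCK_EXCL */
	atomic_flag flush_lock;	/* stands in for the inode flush lock */
};

static bool toy_trylock(atomic_flag *l)
{
	/* test_and_set returns the previous state: false means we got it */
	return !atomic_flag_test_and_set(l);
}

static void toy_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/*
 * Try to take both locks without ever blocking. If the second lock is
 * busy, drop the first before giving up so nothing is left held.
 */
static bool toy_reclaim_trylock(struct toy_inode *ip)
{
	if (!toy_trylock(&ip->ilock))
		return false;
	if (!toy_trylock(&ip->flush_lock)) {
		toy_unlock(&ip->ilock);
		return false;
	}
	return true;
}
```

The unwind on the second failure corresponds to the new out_unlock
label: the inode lock is only dropped on paths that actually took it.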

Found via tracing and finding big batches of repeated lock/unlock
runs on inodes that we just flushed by write clustering during
reclaim.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_icache.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index edcc3f6bb3bf..189cf423fe8f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1104,11 +1104,23 @@ xfs_reclaim_inode(
 
 restart:
 	error = 0;
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (!xfs_iflock_nowait(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
+	/*
+	 * Don't try to flush the inode if another inode in this cluster has
+	 * already flushed it after we did the initial checks in
+	 * xfs_reclaim_inode_grab().
+	 */
+	if (sync_mode & SYNC_TRYLOCK) {
+		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 			goto out;
-		xfs_iflock(ip);
+		if (!xfs_iflock_nowait(ip))
+			goto out_unlock;
+	} else {
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		if (!xfs_iflock_nowait(ip)) {
+			if (!(sync_mode & SYNC_WAIT))
+				goto out_unlock;
+			xfs_iflock(ip);
+		}
 	}
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
@@ -1215,9 +1227,10 @@ xfs_reclaim_inode(
 
 out_ifunlock:
 	xfs_ifunlock(ip);
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	/*
 	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
 	 * a short while. However, this just burns CPU time scanning the tree
-- 
2.24.0.rc0



* [PATCH 20/28] xfs: kill background reclaim work
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (18 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 19/28] xfs: reduce kswapd blocking on inode locking Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-05 17:05   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 21/28] xfs: use AIL pushing for inode reclaim IO Dave Chinner
                   ` (7 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

This function is now entirely done by kswapd, so we don't need the
worker thread to do async reclaim anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_icache.c | 44 --------------------------------------------
 fs/xfs/xfs_icache.h |  2 --
 fs/xfs/xfs_mount.c  |  2 --
 fs/xfs/xfs_mount.h  |  2 --
 fs/xfs/xfs_super.c  | 11 +----------
 5 files changed, 1 insertion(+), 60 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 189cf423fe8f..7e175304e146 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -138,44 +138,6 @@ xfs_inode_free(
 	__xfs_inode_free(ip);
 }
 
-/*
- * Queue a new inode reclaim pass if there are reclaimable inodes and there
- * isn't a reclaim pass already in progress. By default it runs every 5s based
- * on the xfs periodic sync default of 30s. Perhaps this should have it's own
- * tunable, but that can be done if this method proves to be ineffective or too
- * aggressive.
- */
-static void
-xfs_reclaim_work_queue(
-	struct xfs_mount        *mp)
-{
-
-	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
-		queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work,
-			msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
-	}
-	rcu_read_unlock();
-}
-
-/*
- * This is a fast pass over the inode cache to try to get reclaim moving on as
- * many inodes as possible in a short period of time. It kicks itself every few
- * seconds, as well as being kicked by the inode cache shrinker when memory
- * goes low. It scans as quickly as possible avoiding locked inodes or those
- * already being flushed, and once done schedules a future pass.
- */
-void
-xfs_reclaim_worker(
-	struct work_struct *work)
-{
-	struct xfs_mount *mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_reclaim_work);
-
-	xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
-	xfs_reclaim_work_queue(mp);
-}
-
 static void
 xfs_perag_set_reclaim_tag(
 	struct xfs_perag	*pag)
@@ -192,9 +154,6 @@ xfs_perag_set_reclaim_tag(
 			   XFS_ICI_RECLAIM_TAG);
 	spin_unlock(&mp->m_perag_lock);
 
-	/* schedule periodic background inode reclaim */
-	xfs_reclaim_work_queue(mp);
-
 	trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
 }
 
@@ -1393,9 +1352,6 @@ xfs_reclaim_inodes_nr(
 {
 	int			sync_mode = SYNC_TRYLOCK;
 
-	/* kick background reclaimer */
-	xfs_reclaim_work_queue(mp);
-
 	/*
 	 * For kswapd, we kick background inode writeback. For direct
 	 * reclaim, we issue and wait on inode writeback to throttle
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 48f1fd2bb6ad..4c0d8920cc54 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -49,8 +49,6 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
-void xfs_reclaim_worker(struct work_struct *work);
-
 int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 3e8eedf01eb2..8f76c2add18b 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -952,7 +952,6 @@ xfs_mountfs(
 	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
 	 * quota inodes.
 	 */
-	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp, SYNC_WAIT);
 	xfs_health_unmount(mp);
  out_log_dealloc:
@@ -1035,7 +1034,6 @@ xfs_unmountfs(
 	 * reclaim just to be sure. We can stop background inode reclaim
 	 * here as well if it is still running.
 	 */
-	cancel_delayed_work_sync(&mp->m_reclaim_work);
 	xfs_reclaim_inodes(mp, SYNC_WAIT);
 	xfs_health_unmount(mp);
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a46cb3fd24b1..8c6885d3b085 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -163,7 +163,6 @@ typedef struct xfs_mount {
 	uint			m_chsize;	/* size of next field */
 	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
-	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
 						     trimming */
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
@@ -180,7 +179,6 @@ typedef struct xfs_mount {
 	struct workqueue_struct *m_buf_workqueue;
 	struct workqueue_struct	*m_unwritten_workqueue;
 	struct workqueue_struct	*m_cil_workqueue;
-	struct workqueue_struct	*m_reclaim_workqueue;
 	struct workqueue_struct *m_eofblocks_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ebe2ccd36127..a4fe679207ef 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -794,15 +794,10 @@ xfs_init_mount_workqueues(
 	if (!mp->m_cil_workqueue)
 		goto out_destroy_unwritten;
 
-	mp->m_reclaim_workqueue = alloc_workqueue("xfs-reclaim/%s",
-			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
-	if (!mp->m_reclaim_workqueue)
-		goto out_destroy_cil;
-
 	mp->m_eofblocks_workqueue = alloc_workqueue("xfs-eofblocks/%s",
 			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
 	if (!mp->m_eofblocks_workqueue)
-		goto out_destroy_reclaim;
+		goto out_destroy_cil;
 
 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", WQ_FREEZABLE, 0,
 					       mp->m_fsname);
@@ -813,8 +808,6 @@ xfs_init_mount_workqueues(
 
 out_destroy_eofb:
 	destroy_workqueue(mp->m_eofblocks_workqueue);
-out_destroy_reclaim:
-	destroy_workqueue(mp->m_reclaim_workqueue);
 out_destroy_cil:
 	destroy_workqueue(mp->m_cil_workqueue);
 out_destroy_unwritten:
@@ -831,7 +824,6 @@ xfs_destroy_mount_workqueues(
 {
 	destroy_workqueue(mp->m_sync_workqueue);
 	destroy_workqueue(mp->m_eofblocks_workqueue);
-	destroy_workqueue(mp->m_reclaim_workqueue);
 	destroy_workqueue(mp->m_cil_workqueue);
 	destroy_workqueue(mp->m_unwritten_workqueue);
 	destroy_workqueue(mp->m_buf_workqueue);
@@ -1520,7 +1512,6 @@ xfs_mount_alloc(
 	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
 	atomic_set(&mp->m_active_trans, 0);
-	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
 	INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
-- 
2.24.0.rc0



* [PATCH 21/28] xfs: use AIL pushing for inode reclaim IO
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (19 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 20/28] xfs: kill background reclaim work Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-05 17:06   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 22/28] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
                   ` (6 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim currently issues its own inode IO when it comes
across dirty inodes. This is used to throttle direct reclaim down to
the rate at which we can reclaim dirty inodes. Failure to throttle
in this manner results in the OOM killer being trivial to trigger
even when there is lots of free memory available.

However, having direct reclaimers issue IO causes an amount of
IO thrashing to occur. We can have up to the number of AGs in the
filesystem concurrently issuing IO, plus the AIL pushing thread as
well. This means we can have many competing sources of IO and they all
end up thrashing and competing for the request slots in the block
device.

Similar to dirty page throttling and the BDI flusher thread, we can
use the AIL pushing thread as the sole place we issue inode writeback
from and everything else waits for it to make progress. To do this,
reclaim will skip over dirty inodes, but in doing so will record the
lowest LSN of all the dirty inodes it skips. It will then push the
AIL to this LSN and wait for it to complete that work.
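
The skip-and-record step can be sketched as a userspace scan over a
batch. This is a simplified model: the struct, the NO_LSN sentinel
(playing the role NULLCOMMITLSN has in the patch) and the function
name are all hypothetical, and LSNs are plain integers.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_LSN	UINT64_MAX	/* "no dirty inode seen yet" sentinel */

struct scan_item {
	uint64_t dirty_lsn;	/* 0 when the inode is clean */
};

/*
 * Walk a batch, counting clean inodes as reclaimed and skipping dirty
 * ones. For every dirty inode skipped, remember the lowest LSN seen so
 * the caller knows how far the AIL has to be pushed.
 */
static size_t scan_batch(const struct scan_item *batch, size_t n,
			 uint64_t *lowest_lsn)
{
	size_t reclaimed = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t lsn = batch[i].dirty_lsn;

		if (lsn == 0) {
			reclaimed++;
			continue;
		}
		if (lsn < *lowest_lsn)
			*lowest_lsn = lsn;
	}
	return reclaimed;
}
```

If the sentinel is still in place after the scan, nothing dirty was
skipped and there is nothing to push or wait for.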

In doing so, we block direct reclaim on the completion of at least
one IO, thereby providing some level of throttling for when we
encounter dirty inodes. However we gain the ability to scan and
reclaim clean inodes in a non-blocking fashion.

Hence direct reclaim will be throttled directly by the rate at which
dirty inodes are cleaned by AIL pushing, rather than by delays
caused by competing IO submissions. This allows us to reduce the
locking that limits direct reclaim concurrency to just protecting
the reclaim cursor state, hence greatly simplifying the inode
reclaim code as it now just skips dirty inodes.

Note: this patch by itself isn't completely able to throttle direct
reclaim sufficiently to prevent OOM killer madness. We can't do that
until we change the way we index reclaimable inodes in the next
patch and can feed back state to the mm core sanely.  However, we
can't change the way we index reclaimable inodes until we have
IO-less non-blocking reclaim for both direct reclaim and kswapd
reclaim.  Catch-22...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 218 ++++++++++++++++++--------------------------
 1 file changed, 89 insertions(+), 129 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7e175304e146..ff8ae32614a6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -22,6 +22,7 @@
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "xfs_log.h"
 
 #include <linux/iversion.h>
 
@@ -967,28 +968,42 @@ xfs_inode_ag_iterator_tag(
 }
 
 /*
- * Grab the inode for reclaim exclusively.
- * Return 0 if we grabbed it, non-zero otherwise.
+ * Grab the inode for reclaim.
+ *
+ * Return false if we aren't going to reclaim it, true if it is a reclaim
+ * candidate.
+ *
+ * If the inode is clean or unreclaimable, return 0 to tell the caller it does
+ * not require flushing. Otherwise return the log item lsn of the inode so the
+ * caller can determine its inode flush target.  If we get the clean/dirty
+ * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks
+ * held.
  */
-STATIC int
+STATIC bool
 xfs_reclaim_inode_grab(
 	struct xfs_inode	*ip,
-	int			flags)
+	int			flags,
+	xfs_lsn_t		*lsn)
 {
 	ASSERT(rcu_read_lock_held());
+	*lsn = 0;
 
 	/* quick check for stale RCU freed inode */
 	if (!ip->i_ino)
-		return 1;
+		return false;
 
 	/*
-	 * If we are asked for non-blocking operation, do unlocked checks to
-	 * see if the inode already is being flushed or in reclaim to avoid
-	 * lock traffic.
+	 * Do unlocked checks to see if the inode already is being flushed or in
+	 * reclaim to avoid lock traffic. If the inode is not clean, return the
+	 * position in the AIL for the caller to push to.
 	 */
-	if ((flags & SYNC_TRYLOCK) &&
-	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
-		return 1;
+	if (!xfs_inode_clean(ip)) {
+		*lsn = ip->i_itemp->ili_item.li_lsn;
+		return false;
+	}
+
+	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
+		return false;
 
 	/*
 	 * The radix tree lock here protects a thread in xfs_iget from racing
@@ -1005,11 +1020,11 @@ xfs_reclaim_inode_grab(
 	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
 		/* not a reclaim candidate. */
 		spin_unlock(&ip->i_flags_lock);
-		return 1;
+		return false;
 	}
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
 	spin_unlock(&ip->i_flags_lock);
-	return 0;
+	return true;
 }
 
 /*
@@ -1050,92 +1065,61 @@ xfs_reclaim_inode_grab(
  *	clean		=> reclaim
  *	dirty, async	=> requeue
  *	dirty, sync	=> flush, wait and reclaim
+ *
+ * Returns true if the inode was reclaimed, false otherwise.
  */
-STATIC int
+STATIC bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
-	int			sync_mode)
+	xfs_lsn_t		*lsn)
 {
-	struct xfs_buf		*bp = NULL;
-	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
-	int			error;
+	xfs_ino_t		ino;
+
+	*lsn = 0;
 
-restart:
-	error = 0;
 	/*
 	 * Don't try to flush the inode if another inode in this cluster has
 	 * already flushed it after we did the initial checks in
 	 * xfs_reclaim_inode_grab().
 	 */
-	if (sync_mode & SYNC_TRYLOCK) {
-		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
-			goto out;
-		if (!xfs_iflock_nowait(ip))
-			goto out_unlock;
-	} else {
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		if (!xfs_iflock_nowait(ip)) {
-			if (!(sync_mode & SYNC_WAIT))
-				goto out_unlock;
-			xfs_iflock(ip);
-		}
-	}
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+		goto out;
+	if (!xfs_iflock_nowait(ip))
+		goto out_unlock;
 
+	/* If we are in shutdown, we don't care about blocking. */
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
 		/* xfs_iflush_abort() drops the flush lock */
 		xfs_iflush_abort(ip, false);
 		goto reclaim;
 	}
+
+	/* Can't do anything to pinned inodes without blocking, skip it. */
 	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out_ifunlock;
-		xfs_iunpin_wait(ip);
-	}
-	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		goto reclaim;
+		*lsn = ip->i_itemp->ili_item.li_lsn;
+		goto out_ifunlock;
 	}
 
 	/*
-	 * Never flush out dirty data during non-blocking reclaim, as it would
-	 * just contend with AIL pushing trying to do the same job.
+	 * Dirty inode we didn't catch, skip it.
 	 */
-	if (!(sync_mode & SYNC_WAIT))
+	if (!xfs_inode_clean(ip) && !xfs_iflags_test(ip, XFS_ISTALE)) {
+		*lsn = ip->i_itemp->ili_item.li_lsn;
 		goto out_ifunlock;
+	}
 
 	/*
-	 * Now we have an inode that needs flushing.
-	 *
-	 * Note that xfs_iflush will never block on the inode buffer lock, as
-	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
-	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
-	 * doing a blocking xfs_imap_to_bp() to get the cluster buffer would
-	 * result in an ABBA deadlock with xfs_ifree_cluster().
-	 *
-	 * As xfs_ifree_cluser() must gather all inodes that are active in the
-	 * cache to mark them stale, if we hit this case we don't actually want
-	 * to do IO here - we want the inode marked stale so we can simply
-	 * reclaim it.  Hence if we get an EAGAIN error here,  just unlock the
-	 * inode, back off and try again.  Hopefully the next pass through will
-	 * see the stale flag set on the inode.
+	 * It's clean, we have it locked, we can now drop the flush lock
+	 * and reclaim it.
 	 */
-	error = xfs_iflush(ip, &bp);
-	if (error == -EAGAIN) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		/* backoff longer than in xfs_ifree_cluster */
-		delay(2);
-		goto restart;
-	}
-
-	if (!error) {
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-	}
+	xfs_ifunlock(ip);
 
 reclaim:
 	ASSERT(!xfs_isiflocked(ip));
+	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE));
+	ASSERT(ip->i_ino != 0);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1148,6 +1132,7 @@ xfs_reclaim_inode(
 	 * will see an invalid inode that it can skip.
 	 */
 	spin_lock(&ip->i_flags_lock);
+	ino = ip->i_ino; /* for radix_tree_delete */
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
@@ -1182,7 +1167,7 @@ xfs_reclaim_inode(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	__xfs_inode_free(ip);
-	return error;
+	return true;
 
 out_ifunlock:
 	xfs_ifunlock(ip);
@@ -1190,14 +1175,7 @@ xfs_reclaim_inode(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	/*
-	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
-	 * a short while. However, this just burns CPU time scanning the tree
-	 * waiting for IO to complete and the reclaim work never goes back to
-	 * the idle state. Instead, return 0 to let the next scheduled
-	 * background reclaim attempt to reclaim the inode again.
-	 */
-	return 0;
+	return false;
 }
 
 /*
@@ -1205,55 +1183,42 @@ xfs_reclaim_inode(
  * corrupted, we still want to try to reclaim all the inodes. If we don't,
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
+ *
+ * Return the number of inodes freed.
  */
 STATIC int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
-	int			*nr_to_scan)
+	int			nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	int			error = 0;
-	int			last_error = 0;
 	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
-	int			skipped;
+	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
+	long			freed = 0;
 
-restart:
 	ag = 0;
-	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
 		int		nr_found = 0;
 
 		ag = pag->pag_agno + 1;
-
-		if (trylock) {
-			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
-				xfs_perag_put(pag);
-				continue;
-			}
-			first_index = pag->pag_ici_reclaim_cursor;
-		} else
-			mutex_lock(&pag->pag_ici_reclaim_lock);
-
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
 
+			mutex_lock(&pag->pag_ici_reclaim_lock);
+			first_index = pag->pag_ici_reclaim_cursor;
+
 			rcu_read_lock();
 			nr_found = radix_tree_gang_lookup_tag(
 					&pag->pag_ici_root,
 					(void **)batch, first_index,
 					XFS_LOOKUP_BATCH,
 					XFS_ICI_RECLAIM_TAG);
-			if (!nr_found) {
+			if (!nr_found)
 				done = 1;
-				rcu_read_unlock();
-				break;
-			}
 
 			/*
 			 * Grab the inodes before we drop the lock. if we found
@@ -1262,9 +1227,13 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done ||
+				    !xfs_reclaim_inode_grab(ip, flags, &lsn))
 					batch[i] = NULL;
 
+				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
+					lowest_lsn = lsn;
+
 				/*
 				 * Update the index for the next lookup. Catch
 				 * overflows into the next AG range which can
@@ -1289,41 +1258,34 @@ xfs_reclaim_inodes_ag(
 
 			/* unlock now we've grabbed the inodes. */
 			rcu_read_unlock();
+			if (!done)
+				pag->pag_ici_reclaim_cursor = first_index;
+			else
+				pag->pag_ici_reclaim_cursor = 0;
+			mutex_unlock(&pag->pag_ici_reclaim_lock);
 
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				error = xfs_reclaim_inode(batch[i], pag, flags);
-				if (error && last_error != -EFSCORRUPTED)
-					last_error = error;
+				if (xfs_reclaim_inode(batch[i], pag, &lsn))
+					freed++;
+				if (lsn && (lowest_lsn == NULLCOMMITLSN ||
+				            XFS_LSN_CMP(lsn, lowest_lsn) < 0))
+					lowest_lsn = lsn;
 			}
 
-			*nr_to_scan -= XFS_LOOKUP_BATCH;
-
+			nr_to_scan -= XFS_LOOKUP_BATCH;
 			cond_resched();
 
-		} while (nr_found && !done && *nr_to_scan > 0);
+		} while (nr_found && !done && nr_to_scan > 0);
 
-		if (trylock && !done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
-		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
 
-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
-	return last_error;
+	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
+		xfs_ail_push_sync(mp->m_ail, lowest_lsn);
+
+	return freed;
 }
 
 int
@@ -1331,9 +1293,7 @@ xfs_reclaim_inodes(
 	xfs_mount_t	*mp,
 	int		mode)
 {
-	int		nr_to_scan = INT_MAX;
-
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, mode, INT_MAX);
 }
 
 /*
@@ -1350,7 +1310,7 @@ xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
-	int			sync_mode = SYNC_TRYLOCK;
+	int			sync_mode = 0;
 
 	/*
 	 * For kswapd, we kick background inode writeback. For direct
@@ -1362,7 +1322,7 @@ xfs_reclaim_inodes_nr(
 	else
 		sync_mode |= SYNC_WAIT;
 
-	return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
 }
 
 /*
-- 
2.24.0.rc0
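
The lowest-LSN tracking used in the reclaim walk above can be sketched
in userspace as follows. This is a minimal model, not the kernel code:
lsn_t, NO_LSN, lsn_cmp(), make_lsn() and track_lowest_lsn() are
hypothetical names for illustration. It assumes an LSN packs a cycle
number in the upper 32 bits and a block number in the lower 32 bits,
and that an all-ones value is the "nothing recorded" sentinel, in the
way NULLCOMMITLSN is used in the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t lsn_t;

/* Sentinel meaning "no LSN recorded yet" (modelled on NULLCOMMITLSN). */
#define NO_LSN ((lsn_t)-1)

/* Assumed layout: cycle in the high 32 bits, block in the low 32 bits. */
static uint32_t lsn_cycle(lsn_t lsn) { return (uint32_t)((uint64_t)lsn >> 32); }
static uint32_t lsn_block(lsn_t lsn) { return (uint32_t)(uint64_t)lsn; }

static lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Order by cycle first, then by block. Returns <0, 0 or >0. */
static int lsn_cmp(lsn_t a, lsn_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * Fold one inode's LSN into the running minimum. An lsn of 0 means the
 * inode had nothing to push. The sentinel is tested explicitly rather
 * than relying on the comparison to order it after every real LSN.
 */
static lsn_t track_lowest_lsn(lsn_t lowest, lsn_t lsn)
{
	/* In this sketch the sentinel is never passed in as an input LSN. */
	assert(lsn != NO_LSN);

	if (!lsn)
		return lowest;
	if (lowest == NO_LSN || lsn_cmp(lsn, lowest) < 0)
		return lsn;
	return lowest;
}
```

A caller would start with lowest = NO_LSN, fold in the LSN of every
inode it had to skip, and only issue a push if the result is still not
NO_LSN at the end of the walk.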



* [PATCH 22/28] xfs: remove mode from xfs_reclaim_inodes()
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (20 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 21/28] xfs: use AIL pushing for inode reclaim IO Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-10-31 23:46 ` [PATCH 23/28] xfs: track reclaimable inodes using a LRU list Dave Chinner
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The mode argument is always SYNC_WAIT now, so remove it. Rename the
function to xfs_reclaim_all_inodes() to make it clear how it differs
from the other similarly named reclaim functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 9 ++++-----
 fs/xfs/xfs_icache.h | 2 +-
 fs/xfs/xfs_mount.c  | 4 ++--
 fs/xfs/xfs_super.c  | 3 +--
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ff8ae32614a6..048f7f1b54ff 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1288,12 +1288,11 @@ xfs_reclaim_inodes_ag(
 	return freed;
 }
 
-int
-xfs_reclaim_inodes(
-	xfs_mount_t	*mp,
-	int		mode)
+void
+xfs_reclaim_all_inodes(
+	struct xfs_mount	*mp)
 {
-	return xfs_reclaim_inodes_ag(mp, mode, INT_MAX);
+	xfs_reclaim_inodes_ag(mp, SYNC_WAIT, INT_MAX);
 }
 
 /*
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 4c0d8920cc54..37cd33741bee 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -49,7 +49,7 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+void xfs_reclaim_all_inodes(struct xfs_mount *mp);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 8f76c2add18b..5f3fd1d8f63f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -952,7 +952,7 @@ xfs_mountfs(
 	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
 	 * quota inodes.
 	 */
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_all_inodes(mp);
 	xfs_health_unmount(mp);
  out_log_dealloc:
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
@@ -1034,7 +1034,7 @@ xfs_unmountfs(
 	 * reclaim just to be sure. We can stop background inode reclaim
 	 * here as well if it is still running.
 	 */
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_all_inodes(mp);
 	xfs_health_unmount(mp);
 
 	xfs_qm_unmount(mp);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a4fe679207ef..456a398aad82 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1151,8 +1151,7 @@ xfs_quiesce_attr(
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
 	/* reclaim inodes to do any IO before the freeze completes */
-	xfs_reclaim_inodes(mp, 0);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_all_inodes(mp);
 
 	/* Push the superblock and write an unmount record */
 	error = xfs_log_sbcount(mp);
-- 
2.24.0.rc0



* [PATCH 23/28] xfs: track reclaimable inodes using a LRU list
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (21 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 22/28] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-10-31 23:46 ` [PATCH 24/28] xfs: reclaim inodes from the LRU Dave Chinner
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Now that we don't do IO from the inode reclaim code, there is no
need to optimise inode scanning order for optimal IO
characteristics. The AIL takes care of that for us, so now reclaim
can focus on selecting the best inodes to reclaim.

Hence we can change the inode reclaim algorithm to a real LRU and
remove the need to use the radix tree to track and walk inodes under
reclaim. This frees up a radix tree bit and simplifies the code that
marks inodes as reclaim candidates. It also simplifies the reclaim
code - we don't need batching anymore and all the reclaim logic
can be added to the LRU isolation callback.

Further, we get node aware reclaim at the xfs_inode level, which
should help the per-node reclaim code free relevant inodes faster.

We can re-use the VFS inode lru pointers - once the inode has been
reclaimed from the VFS, we can use these pointers ourselves. Hence
we don't need to grow the inode to change the way we index
reclaimable inodes.

Start by adding the list_lru tracking in parallel with the existing
reclaim code. This makes it easier to see the LRU infrastructure
separately from the reclaim algorithm changes, especially the locking
order, which is ip->i_flags_lock -> list_lru lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
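The re-use of the embedded LRU pointers described above can be
sketched in userspace as follows. This is only a model under stated
assumptions: struct model_inode, struct model_lru and the lru_*()
helpers are hypothetical stand-ins, not the kernel's list_lru API. The
point is that the list linkage lives inside the object, so tracking it
on an LRU costs no extra memory, and the owning object is recovered
from the list node with container_of().

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
	struct list_node *prev, *next;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical inode: the lru member stands in for the VFS i_lru. */
struct model_inode {
	unsigned long		ino;
	struct list_node	lru;
};

struct model_lru {
	struct list_node	head;
	long			nr_items;
};

static void node_init(struct list_node *n) { n->prev = n->next = n; }
static int node_unlinked(const struct list_node *n) { return n->next == n; }

static void lru_init(struct model_lru *l)
{
	node_init(&l->head);
	l->nr_items = 0;
}

/* Add at the tail (most recent). Returns 1 if added, 0 if already listed. */
static int lru_add(struct model_lru *l, struct list_node *n)
{
	if (!node_unlinked(n))
		return 0;
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
	l->nr_items++;
	return 1;
}

/* Remove from the list. Returns 1 if removed, 0 if it was not listed. */
static int lru_del(struct model_lru *l, struct list_node *n)
{
	if (node_unlinked(n))
		return 0;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
	l->nr_items--;
	return 1;
}

/* Oldest entry, recovered from its embedded node, or NULL if empty. */
static struct model_inode *lru_oldest(struct model_lru *l)
{
	if (l->head.next == &l->head)
		return NULL;
	return container_of(l->head.next, struct model_inode, lru);
}
```

In the patch the add happens with ip->i_flags_lock held, so the
list_lru lock nests inside it; the sketch is single-threaded and
leaves that locking out.
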
 fs/xfs/xfs_icache.c | 32 ++++++++------------------------
 fs/xfs/xfs_icache.h |  1 -
 fs/xfs/xfs_mount.h  |  1 +
 fs/xfs/xfs_super.c  | 30 ++++++++++++++++++++++--------
 4 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 048f7f1b54ff..350f42e7730b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -198,6 +198,8 @@ xfs_inode_set_reclaim_tag(
 	xfs_perag_set_reclaim_tag(pag);
 	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
 
+	list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
+
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
@@ -370,12 +372,10 @@ xfs_iget_cache_hit(
 
 		/*
 		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
-		 * from stomping over us while we recycle the inode.  We can't
-		 * clear the radix tree reclaimable tag yet as it requires
-		 * pag_ici_lock to be held exclusive.
+		 * from stomping over us while we recycle the inode. Remove it
+		 * from the LRU straight away so we can re-init the VFS inode.
 		 */
 		ip->i_flags |= XFS_IRECLAIM;
-
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 
@@ -407,6 +407,7 @@ xfs_iget_cache_hit(
 		 */
 		ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 		ip->i_flags |= XFS_INEW;
+		list_lru_del(&mp->m_inode_lru, &inode->i_lru);
 		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
 		inode->i_state = I_NEW;
 		ip->i_sick = 0;
@@ -1135,6 +1136,9 @@ xfs_reclaim_inode(
 	ino = ip->i_ino; /* for radix_tree_delete */
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
+
+	/* XXX: temporary until lru based reclaim */
+	list_lru_del(&pag->pag_mount->m_inode_lru, &VFS_I(ip)->i_lru);
 	spin_unlock(&ip->i_flags_lock);
 
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1324,26 +1328,6 @@ xfs_reclaim_inodes_nr(
 	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
 }
 
-/*
- * Return the number of reclaimable inodes in the filesystem for
- * the shrinker to determine how much to reclaim.
- */
-int
-xfs_reclaim_inodes_count(
-	struct xfs_mount	*mp)
-{
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		ag = 0;
-	int			reclaimable = 0;
-
-	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
-		ag = pag->pag_agno + 1;
-		reclaimable += pag->pag_ici_reclaimable;
-		xfs_perag_put(pag);
-	}
-	return reclaimable;
-}
-
 STATIC int
 xfs_inode_match_id(
 	struct xfs_inode	*ip,
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 37cd33741bee..afd692b06c13 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -50,7 +50,6 @@ struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
 void xfs_reclaim_all_inodes(struct xfs_mount *mp);
-int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8c6885d3b085..4f153ee17e18 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -75,6 +75,7 @@ typedef struct xfs_mount {
 	uint8_t			m_rt_sick;
 
 	struct xfs_ail		*m_ail;		/* fs active log item list */
+	struct list_lru		m_inode_lru;
 
 	struct xfs_sb		m_sb;		/* copy of fs superblock */
 	spinlock_t		m_sb_lock;	/* sb counter lock */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 456a398aad82..98ffbe42f8ae 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -891,28 +891,31 @@ xfs_fs_destroy_inode(
 	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
 
 	trace_xfs_destroy_inode(ip);
 
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-	XFS_STATS_INC(ip->i_mount, vn_rele);
-	XFS_STATS_INC(ip->i_mount, vn_remove);
+	XFS_STATS_INC(mp, vn_rele);
+	XFS_STATS_INC(mp, vn_remove);
 
 	xfs_inactive(ip);
 
-	if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) {
+	if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
 		xfs_check_delalloc(ip, XFS_COW_FORK);
 		ASSERT(0);
 	}
 
-	XFS_STATS_INC(ip->i_mount, vn_reclaim);
+	XFS_STATS_INC(mp, vn_reclaim);
 
 	/*
 	 * We should never get here with one of the reclaim flags already set.
 	 */
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
+	spin_lock(&ip->i_flags_lock);
+	ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+	ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIM));
+	spin_unlock(&ip->i_flags_lock);
 
 	/*
 	 * We always use background reclaim here because even if the
@@ -1544,9 +1547,16 @@ xfs_fs_fill_super(
 		goto out;
 	sb->s_fs_info = mp;
 
+	/*
+	 * The inode lru needs to be associated with the superblock shrinker,
+	 * and like the rest of the superblock shrinker, it's memcg aware.
+	 */
+	if (list_lru_init_memcg(&mp->m_inode_lru, &sb->s_shrink))
+		goto out_free_fsname;
+
 	error = xfs_parseargs(mp, (char *)data);
 	if (error)
-		goto out_free_fsname;
+		goto out_free_lru;
 
 	sb_min_blocksize(sb, BBSIZE);
 	sb->s_xattr = xfs_xattr_handlers;
@@ -1710,6 +1720,8 @@ xfs_fs_fill_super(
 	xfs_destroy_mount_workqueues(mp);
  out_close_devices:
 	xfs_close_devices(mp);
+ out_free_lru:
+	list_lru_destroy(&mp->m_inode_lru);
  out_free_fsname:
 	sb->s_fs_info = NULL;
 	xfs_free_fsname(mp);
@@ -1743,6 +1755,7 @@ xfs_fs_put_super(
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
 
+	list_lru_destroy(&mp->m_inode_lru);
 	sb->s_fs_info = NULL;
 	xfs_free_fsname(mp);
 	kfree(mp);
@@ -1766,7 +1779,8 @@ xfs_fs_nr_cached_objects(
 	/* Paranoia: catch incorrect calls during mount setup or teardown */
 	if (WARN_ON_ONCE(!sb->s_fs_info))
 		return 0;
-	return xfs_reclaim_inodes_count(XFS_M(sb));
+
+	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
 }
 
 static long
-- 
2.24.0.rc0



* [PATCH 24/28] xfs: reclaim inodes from the LRU
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (22 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 23/28] xfs: track reclaimable inodes using a LRU list Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-06 17:21   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 25/28] xfs: remove unusued old inode reclaim code Dave Chinner
                   ` (3 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Replace the AG radix tree walking reclaim code with a list_lru
walker, giving us both node-aware and memcg-aware inode reclaim
at the XFS level. This requires adding an inode isolation function to
determine if the inode can be reclaimed, and a list walker to
dispose of the inodes that were isolated.

We want the isolation function to be non-blocking. If we can't
grab an inode then we either skip it or rotate it: if it's clean
then we skip it, and if it's dirty then we rotate it to give it time
to be cleaned before it is scanned again.

This congregates the dirty inodes at the tail of the LRU, which
means that if we start hitting a majority of dirty inodes either
there are lots of unlinked inodes in the reclaim list or we've
reclaimed all the clean inodes and we've looped back onto the dirty
inodes. Either way, this is an indication we should tell kswapd to
back off.

The non-blocking isolation function introduces a complexity for the
filesystem shutdown case. When the filesystem is shut down, we want
to free the inode even if it is dirty, and this may require
blocking. We already hold the locks needed to do this blocking, so
we leave the inodes locked - both the ILOCK and the
flush lock - while they are sitting on the dispose list to be freed
after the LRU walk completes.  This allows us to process the
shutdown state outside the LRU walk where we can block safely.

Because we are now reclaiming inodes from the context that needs
memory (memcg and/or node), direct reclaim throttling within the
high level reclaim code is now much more effective. Hence we don't
wait on IO for either kswapd or direct reclaim. However, we have to
tell kswapd to back off if we start hitting too many dirty inodes.
This implies we've wrapped around the LRU and don't have many clean
inodes left to reclaim, so it needs to wait a while for the AIL
pushing to clean some of the remaining reclaimable inodes.

Keep in mind we don't have to care about inode lock order or
blocking with inode locks held here because a) we are using
trylocks, and b) once marked with XFS_IRECLAIM they can't be found
via the LRU and inode cache lookups will abort and retry. Hence
nobody will try to lock them in any other context that might also be
holding other inode locks.

Also convert xfs_reclaim_all_inodes() to use a LRU walk to free all
the reclaimable inodes in the filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
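The decision order of the isolation callback can be sketched in
userspace as follows. This is a single-threaded model under stated
assumptions, not the kernel function: struct model_inode, struct
model_args, MODEL_NO_LSN and isolate() are hypothetical names, a
failed trylock is modelled as a "busy" boolean, and LSNs are compared
as plain integers. The model follows the order of checks in the diff
below, where a busy flush lock returns the skip status while still
being counted as a dirty skip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_status { MODEL_REMOVED, MODEL_ROTATE, MODEL_SKIP };

#define MODEL_NO_LSN UINT64_MAX	/* "no LSN recorded" sentinel */

struct model_inode {
	bool		flags_lock_busy;	/* i_flags_lock trylock would fail */
	bool		ilock_busy;		/* ILOCK trylock would fail */
	bool		flush_lock_busy;	/* flush lock trylock would fail */
	bool		dirty;
	bool		stale;
	bool		pinned;
	uint64_t	lsn;			/* position in the AIL, 0 if none */
};

struct model_args {
	long		dirty_skipped;
	uint64_t	lowest_lsn;
};

static enum model_status
isolate(const struct model_inode *ip, bool shutdown, struct model_args *ra)
{
	enum model_status ret;
	uint64_t lsn = 0;

	assert(ip != NULL && ra != NULL);

	if (ip->flags_lock_busy)
		return MODEL_SKIP;

	/* Unlocked check: a dirty inode is rotated unless we are shut down. */
	ret = MODEL_ROTATE;
	if (ip->dirty && !ip->stale && !shutdown) {
		lsn = ip->lsn;
		ra->dirty_skipped++;
		goto out;
	}

	ret = MODEL_SKIP;
	if (ip->ilock_busy)
		goto out;
	if (ip->flush_lock_busy) {
		lsn = ip->lsn;
		ra->dirty_skipped++;
		goto out;
	}

	if (!shutdown) {
		/*
		 * The real code rechecks dirty state here under the locks;
		 * in this single-threaded model only the pin count is new.
		 */
		ret = MODEL_ROTATE;
		if (ip->pinned) {
			ra->dirty_skipped++;
			goto out;
		}
	}
	return MODEL_REMOVED;

out:
	if (lsn && lsn < ra->lowest_lsn)
		ra->lowest_lsn = lsn;
	return ret;
}
```

A walker built on this would move MODEL_REMOVED inodes to a dispose
list and use dirty_skipped and lowest_lsn afterwards to decide whether
to back off or push the AIL.
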
 fs/xfs/xfs_icache.c | 404 +++++++++++++-------------------------------
 fs/xfs/xfs_icache.h |  18 +-
 fs/xfs/xfs_inode.h  |  18 ++
 fs/xfs/xfs_super.c  |  46 ++++-
 4 files changed, 190 insertions(+), 296 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 350f42e7730b..05dd292bfdb6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -968,160 +968,110 @@ xfs_inode_ag_iterator_tag(
 	return last_error;
 }
 
-/*
- * Grab the inode for reclaim.
- *
- * Return false if we aren't going to reclaim it, true if it is a reclaim
- * candidate.
- *
- * If the inode is clean or unreclaimable, return 0 to tell the caller it does
- * not require flushing. Otherwise return the log item lsn of the inode so the
- * caller can determine it's inode flush target.  If we get the clean/dirty
- * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks
- * held.
- */
-STATIC bool
-xfs_reclaim_inode_grab(
-	struct xfs_inode	*ip,
-	int			flags,
-	xfs_lsn_t		*lsn)
+enum lru_status
+xfs_inode_reclaim_isolate(
+	struct list_head	*item,
+	struct list_lru_one	*lru,
+	spinlock_t		*lru_lock,
+	void			*arg)
 {
-	ASSERT(rcu_read_lock_held());
-	*lsn = 0;
+        struct xfs_ireclaim_args *ra = arg;
+        struct inode		*inode = container_of(item, struct inode,
+						      i_lru);
+        struct xfs_inode	*ip = XFS_I(inode);
+	enum lru_status		ret;
+	xfs_lsn_t		lsn = 0;
+
+	/* Careful: inversion of iflags_lock and everything else here */
+	if (!spin_trylock(&ip->i_flags_lock))
+		return LRU_SKIP;
+
+	/* if we are in shutdown, we'll reclaim it even if dirty */
+	ret = LRU_ROTATE;
+	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE) &&
+	    !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_unlock_flags;
+	}
 
-	/* quick check for stale RCU freed inode */
-	if (!ip->i_ino)
-		return false;
+	ret = LRU_SKIP;
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+		goto out_unlock_flags;
 
-	/*
-	 * Do unlocked checks to see if the inode already is being flushed or in
-	 * reclaim to avoid lock traffic. If the inode is not clean, return the
-	 * position in the AIL for the caller to push to.
-	 */
-	if (!xfs_inode_clean(ip)) {
-		*lsn = ip->i_itemp->ili_item.li_lsn;
-		return false;
+	if (!__xfs_iflock_nowait(ip)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_unlock_inode;
 	}
 
-	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
-		return false;
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
+		goto reclaim;
 
 	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode.  Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 *
-	 * Due to RCU lookup, we may find inodes that have been freed and only
-	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
-	 * aren't candidates for reclaim at all, so we must check the
-	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
+	 * Now the inode is locked, we can actually determine if it is dirty
+	 * without racing with anything.
 	 */
-	spin_lock(&ip->i_flags_lock);
-	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
-	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
-		/* not a reclaim candidate. */
-		spin_unlock(&ip->i_flags_lock);
-		return false;
+	ret = LRU_ROTATE;
+	if (xfs_ipincount(ip)) {
+		ra->dirty_skipped++;
+		goto out_ifunlock;
+	}
+	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
+		lsn = ip->i_itemp->ili_item.li_lsn;
+		ra->dirty_skipped++;
+		goto out_ifunlock;
 	}
+
+reclaim:
+	/*
+	 * Once we mark the inode with XFS_IRECLAIM, no-one will grab it again.
+	 * RCU lookups will still find the inode, but they'll stop when they set
+	 * the IRECLAIM flag. Hence we can leave the inode locked as we move it
+	 * to the dispose list so we can deal with shutdown cleanup there
+	 * outside the LRU lock context.
+	 */
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
+	list_lru_isolate_move(lru, &inode->i_lru, &ra->freeable);
 	spin_unlock(&ip->i_flags_lock);
-	return true;
+	return LRU_REMOVED;
+
+out_ifunlock:
+	__xfs_ifunlock(ip);
+out_unlock_inode:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_unlock_flags:
+	spin_unlock(&ip->i_flags_lock);
+
+	if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
+		ra->lowest_lsn = lsn;
+	return ret;
 }
 
-/*
- * Inodes in different states need to be treated differently. The following
- * table lists the inode states and the reclaim actions necessary:
- *
- *	inode state	     iflush ret		required action
- *      ---------------      ----------         ---------------
- *	bad			-		reclaim
- *	shutdown		EIO		unpin and reclaim
- *	clean, unpinned		0		reclaim
- *	stale, unpinned		0		reclaim
- *	clean, pinned(*)	0		requeue
- *	stale, pinned		EAGAIN		requeue
- *	dirty, async		-		requeue
- *	dirty, sync		0		reclaim
- *
- * (*) dgc: I don't think the clean, pinned state is possible but it gets
- * handled anyway given the order of checks implemented.
- *
- * Also, because we get the flush lock first, we know that any inode that has
- * been flushed delwri has had the flush completed by the time we check that
- * the inode is clean.
- *
- * Note that because the inode is flushed delayed write by AIL pushing, the
- * flush lock may already be held here and waiting on it can result in very
- * long latencies.  Hence for sync reclaims, where we wait on the flush lock,
- * the caller should push the AIL first before trying to reclaim inodes to
- * minimise the amount of time spent waiting.  For background relaim, we only
- * bother to reclaim clean inodes anyway.
- *
- * Hence the order of actions after gaining the locks should be:
- *	bad		=> reclaim
- *	shutdown	=> unpin and reclaim
- *	pinned, async	=> requeue
- *	pinned, sync	=> unpin
- *	stale		=> reclaim
- *	clean		=> reclaim
- *	dirty, async	=> requeue
- *	dirty, sync	=> flush, wait and reclaim
- *
- * Returns true if the inode was reclaimed, false otherwise.
- */
-STATIC bool
-xfs_reclaim_inode(
-	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	xfs_lsn_t		*lsn)
+static void
+xfs_dispose_inode(
+	struct xfs_inode	*ip)
 {
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
 	xfs_ino_t		ino;
 
-	*lsn = 0;
+	ASSERT(xfs_isiflocked(ip));
+	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE) ||
+	       XFS_FORCED_SHUTDOWN(mp));
+	ASSERT(ip->i_ino != 0);
 
 	/*
-	 * Don't try to flush the inode if another inode in this cluster has
-	 * already flushed it after we did the initial checks in
-	 * xfs_reclaim_inode_grab().
+	 * Process the shutdown reclaim work we deferred from the LRU isolation
+	 * callback before we go any further.
 	 */
-	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
-		goto out;
-	if (!xfs_iflock_nowait(ip))
-		goto out_unlock;
-
-	/* If we are in shutdown, we don't care about blocking. */
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+	if (XFS_FORCED_SHUTDOWN(mp)) {
 		xfs_iunpin_wait(ip);
-		/* xfs_iflush_abort() drops the flush lock */
 		xfs_iflush_abort(ip, false);
-		goto reclaim;
-	}
-
-	/* Can't do anything to pinned inodes without blocking, skip it. */
-	if (xfs_ipincount(ip)) {
-		*lsn = ip->i_itemp->ili_item.li_lsn;
-		goto out_ifunlock;
-	}
-
-	/*
-	 * Dirty inode we didn't catch, skip it.
-	 */
-	if (!xfs_inode_clean(ip) && !xfs_iflags_test(ip, XFS_ISTALE)) {
-		*lsn = ip->i_itemp->ili_item.li_lsn;
-		goto out_ifunlock;
+	} else {
+		xfs_ifunlock(ip);
 	}
 
-	/*
-	 * It's clean, we have it locked, we can now drop the flush lock
-	 * and reclaim it.
-	 */
-	xfs_ifunlock(ip);
-
-reclaim:
-	ASSERT(!xfs_isiflocked(ip));
-	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE));
-	ASSERT(ip->i_ino != 0);
-
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
 	 * to be reclaimed with an invalid inode number when in the free state.
@@ -1136,27 +1086,20 @@ xfs_reclaim_inode(
 	ino = ip->i_ino; /* for radix_tree_delete */
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
-
-	/* XXX: temporary until lru based reclaim */
-	list_lru_del(&pag->pag_mount->m_inode_lru, &VFS_I(ip)->i_lru);
 	spin_unlock(&ip->i_flags_lock);
-
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
-	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
+	XFS_STATS_INC(mp, xs_ig_reclaims);
+
 	/*
 	 * Remove the inode from the per-AG radix tree.
-	 *
-	 * Because radix_tree_delete won't complain even if the item was never
-	 * added to the tree assert that it's been there before to catch
-	 * problems with the inode life time early on.
 	 */
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino));
 	spin_lock(&pag->pag_ici_lock);
-	if (!radix_tree_delete(&pag->pag_ici_root,
-				XFS_INO_TO_AGINO(ip->i_mount, ino)))
+	if (!radix_tree_delete(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ino)))
 		ASSERT(0);
-	xfs_perag_clear_reclaim_tag(pag);
 	spin_unlock(&pag->pag_ici_lock);
+	xfs_perag_put(pag);
 
 	/*
 	 * Here we do an (almost) spurious inode lock in order to coordinate
@@ -1165,167 +1108,52 @@ xfs_reclaim_inode(
 	 *
 	 * We make that OK here by ensuring that we wait until the inode is
 	 * unlocked after the lookup before we go ahead and free it.
+	 *
+	 * XXX: need to check this is still true. Not sure it is.
 	 */
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_qm_dqdetach(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	__xfs_inode_free(ip);
-	return true;
-
-out_ifunlock:
-	xfs_ifunlock(ip);
-out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-out:
-	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	return false;
 }
 
-/*
- * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
- * corrupted, we still want to try to reclaim all the inodes. If we don't,
- * then a shut down during filesystem unmount reclaim walk leak all the
- * unreclaimed inodes.
- *
- * Return the number of inodes freed.
- */
-STATIC int
-xfs_reclaim_inodes_ag(
-	struct xfs_mount	*mp,
-	int			flags,
-	int			nr_to_scan)
+void
+xfs_dispose_inodes(
+	struct list_head	*freeable)
 {
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		ag;
-	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
-	long			freed = 0;
-
-	ag = 0;
-	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
-		unsigned long	first_index = 0;
-		int		done = 0;
-		int		nr_found = 0;
-
-		ag = pag->pag_agno + 1;
-		do {
-			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
-			int	i;
-
-			mutex_lock(&pag->pag_ici_reclaim_lock);
-			first_index = pag->pag_ici_reclaim_cursor;
+	struct inode		*inode;
 
-			rcu_read_lock();
-			nr_found = radix_tree_gang_lookup_tag(
-					&pag->pag_ici_root,
-					(void **)batch, first_index,
-					XFS_LOOKUP_BATCH,
-					XFS_ICI_RECLAIM_TAG);
-			if (!nr_found)
-				done = 1;
-
-			/*
-			 * Grab the inodes before we drop the lock. if we found
-			 * nothing, nr == 0 and the loop will be skipped.
-			 */
-			for (i = 0; i < nr_found; i++) {
-				struct xfs_inode *ip = batch[i];
-
-				if (done ||
-				    !xfs_reclaim_inode_grab(ip, flags, &lsn))
-					batch[i] = NULL;
-
-				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
-					lowest_lsn = lsn;
-
-				/*
-				 * Update the index for the next lookup. Catch
-				 * overflows into the next AG range which can
-				 * occur if we have inodes in the last block of
-				 * the AG and we are currently pointing to the
-				 * last inode.
-				 *
-				 * Because we may see inodes that are from the
-				 * wrong AG due to RCU freeing and
-				 * reallocation, only update the index if it
-				 * lies in this AG. It was a race that lead us
-				 * to see this inode, so another lookup from
-				 * the same index will not find it again.
-				 */
-				if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
-								pag->pag_agno)
-					continue;
-				first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
-				if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
-					done = 1;
-			}
-
-			/* unlock now we've grabbed the inodes. */
-			rcu_read_unlock();
-			if (!done)
-				pag->pag_ici_reclaim_cursor = first_index;
-			else
-				pag->pag_ici_reclaim_cursor = 0;
-			mutex_unlock(&pag->pag_ici_reclaim_lock);
-
-			for (i = 0; i < nr_found; i++) {
-				if (!batch[i])
-					continue;
-				if (xfs_reclaim_inode(batch[i], pag, &lsn))
-					freed++;
-				if (lsn && (lowest_lsn == NULLCOMMITLSN ||
-				            XFS_LSN_CMP(lsn, lowest_lsn) < 0))
-					lowest_lsn = lsn;
-			}
-
-			nr_to_scan -= XFS_LOOKUP_BATCH;
-			cond_resched();
-
-		} while (nr_found && !done && nr_to_scan > 0);
-
-		xfs_perag_put(pag);
+	while ((inode = list_first_entry_or_null(freeable,
+						 struct inode, i_lru))) {
+		list_del_init(&inode->i_lru);
+		xfs_dispose_inode(XFS_I(inode));
+		cond_resched();
 	}
-
-	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
-		xfs_ail_push_sync(mp->m_ail, lowest_lsn);
-
-	return freed;
 }
-
 void
 xfs_reclaim_all_inodes(
 	struct xfs_mount	*mp)
 {
-	xfs_reclaim_inodes_ag(mp, SYNC_WAIT, INT_MAX);
-}
-
-/*
- * Scan a certain number of inodes for reclaim.
- *
- * When called we make sure that there is a background (fast) inode reclaim in
- * progress, while we will throttle the speed of reclaim via doing synchronous
- * reclaim of inodes. That means if we come across dirty inodes, we wait for
- * them to be cleaned, which we hope will not be very long due to the
- * background walker having already kicked the IO off on those dirty inodes.
- */
-long
-xfs_reclaim_inodes_nr(
-	struct xfs_mount	*mp,
-	int			nr_to_scan)
-{
-	int			sync_mode = 0;
-
-	/*
-	 * For kswapd, we kick background inode writeback. For direct
-	 * reclaim, we issue and wait on inode writeback to throttle
-	 * reclaim rates and avoid shouty OOM-death.
-	 */
-	if (current_is_kswapd())
-		xfs_ail_push_all(mp->m_ail);
-	else
-		sync_mode |= SYNC_WAIT;
-
-	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
+	while (list_lru_count(&mp->m_inode_lru)) {
+		struct xfs_ireclaim_args ra;
+		long freed, to_free;
+
+		xfs_ireclaim_args_init(&ra);
+
+		to_free = list_lru_count(&mp->m_inode_lru);
+		freed = list_lru_walk(&mp->m_inode_lru,
+				      xfs_inode_reclaim_isolate, &ra, to_free);
+		xfs_dispose_inodes(&ra.freeable);
+
+		if (freed == 0) {
+			xfs_log_force(mp, XFS_LOG_SYNC);
+			xfs_ail_push_all(mp->m_ail);
+		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
+			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
+		}
+		cond_resched();
+	}
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index afd692b06c13..86e858e4a281 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -49,8 +49,24 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
 void xfs_inode_free(struct xfs_inode *ip);
 
+struct xfs_ireclaim_args {
+	struct list_head	freeable;
+	xfs_lsn_t		lowest_lsn;
+	unsigned long		dirty_skipped;
+};
+
+static inline void
+xfs_ireclaim_args_init(struct xfs_ireclaim_args *ra)
+{
+	INIT_LIST_HEAD(&ra->freeable);
+	ra->lowest_lsn = NULLCOMMITLSN;
+	ra->dirty_skipped = 0;
+}
+
+enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
+		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg);
+void xfs_dispose_inodes(struct list_head *freeable);
 void xfs_reclaim_all_inodes(struct xfs_mount *mp);
-long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index bcfb35a9c5ca..00145debf820 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -270,6 +270,15 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
 
 extern void __xfs_iflock(struct xfs_inode *ip);
 
+static inline int __xfs_iflock_nowait(struct xfs_inode *ip)
+{
+	lockdep_assert_held(&ip->i_flags_lock);
+	if (ip->i_flags & XFS_IFLOCK)
+		return false;
+	ip->i_flags |= XFS_IFLOCK;
+	return true;
+}
+
 static inline int xfs_iflock_nowait(struct xfs_inode *ip)
 {
 	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
@@ -281,6 +290,15 @@ static inline void xfs_iflock(struct xfs_inode *ip)
 		__xfs_iflock(ip);
 }
 
+static inline void __xfs_ifunlock(struct xfs_inode *ip)
+{
+	lockdep_assert_held(&ip->i_flags_lock);
+	ASSERT(ip->i_flags & XFS_IFLOCK);
+	ip->i_flags &= ~XFS_IFLOCK;
+	smp_mb();
+	wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
+}
+
 static inline void xfs_ifunlock(struct xfs_inode *ip)
 {
 	ASSERT(xfs_isiflocked(ip));
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 98ffbe42f8ae..096ae31b5436 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -17,6 +17,7 @@
 #include "xfs_alloc.h"
 #include "xfs_fsops.h"
 #include "xfs_trans.h"
+#include "xfs_trans_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 #include "xfs_log_priv.h"
@@ -1772,23 +1773,54 @@ xfs_fs_mount(
 }
 
 static long
-xfs_fs_nr_cached_objects(
+xfs_fs_free_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
-	/* Paranoia: catch incorrect calls during mount setup or teardown */
-	if (WARN_ON_ONCE(!sb->s_fs_info))
-		return 0;
+	struct xfs_mount	*mp = XFS_M(sb);
+	struct xfs_ireclaim_args ra;
+	long			freed;
 
-	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
+	xfs_ireclaim_args_init(&ra);
+
+	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
+					xfs_inode_reclaim_isolate, &ra);
+	xfs_dispose_inodes(&ra.freeable);
+
+	/*
+	 * Deal with dirty inodes. We will have the LSN of
+	 * the oldest dirty inode in our reclaim args if we skipped any.
+	 *
+	 * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than
+	 * we freed) then we need kswapd to back off once it's scan has been
+	 * completed. That way it will have some clean inodes once it comes back
+	 * and can make progress, but make sure we have inode cleaning in
+	 * progress.
+	 *
+	 * Direct reclaim will be throttled by the caller as it winds the
+	 * priority up. All we need to do is keep pushing on dirty inodes
+	 * in the background so when we come back progress will be made.
+	 */
+	if (current_is_kswapd() && ra.dirty_skipped >= freed) {
+		if (current->reclaim_state)
+			current->reclaim_state->need_backoff = true;
+	}
+	if (ra.lowest_lsn != NULLCOMMITLSN)
+		xfs_ail_push(mp->m_ail, ra.lowest_lsn);
+
+	return freed;
 }
 
 static long
-xfs_fs_free_cached_objects(
+xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
-	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
+	/* Paranoia: catch incorrect calls during mount setup or teardown */
+	if (WARN_ON_ONCE(!sb->s_fs_info))
+		return 0;
+
+	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
 }
 
 static const struct super_operations xfs_super_operations = {
-- 
2.24.0.rc0


* [PATCH 25/28] xfs: remove unusued old inode reclaim code
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (23 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 24/28] xfs: reclaim inodes from the LRU Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-06 17:21   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes Dave Chinner
                   ` (2 subsequent siblings)
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Now that the custom AG radix tree walker has been replaced and
removed, we don't need the radix tree tags anymore, nor the reclaim
cursors or the locks that protect them. Remove all remaining traces of
these things.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 82 +--------------------------------------------
 fs/xfs/xfs_icache.h |  7 ++--
 fs/xfs/xfs_mount.c  |  4 ---
 fs/xfs/xfs_mount.h  |  3 --
 fs/xfs/xfs_super.c  |  5 +--
 5 files changed, 6 insertions(+), 95 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 05dd292bfdb6..71a729e29260 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -139,83 +139,6 @@ xfs_inode_free(
 	__xfs_inode_free(ip);
 }
 
-static void
-xfs_perag_set_reclaim_tag(
-	struct xfs_perag	*pag)
-{
-	struct xfs_mount	*mp = pag->pag_mount;
-
-	lockdep_assert_held(&pag->pag_ici_lock);
-	if (pag->pag_ici_reclaimable++)
-		return;
-
-	/* propagate the reclaim tag up into the perag radix tree */
-	spin_lock(&mp->m_perag_lock);
-	radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno,
-			   XFS_ICI_RECLAIM_TAG);
-	spin_unlock(&mp->m_perag_lock);
-
-	trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
-}
-
-static void
-xfs_perag_clear_reclaim_tag(
-	struct xfs_perag	*pag)
-{
-	struct xfs_mount	*mp = pag->pag_mount;
-
-	lockdep_assert_held(&pag->pag_ici_lock);
-	if (--pag->pag_ici_reclaimable)
-		return;
-
-	/* clear the reclaim tag from the perag radix tree */
-	spin_lock(&mp->m_perag_lock);
-	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno,
-			     XFS_ICI_RECLAIM_TAG);
-	spin_unlock(&mp->m_perag_lock);
-	trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
-}
-
-
-/*
- * We set the inode flag atomically with the radix tree tag.
- * Once we get tag lookups on the radix tree, this inode flag
- * can go away.
- */
-void
-xfs_inode_set_reclaim_tag(
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	spin_lock(&pag->pag_ici_lock);
-	spin_lock(&ip->i_flags_lock);
-
-	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			   XFS_ICI_RECLAIM_TAG);
-	xfs_perag_set_reclaim_tag(pag);
-	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
-
-	list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
-
-	spin_unlock(&ip->i_flags_lock);
-	spin_unlock(&pag->pag_ici_lock);
-	xfs_perag_put(pag);
-}
-
-STATIC void
-xfs_inode_clear_reclaim_tag(
-	struct xfs_perag	*pag,
-	xfs_ino_t		ino)
-{
-	radix_tree_tag_clear(&pag->pag_ici_root,
-			     XFS_INO_TO_AGINO(pag->pag_mount, ino),
-			     XFS_ICI_RECLAIM_TAG);
-	xfs_perag_clear_reclaim_tag(pag);
-}
-
 static void
 xfs_inew_wait(
 	struct xfs_inode	*ip)
@@ -397,18 +320,16 @@ xfs_iget_cache_hit(
 			goto out_error;
 		}
 
-		spin_lock(&pag->pag_ici_lock);
-		spin_lock(&ip->i_flags_lock);
 
 		/*
 		 * Clear the per-lifetime state in the inode as we are now
 		 * effectively a new inode and need to return to the initial
 		 * state before reuse occurs.
 		 */
+		spin_lock(&ip->i_flags_lock);
 		ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 		ip->i_flags |= XFS_INEW;
 		list_lru_del(&mp->m_inode_lru, &inode->i_lru);
-		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
 		inode->i_state = I_NEW;
 		ip->i_sick = 0;
 		ip->i_checked = 0;
@@ -417,7 +338,6 @@ xfs_iget_cache_hit(
 		init_rwsem(&inode->i_rwsem);
 
 		spin_unlock(&ip->i_flags_lock);
-		spin_unlock(&pag->pag_ici_lock);
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 86e858e4a281..ec646b9e88b7 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -25,9 +25,8 @@ struct xfs_eofblocks {
  */
 #define XFS_ICI_NO_TAG		(-1)	/* special flag for an untagged lookup
 					   in xfs_inode_ag_iterator */
-#define XFS_ICI_RECLAIM_TAG	0	/* inode is to be reclaimed */
-#define XFS_ICI_EOFBLOCKS_TAG	1	/* inode has blocks beyond EOF */
-#define XFS_ICI_COWBLOCKS_TAG	2	/* inode can have cow blocks to gc */
+#define XFS_ICI_EOFBLOCKS_TAG	0	/* inode has blocks beyond EOF */
+#define XFS_ICI_COWBLOCKS_TAG	1	/* inode can have cow blocks to gc */
 
 /*
  * Flags for xfs_iget()
@@ -68,8 +67,6 @@ enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
 void xfs_dispose_inodes(struct list_head *freeable);
 void xfs_reclaim_all_inodes(struct xfs_mount *mp);
 
-void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
-
 void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip);
 void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip);
 int xfs_icache_free_eofblocks(struct xfs_mount *, struct xfs_eofblocks *);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 5f3fd1d8f63f..9d60a4e033a0 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -148,7 +148,6 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -200,7 +199,6 @@ xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
-		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
@@ -242,7 +240,6 @@ xfs_initialize_perag(
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
 out_free_pag:
-	mutex_destroy(&pag->pag_ici_reclaim_lock);
 	kmem_free(pag);
 out_unwind_new_pags:
 	/* unwind any prior newly initialized pags */
@@ -252,7 +249,6 @@ xfs_initialize_perag(
 			break;
 		xfs_buf_hash_destroy(pag);
 		xfs_iunlink_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4f153ee17e18..dea05cd867bf 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -343,9 +343,6 @@ typedef struct xfs_perag {
 
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
-	int		pag_ici_reclaimable;	/* reclaimable inodes */
-	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
-	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
 	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 096ae31b5436..d2200fbce139 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -916,7 +916,6 @@ xfs_fs_destroy_inode(
 	spin_lock(&ip->i_flags_lock);
 	ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
 	ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIM));
-	spin_unlock(&ip->i_flags_lock);
 
 	/*
 	 * We always use background reclaim here because even if the
@@ -925,7 +924,9 @@ xfs_fs_destroy_inode(
 	 * this more efficiently than we can here, so simply let background
 	 * reclaim tear down all inodes.
 	 */
-	xfs_inode_set_reclaim_tag(ip);
+	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
+	list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
+	spin_unlock(&ip->i_flags_lock);
 }
 
 static void
-- 
2.24.0.rc0


* [PATCH 26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (24 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 25/28] xfs: remove unusued old inode reclaim code Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-06 17:22   ` Brian Foster
  2019-10-31 23:46 ` [PATCH 27/28] rwsem: introduce down/up_write_non_owner Dave Chinner
  2019-10-31 23:46 ` [PATCH 28/28] xfs: rework unreferenced inode lookups Dave Chinner
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

If we are reclaiming all inodes, it is likely we need to flush the
entire AIL to do that. We have mechanisms to do that without needing
to push to a specific LSN.

Convert xfs_reclaim_all_inodes() to use the xfs_ail_push_all() variant so
we can get rid of the hacky xfs_ail_push_sync() scaffolding we used
to support the intermediate stages of the non-blocking reclaim
changeset.
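
To make the control flow easier to follow, here is a toy userspace model
of the reworked loop. It is only a sketch: every toy_* name is
hypothetical and merely stands in for the kernel function named in its
comment, and "time passing" is collapsed into a single helper that
completes all in-flight writeback.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model; every name here is hypothetical, not a kernel API. */
enum toy_state { TOY_CLEAN, TOY_DIRTY, TOY_PINNED, TOY_FLUSHING, TOY_FREED };

struct toy_cache {
	enum toy_state	*inodes;
	size_t		count;
	unsigned int	log_forces;	/* stand-in for xfs_log_force() calls */
	unsigned int	backoffs;	/* stand-in for congestion_wait() calls */
};

/* Stand-in for xfs_ail_push_all(): start writeback on every dirty inode. */
void toy_push_all(struct toy_cache *c)
{
	for (size_t i = 0; i < c->count; i++)
		if (c->inodes[i] == TOY_DIRTY)
			c->inodes[i] = TOY_FLUSHING;
}

/* Time passes: writeback that was in flight completes. */
void toy_io_completes(struct toy_cache *c)
{
	for (size_t i = 0; i < c->count; i++)
		if (c->inodes[i] == TOY_FLUSHING)
			c->inodes[i] = TOY_CLEAN;
}

/* Stand-in for a synchronous log force: unpin inodes so they can be written. */
void toy_log_force(struct toy_cache *c)
{
	for (size_t i = 0; i < c->count; i++)
		if (c->inodes[i] == TOY_PINNED)
			c->inodes[i] = TOY_DIRTY;
	toy_io_completes(c);
	c->log_forces++;
}

/* One LRU walk: free clean inodes, count the ones skipped as not clean. */
size_t toy_walk(struct toy_cache *c, size_t *dirty_skipped)
{
	size_t freed = 0;

	*dirty_skipped = 0;
	for (size_t i = 0; i < c->count; i++) {
		if (c->inodes[i] == TOY_FREED)
			continue;
		if (c->inodes[i] == TOY_CLEAN) {
			c->inodes[i] = TOY_FREED;
			freed++;
		} else {
			(*dirty_skipped)++;
		}
	}
	return freed;
}

size_t toy_remaining(const struct toy_cache *c)
{
	size_t n = 0;

	for (size_t i = 0; i < c->count; i++)
		if (c->inodes[i] != TOY_FREED)
			n++;
	return n;
}

/* Same shape as the reworked loop: push everything, walk, then wait. */
void toy_reclaim_all(struct toy_cache *c)
{
	while (toy_remaining(c)) {
		size_t dirty_skipped, freed;

		toy_push_all(c);
		freed = toy_walk(c, &dirty_skipped);

		if (freed == 0) {
			toy_log_force(c);
		} else if (dirty_skipped) {
			toy_io_completes(c);	/* models the backoff sleep */
			c->backoffs++;
		}
	}
	assert(toy_remaining(c) == 0);
}
```

In this model a cache holding one clean, one dirty and one pinned inode
drains completely without ever pushing to a specific LSN, which is the
property the conversion relies on.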

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c     | 17 +++++++++++------
 fs/xfs/xfs_trans_ail.c  | 32 --------------------------------
 fs/xfs/xfs_trans_priv.h |  2 --
 3 files changed, 11 insertions(+), 40 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 71a729e29260..11bf4768d491 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -25,6 +25,7 @@
 #include "xfs_log.h"
 
 #include <linux/iversion.h>
+#include <linux/backing-dev.h>	/* for congestion_wait() */
 
 /*
  * Allocate and initialise an xfs_inode.
@@ -1051,6 +1052,10 @@ xfs_dispose_inodes(
 		cond_resched();
 	}
 }
+
+/*
+ * Reclaim all unused inodes in the filesystem.
+ */
 void
 xfs_reclaim_all_inodes(
 	struct xfs_mount	*mp)
@@ -1059,6 +1064,9 @@ xfs_reclaim_all_inodes(
 		struct xfs_ireclaim_args ra;
 		long freed, to_free;
 
+		/* push the AIL to clean dirty reclaimable inodes */
+		xfs_ail_push_all(mp->m_ail);
+
 		xfs_ireclaim_args_init(&ra);
 
 		to_free = list_lru_count(&mp->m_inode_lru);
@@ -1066,13 +1074,10 @@ xfs_reclaim_all_inodes(
 				      xfs_inode_reclaim_isolate, &ra, to_free);
 		xfs_dispose_inodes(&ra.freeable);
 
-		if (freed == 0) {
+		if (freed == 0)
 			xfs_log_force(mp, XFS_LOG_SYNC);
-			xfs_ail_push_all(mp->m_ail);
-		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
-			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
-		}
-		cond_resched();
+		else if (ra.dirty_skipped)
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
 	}
 }
 
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 3e1d0e1439e2..685a21cd24c0 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -662,36 +662,6 @@ xfs_ail_push_all(
 		xfs_ail_push(ailp, threshold_lsn);
 }
 
-/*
- * Push the AIL to a specific lsn and wait for it to complete.
- */
-void
-xfs_ail_push_sync(
-	struct xfs_ail		*ailp,
-	xfs_lsn_t		threshold_lsn)
-{
-	struct xfs_log_item	*lip;
-	DEFINE_WAIT(wait);
-
-	spin_lock(&ailp->ail_lock);
-	while ((lip = xfs_ail_min(ailp)) != NULL) {
-		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
-		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
-		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
-			break;
-		if (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
-			ailp->ail_target = threshold_lsn;
-		wake_up_process(ailp->ail_task);
-		spin_unlock(&ailp->ail_lock);
-		schedule();
-		spin_lock(&ailp->ail_lock);
-	}
-	spin_unlock(&ailp->ail_lock);
-
-	finish_wait(&ailp->ail_push, &wait);
-}
-
-
 /*
  * Push out all items in the AIL immediately and wait until the AIL is empty.
  */
@@ -732,7 +702,6 @@ xfs_ail_update_finish(
 	if (!XFS_FORCED_SHUTDOWN(mp))
 		xlog_assign_tail_lsn_locked(mp);
 
-	wake_up_all(&ailp->ail_push);
 	if (list_empty(&ailp->ail_head))
 		wake_up_all(&ailp->ail_empty);
 	spin_unlock(&ailp->ail_lock);
@@ -889,7 +858,6 @@ xfs_trans_ail_init(
 	spin_lock_init(&ailp->ail_lock);
 	INIT_LIST_HEAD(&ailp->ail_buf_list);
 	init_waitqueue_head(&ailp->ail_empty);
-	init_waitqueue_head(&ailp->ail_push);
 
 	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->ail_mount->m_fsname);
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 1b6f4bbd47c0..35655eac01a6 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -61,7 +61,6 @@ struct xfs_ail {
 	int			ail_log_flush;
 	struct list_head	ail_buf_list;
 	wait_queue_head_t	ail_empty;
-	wait_queue_head_t	ail_push;
 };
 
 /*
@@ -114,7 +113,6 @@ xfs_trans_ail_remove(
 }
 
 void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
-void			xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t);
 void			xfs_ail_push_all(struct xfs_ail *);
 void			xfs_ail_push_all_sync(struct xfs_ail *);
 struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
-- 
2.24.0.rc0


* [PATCH 27/28] rwsem: introduce down/up_write_non_owner
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (25 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-10-31 23:46 ` [PATCH 28/28] xfs: rework unreferenced inode lookups Dave Chinner
  27 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

To serialise freeing of inodes against unreferenced lookups, XFS
wants to hold the inode locked from the reclaim context that queues
it for RCU freeing until the grace period that actually frees the
inode. This means the inode is being unlocked by a context that
didn't lock it, and that makes lockdep unhappy.

This is a very special use case - inodes can be found once marked
for reclaim because of lockless RCU lookups, so we need some
synchronisation that will prevent such inodes from being locked.  To
access an unreferenced inode we need to take the ILOCK rwsem without
blocking and still under rcu_read_lock() to hold off reclaim of the
inode. If the inode has been reclaimed and is queued for freeing,
holding the ILOCK rwsem until the RCU grace period expires means
no lookup that finds it in that grace period will be able to lock it
and use it. Once the grace period expires we are guaranteed that
nothing will ever find the inode again, and we can unlock it and
free it.

This requires down_write_trylock_non_owner() in the reclaim context
before we mark the inode as reclaimed and run call_rcu() to free it.
It requires up_write_non_owner() in the RCU callback before we free
the inode.
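
As a rough illustration of why owner tracking gets in the way, here is a
single-threaded toy model of a write lock that records which context took
it. All toy_* names and the integer context ids are hypothetical. The
model simply skips the owner bookkeeping in the non-owner variants; the
patch itself handles the owner field differently (up_write_non_owner()
calls rwsem_set_owner() before __up_write()).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded model; all names are hypothetical, not kernel APIs. */
#define TOY_NO_OWNER	(-1)

struct toy_rwsem {
	bool	write_locked;
	int	owner;	/* context id recorded for owner checks */
};

void toy_init_rwsem(struct toy_rwsem *sem)
{
	sem->write_locked = false;
	sem->owner = TOY_NO_OWNER;
}

/* Normal trylock: records the locking context as the owner. */
bool toy_down_write_trylock(struct toy_rwsem *sem, int ctx)
{
	if (sem->write_locked)
		return false;
	sem->write_locked = true;
	sem->owner = ctx;
	return true;
}

/* Normal unlock: refuses to unlock on behalf of a different context. */
bool toy_up_write(struct toy_rwsem *sem, int ctx)
{
	if (!sem->write_locked || sem->owner != ctx)
		return false;	/* where owner debugging would complain */
	sem->write_locked = false;
	sem->owner = TOY_NO_OWNER;
	return true;
}

/* Non-owner trylock: takes the lock without recording an owner. */
bool toy_down_write_trylock_non_owner(struct toy_rwsem *sem)
{
	if (sem->write_locked)
		return false;
	sem->write_locked = true;
	sem->owner = TOY_NO_OWNER;
	return true;
}

/* Non-owner unlock: any context may release the lock. */
void toy_up_write_non_owner(struct toy_rwsem *sem)
{
	assert(sem->write_locked);
	sem->write_locked = false;
	sem->owner = TOY_NO_OWNER;
}
```

With the normal pair, a lock taken by context 1 cannot be released by
context 2. With the non-owner pair, the reclaim context can take the lock
and a later callback can drop it, while every trylock in between fails.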

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/rwsem.h  |  6 ++++++
 kernel/locking/rwsem.c | 23 +++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 00d6054687dd..e557bd994d0e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -191,6 +191,9 @@ do {								\
  */
 extern void down_read_non_owner(struct rw_semaphore *sem);
 extern void up_read_non_owner(struct rw_semaphore *sem);
+extern void down_write_non_owner(struct rw_semaphore *sem);
+extern int down_write_trylock_non_owner(struct rw_semaphore *sem);
+extern void up_write_non_owner(struct rw_semaphore *sem);
 #else
 # define down_read_nested(sem, subclass)		down_read(sem)
 # define down_write_nest_lock(sem, nest_lock)	down_write(sem)
@@ -198,6 +201,9 @@ extern void up_read_non_owner(struct rw_semaphore *sem);
 # define down_write_killable_nested(sem, subclass)	down_write_killable(sem)
 # define down_read_non_owner(sem)		down_read(sem)
 # define up_read_non_owner(sem)			up_read(sem)
+# define down_write_non_owner(sem)		down_write(sem)
+# define down_write_trylock_non_owner(sem)	down_write_trylock(sem)
+# define up_write_non_owner(sem)		up_write(sem)
 #endif
 
 #endif /* _LINUX_RWSEM_H */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index eef04551eae7..36162d42fe09 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1654,4 +1654,27 @@ void up_read_non_owner(struct rw_semaphore *sem)
 }
 EXPORT_SYMBOL(up_read_non_owner);
 
+void down_write_non_owner(struct rw_semaphore *sem)
+{
+	might_sleep();
+	__down_write(sem);
+}
+EXPORT_SYMBOL(down_write_non_owner);
+
+/*
+ * trylock for writing -- returns 1 if successful, 0 if contention
+ */
+int down_write_trylock_non_owner(struct rw_semaphore *sem)
+{
+	return __down_write_trylock(sem);
+}
+EXPORT_SYMBOL(down_write_trylock_non_owner);
+
+void up_write_non_owner(struct rw_semaphore *sem)
+{
+	rwsem_set_owner(sem);
+	__up_write(sem);
+}
+EXPORT_SYMBOL(up_write_non_owner);
+
 #endif
-- 
2.24.0.rc0


* [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
                   ` (26 preceding siblings ...)
  2019-10-31 23:46 ` [PATCH 27/28] rwsem: introduce down/up_write_non_owner Dave Chinner
@ 2019-10-31 23:46 ` Dave Chinner
  2019-11-06 22:18   ` Brian Foster
  27 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-10-31 23:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Looking up an unreferenced inode in the inode cache is a bit hairy.
We do this for inode invalidation and writeback clustering purposes,
which is all invisible to the VFS. Hence we can't take reference
counts to the inode and so must be very careful how we do it.

There are several different places that all do the lookups and
checks slightly differently. Fundamentally, though, they are all
racy and inode reclaim has to block waiting for the inode lock if it
loses the race. This is not very optimal given all the work we've
already done to make reclaim non-blocking.

We can make the reclaim process nonblocking with a couple of simple
changes. If we define the unreferenced lookup process in a way that
will either always grab an inode in a way that reclaim will notice
and skip, or will notice a reclaim has grabbed the inode so it can
skip the inode, then there is no need for reclaim to cycle
the inode ILOCK at all.

Selecting an inode for reclaim is already non-blocking, so if the
ILOCK is held the inode will be skipped. If we ensure that reclaim
holds the ILOCK until the inode is freed, then we can do the same
thing in the unreferenced lookup to avoid inodes in reclaim. We can
do this simply by holding the ILOCK until the RCU grace period
expires and the inode freeing callback is run. As all unreferenced
lookups have to hold the rcu_read_lock(), we are guaranteed that
a reclaimed inode will be noticed as the trylock will fail.
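
The mutual-skip protocol described above can be sketched in userspace as
follows. This is a single-threaded toy model, not the patch code: the
toy_* names are hypothetical, a plain bool stands in for the ILOCK, and
the batch walk only loosely mirrors what a cluster lookup does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy single-threaded model; all names are hypothetical, not kernel APIs. */
struct toy_inode {
	bool	ilocked;	/* stand-in for the ILOCK held exclusively */
	bool	in_reclaim;	/* reclaim owns the inode until it is freed */
};

bool toy_ilock_nowait(struct toy_inode *ip)
{
	if (ip->ilocked)
		return false;
	ip->ilocked = true;
	return true;
}

void toy_iunlock(struct toy_inode *ip)
{
	assert(ip->ilocked);
	ip->ilocked = false;
}

/* Reclaim side: skip inodes that someone else has locked. */
bool toy_reclaim_grab(struct toy_inode *ip)
{
	if (!toy_ilock_nowait(ip))
		return false;
	ip->in_reclaim = true;	/* lock stays held until the inode is freed */
	return true;
}

/* Lookup side: only touch inodes whose trylock succeeds. */
size_t toy_lookup_batch(struct toy_inode *batch, size_t nr)
{
	size_t used = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!toy_ilock_nowait(&batch[i]))
			continue;	/* held by reclaim or another user: skip */
		assert(!batch[i].in_reclaim);
		used++;
		toy_iunlock(&batch[i]);
	}
	return used;
}
```

Whichever side wins the trylock, the other side skips the inode, so
neither has to block waiting for the lock.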


Additional research notes on final reclaim locking before free
--------------------------------------------------------------

2016: 1f2dcfe89eda ("xfs: xfs_inode_free() isn't RCU safe")

Fixes situation where the inode is found during RCU lookup within
the freeing grace period, but critical structures have already been
freed. Lookup code with this problem includes
xfs_iflush_cluster().


2008: 455486b9ccdd ("[XFS] avoid all reclaimable inodes in xfs_sync_inodes_ag")

Prior to this commit, the flushing of inodes required serialisation
with xfs_ireclaim(), which did this lock/unlock thingy to ensure
that it waited for flushing in xfs_sync_inodes_ag() to complete
before freeing the inode:

                /*
-                * If we can't get a reference on the VFS_I, the inode must be
-                * in reclaim. If we can get the inode lock without blocking,
-                * it is safe to flush the inode because we hold the tree lock
-                * and xfs_iextract will block right now. Hence if we lock the
-                * inode while holding the tree lock, xfs_ireclaim() is
-                * guaranteed to block on the inode lock we now hold and hence
-                * it is safe to reference the inode until we drop the inode
-                * locks completely.
+                * If we can't get a reference on the inode, it must be
+                * in reclaim. Leave it for the reclaim code to flush.
                 */

This case is completely gone from the modern code.

lock/unlock exists at start of git era. Switching to archive tree.

This xfs_sync() functionality goes back to 1994 when inode
writeback was first introduced by:

47ac6d60 ("Add support to xfs_ireclaim() needed for xfs_sync().")

So it has been there forever - let's see if we can get rid of it.

State of existing code:

- xfs_iflush_cluster() does not check for XFS_IRECLAIM inode flag
  while holding rcu_read_lock()/i_flags_lock, so doesn't avoid
  reclaimable inodes or inodes that are in the process of being reclaimed.
  Inodes at this point of reclaim are clean, so if xfs_iflush_cluster
  wins the race to the ILOCK, then inode reclaim has to wait
  for the lock to be dropped by xfs_iflush_cluster() once it detects
  the inode is clean.

- xfs_ifree_cluster() has similar logic based around XFS_ISTALE, which
  results in similar race conditions that require inode reclaim to
  cycle the ILOCK to serialise against them.

- xfs_inode_ag_walk() uses xfs_inode_ag_walk_grab(), and it checks
  XFS_IRECLAIM under RCU. It then tries to take a reference to the
  VFS inode via igrab(), which will fail if the inode is either
  XFS_IRECLAIMABLE or XFS_IRECLAIM, and if it races then igrab() will
  fail because the inode has I_FREEING still set, so it's protected
  against reclaim races.

That leaves xfs_iflush_cluster() + xfs_ifree_cluster() to be
modified to do reclaim-safe lookups. W.r.t. new inode reclaim LRU
isolate function:

	1. inode can be referenced while rcu_read_lock() is held.

	2. XFS_IRECLAIM means inode has been fully locked down and
	   has been placed on the dispose list, and will be freed soon.
		- ilock_nowait() will fail once IRECLAIM is set due
		  to lock order in isolation code.

	3. ip->i_ino == 0 means it's been removed from the dispose
	   list and is about to or has been removed from the radix
	   tree and may have already been queued on the rcu freeing
	   list to be freed at the end of the current grace period.

		- the old xfs_ireclaim() code will have dropped the
		  ILOCK here, and so there's a race between checking
		  IRECLAIM, grabbing ilock_nowait() and reclaim
		  freeing the inode.
		- this is what the spurious lock/unlock avoids.

	4. if xfs_ilock_nowait() fails before the rcu grace period
	   expires, it doesn't matter if we race between checking
	   IRECLAIM and failing the lock attempt. In fact, we don't
	   even have to check XFS_IRECLAIM - just failing
	   xfs_ilock_nowait() is sufficient to avoid inodes being
	   reclaimed.

	   Hence when xfs_ilock_nowait() fails, we can either drop the
	   rcu_read_lock at that point and restart the inode lookup,
	   or we just skip the inode altogether. If we raced with
	   reclaim, the retry will not find the inode in reclaim
	   again. If we raced with some other lock holder, then
	   we'll find the inode and try to lock it again.

		- Requires holding ILOCK into rcu freeing callback
		  and dropping it there. i.e. inode to be reclaimed
		  remains locked until grace period expires.
		- No window at all between IRECLAIM being set and
		  visible to other CPUs and the inode being removed
		  from the cache and freed where ilock_nowait will
		  succeed.
		- simple, effective, reliable.
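
The difference between points 3 and 4 can be modelled in a few lines. The
sketch below is a hypothetical single-threaded model (the toy_* names are
not kernel APIs) that compares dropping the lock before the deferred free
with holding it into the free callback. It deliberately leaves out the
IRECLAIM checks and the lock/unlock cycle that the existing code uses to
close the window.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded model; all names are hypothetical, not kernel APIs. */
struct toy_inode {
	bool	ilocked;		/* stand-in for the ILOCK */
	bool	queued_for_free;	/* handed to the deferred-free queue */
	bool	freed;
};

bool toy_ilock_nowait(struct toy_inode *ip)
{
	if (ip->ilocked)
		return false;
	ip->ilocked = true;
	return true;
}

/*
 * Reclaim queues the inode for deferred freeing. If hold_lock is false the
 * lock is dropped first, leaving a window before the inode is freed.
 */
void toy_reclaim(struct toy_inode *ip, bool hold_lock)
{
	bool locked = toy_ilock_nowait(ip);

	assert(locked);
	(void)locked;
	ip->queued_for_free = true;
	if (!hold_lock)
		ip->ilocked = false;
}

/* Deferred-free callback: runs once the grace period has expired. */
void toy_free_callback(struct toy_inode *ip)
{
	assert(ip->queued_for_free);
	ip->ilocked = false;
	ip->freed = true;
}

/*
 * Unreferenced lookup inside the grace period. Returns true if it managed
 * to lock an inode that is already queued for freeing, i.e. it hit the race.
 */
bool toy_lookup_hits_race(struct toy_inode *ip)
{
	if (!toy_ilock_nowait(ip))
		return false;	/* trylock failed: inode is safely skipped */
	return ip->queued_for_free;
}
```

When the lock is dropped early the lookup can lock an inode that is about
to be freed; when it is held until the callback runs, the trylock fails
for the whole grace period and no extra flag check is needed.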

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/mrlock.h     |  27 +++++++++
 fs/xfs/xfs_icache.c |  88 +++++++++++++++++++++--------
 fs/xfs/xfs_inode.c  | 131 +++++++++++++++++++++-----------------------
 3 files changed, 153 insertions(+), 93 deletions(-)

diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h
index 79155eec341b..1752a2592bcc 100644
--- a/fs/xfs/mrlock.h
+++ b/fs/xfs/mrlock.h
@@ -75,4 +75,31 @@ static inline void mrdemote(mrlock_t *mrp)
 	downgrade_write(&mrp->mr_lock);
 }
 
+/* special locking cases for inode reclaim */
+static inline void mrupdate_non_owner(mrlock_t *mrp)
+{
+	down_write_non_owner(&mrp->mr_lock);
+#if defined(DEBUG) || defined(XFS_WARN)
+	mrp->mr_writer = 1;
+#endif
+}
+
+static inline int mrtryupdate_non_owner(mrlock_t *mrp)
+{
+	if (!down_write_trylock_non_owner(&mrp->mr_lock))
+		return 0;
+#if defined(DEBUG) || defined(XFS_WARN)
+	mrp->mr_writer = 1;
+#endif
+	return 1;
+}
+
+static inline void mrunlock_excl_non_owner(mrlock_t *mrp)
+{
+#if defined(DEBUG) || defined(XFS_WARN)
+	mrp->mr_writer = 0;
+#endif
+	up_write_non_owner(&mrp->mr_lock);
+}
+
 #endif /* __XFS_SUPPORT_MRLOCK_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 11bf4768d491..45ee3b5cd873 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -106,6 +106,7 @@ xfs_inode_free_callback(
 		ip->i_itemp = NULL;
 	}
 
+	mrunlock_excl_non_owner(&ip->i_lock);
 	kmem_zone_free(xfs_inode_zone, ip);
 }
 
@@ -132,6 +133,7 @@ xfs_inode_free(
 	 * free state. The ip->i_flags_lock provides the barrier against lookup
 	 * races.
 	 */
+	mrupdate_non_owner(&ip->i_lock);
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
@@ -295,11 +297,24 @@ xfs_iget_cache_hit(
 		}
 
 		/*
-		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
-		 * from stomping over us while we recycle the inode. Remove it
-		 * from the LRU straight away so we can re-init the VFS inode.
+		 * Before we reinitialise the inode, we need to make sure
+		 * reclaim does not pull it out from underneath us. We already
+		 * hold the i_flags_lock, and because the XFS_IRECLAIM is not
+		 * set we know the inode is still on the LRU. However, the LRU
+		 * code may have just selected this inode to reclaim, so we need
+		 * to ensure we hold the i_flags_lock long enough for the
+		 * trylock in xfs_inode_reclaim_isolate() to fail. We do this by
+		 * removing the inode from the LRU, which will spin on the LRU
+		 * list locks until reclaim stops walking, at which point we
+		 * know there is no possible race between reclaim isolation and
+		 * this lookup.
+		 *
+		 * We also set the XFS_IRECLAIM flag here while trying to do the
+		 * re-initialisation to prevent multiple racing lookups on this
+		 * inode from all landing here at the same time.
 		 */
 		ip->i_flags |= XFS_IRECLAIM;
+		list_lru_del(&mp->m_inode_lru, &inode->i_lru);
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 
@@ -314,6 +329,7 @@ xfs_iget_cache_hit(
 			spin_lock(&ip->i_flags_lock);
 			wake = !!__xfs_iflags_test(ip, XFS_INEW);
 			ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
+			list_lru_add(&mp->m_inode_lru, &inode->i_lru);
 			if (wake)
 				wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
 			ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
@@ -330,7 +346,6 @@ xfs_iget_cache_hit(
 		spin_lock(&ip->i_flags_lock);
 		ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 		ip->i_flags |= XFS_INEW;
-		list_lru_del(&mp->m_inode_lru, &inode->i_lru);
 		inode->i_state = I_NEW;
 		ip->i_sick = 0;
 		ip->i_checked = 0;
@@ -610,8 +625,7 @@ xfs_icache_inode_is_allocated(
 /*
  * The inode lookup is done in batches to keep the amount of lock traffic and
  * radix tree lookups to a minimum. The batch size is a trade off between
- * lookup reduction and stack usage. This is in the reclaim path, so we can't
- * be too greedy.
+ * lookup reduction and stack usage.
  */
 #define XFS_LOOKUP_BATCH	32
 
@@ -916,8 +930,13 @@ xfs_inode_reclaim_isolate(
 		goto out_unlock_flags;
 	}
 
+	/*
+	 * If we are going to reclaim this inode, it will be unlocked by the
+	 * RCU callback and so is in a different context. Hence we need to use
+	 * special non-owner trylock semantics for XFS_ILOCK_EXCL here.
+	 */
 	ret = LRU_SKIP;
-	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+	if (!mrtryupdate_non_owner(&ip->i_lock))
 		goto out_unlock_flags;
 
 	if (!__xfs_iflock_nowait(ip)) {
@@ -960,7 +979,7 @@ xfs_inode_reclaim_isolate(
 out_ifunlock:
 	__xfs_ifunlock(ip);
 out_unlock_inode:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	mrunlock_excl_non_owner(&ip->i_lock);
 out_unlock_flags:
 	spin_unlock(&ip->i_flags_lock);
 
@@ -969,6 +988,41 @@ xfs_inode_reclaim_isolate(
 	return ret;
 }
 
+/*
+ * We are passed a locked inode to dispose of.
+ *
+ * To avoid race conditions with lookups that don't take references, we do
+ * not drop the XFS_ILOCK_EXCL until the RCU callback that frees the inode.
+ * This means that any attempt to lock the inode during the current RCU grace
+ * period will fail, and hence we do not need any synchronisation here to wait
+ * for code that pins unreferenced inodes with the XFS_ILOCK to drain.
+ *
+ * This requires code that takes such pins to do the following under a single
+ * rcu_read_lock() context:
+ *
+ *	- rcu_read_lock
+ *	- find the inode via radix tree lookup
+ *	- take the ip->i_flags_lock
+ *	- check ip->i_ino != 0
+ *	- check XFS_IRECLAIM is not set
+ *	- call xfs_ilock_nowait(ip, XFS_ILOCK_[SHARED|EXCL]) to lock the inode
+ *	- drop ip->i_flags_lock
+ *	- rcu_read_unlock()
+ *
+ * Only if all this succeeds does the caller have the inode locked and protected
+ * against it being freed until the ilock is released. If the XFS_IRECLAIM flag
+ * is set or xfs_ilock_nowait() fails, then the caller must either skip the
+ * inode and move on to the next inode (gang lookup) or drop the rcu_read_lock
+ * and start the entire inode lookup process again (individual lookup).
+ *
+ * This works because i_flags_lock serialises against
+ * xfs_inode_reclaim_isolate() - if the lookup wins the race on i_flags_lock and
+ * XFS_IRECLAIM is not set, then it will be able to lock the inode and hold off
+ * reclaim. If the isolate function wins the race, it will lock the inode and
+ * set the XFS_IRECLAIM flag if it is going to free the inode and this will
+ * prevent the lookup callers from succeeding in getting an unreferenced pin via
+ * the ILOCK.
+ */
 static void
 xfs_dispose_inode(
 	struct xfs_inode	*ip)
@@ -977,11 +1031,14 @@ xfs_dispose_inode(
 	struct xfs_perag	*pag;
 	xfs_ino_t		ino;
 
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(xfs_isiflocked(ip));
 	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE) ||
 	       XFS_FORCED_SHUTDOWN(mp));
 	ASSERT(ip->i_ino != 0);
 
+	XFS_STATS_INC(mp, xs_ig_reclaims);
+
 	/*
 	 * Process the shutdown reclaim work we deferred from the LRU isolation
 	 * callback before we go any further.
@@ -1008,9 +1065,6 @@ xfs_dispose_inode(
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
-	XFS_STATS_INC(mp, xs_ig_reclaims);
 
 	/*
 	 * Remove the inode from the per-AG radix tree.
@@ -1022,19 +1076,7 @@ xfs_dispose_inode(
 	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
 
-	/*
-	 * Here we do an (almost) spurious inode lock in order to coordinate
-	 * with inode cache radix tree lookups.  This is because the lookup
-	 * can reference the inodes in the cache without taking references.
-	 *
-	 * We make that OK here by ensuring that we wait until the inode is
-	 * unlocked after the lookup before we go ahead and free it.
-	 *
-	 * XXX: need to check this is still true. Not sure it is.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_qm_dqdetach(ip);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	__xfs_inode_free(ip);
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 33edb18098ca..5c0be82195fc 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2538,60 +2538,63 @@ xfs_ifree_get_one_inode(
 	if (!ip)
 		goto out_rcu_unlock;
 
+
+	spin_lock(&ip->i_flags_lock);
+	if (!ip->i_ino || ip->i_ino != inum ||
+	    __xfs_iflags_test(ip, XFS_IRECLAIM))
+		goto out_iflags_unlock;
+
 	/*
-	 * because this is an RCU protected lookup, we could find a recently
-	 * freed or even reallocated inode during the lookup. We need to check
-	 * under the i_flags_lock for a valid inode here. Skip it if it is not
-	 * valid, the wrong inode or stale.
+	 * We've got the right inode and it isn't in reclaim but it might be
+	 * locked by someone else.  In that case, we retry the inode rather than
+	 * skipping it completely as we have to process it with the cluster
+	 * being freed.
 	 */
-	spin_lock(&ip->i_flags_lock);
-	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
+	if (ip != free_ip && !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
 		spin_unlock(&ip->i_flags_lock);
-		goto out_rcu_unlock;
+		rcu_read_unlock();
+		delay(1);
+		goto retry;
 	}
-	spin_unlock(&ip->i_flags_lock);
 
 	/*
-	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
-	 * other inodes that we did not find in the list attached to the buffer
-	 * and are not already marked stale. If we can't lock it, back off and
-	 * retry.
+	 * Inode is now pinned against reclaim until we unlock it. If the inode
+	 * is already marked stale, then it has already been flush locked and
+	 * attached to the buffer so we don't need to do anything more here.
 	 */
-	if (ip != free_ip) {
-		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
-			rcu_read_unlock();
-			delay(1);
-			goto retry;
-		}
-
-		/*
-		 * Check the inode number again in case we're racing with
-		 * freeing in xfs_reclaim_inode().  See the comments in that
-		 * function for more information as to why the initial check is
-		 * not sufficient.
-		 */
-		if (ip->i_ino != inum) {
+	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
+		if (ip != free_ip)
 			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-			goto out_rcu_unlock;
-		}
+		goto out_iflags_unlock;
 	}
+	__xfs_iflags_set(ip, XFS_ISTALE);
+	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
+	/*
+	 * The flush lock will now hold off inode reclaim until the buffer
+	 * completion routine runs the xfs_istale_done callback and unlocks the
+	 * flush lock. Hence the caller can safely drop the ILOCK when it is
+	 * done attaching the inode to the cluster buffer.
+	 */
 	xfs_iflock(ip);
-	xfs_iflags_set(ip, XFS_ISTALE);
 
 	/*
-	 * We don't need to attach clean inodes or those only with unlogged
-	 * changes (which we throw away, anyway).
+	 * We don't need to attach clean inodes to the buffer - they are marked
+	 * stale in memory now and will need to be re-initialised by inode
+	 * allocation before they can be reused.
 	 */
 	if (!ip->i_itemp || xfs_inode_clean(ip)) {
 		ASSERT(ip != free_ip);
 		xfs_ifunlock(ip);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		if (ip != free_ip)
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		goto out_no_inode;
 	}
 	return ip;
 
+out_iflags_unlock:
+	spin_unlock(&ip->i_flags_lock);
 out_rcu_unlock:
 	rcu_read_unlock();
 out_no_inode:
@@ -3519,44 +3522,40 @@ xfs_iflush_cluster(
 			continue;
 
 		/*
-		 * because this is an RCU protected lookup, we could find a
-		 * recently freed or even reallocated inode during the lookup.
-		 * We need to check under the i_flags_lock for a valid inode
-		 * here. Skip it if it is not valid or the wrong inode.
+		 * See xfs_dispose_inode() for an explanation of the
+		 * tests here to avoid inode reclaim races.
 		 */
 		spin_lock(&cip->i_flags_lock);
 		if (!cip->i_ino ||
-		    __xfs_iflags_test(cip, XFS_ISTALE)) {
+		    __xfs_iflags_test(cip, XFS_IRECLAIM)) {
 			spin_unlock(&cip->i_flags_lock);
 			continue;
 		}
 
-		/*
-		 * Once we fall off the end of the cluster, no point checking
-		 * any more inodes in the list because they will also all be
-		 * outside the cluster.
-		 */
+		/* ILOCK will pin the inode against reclaim */
+		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) {
+			spin_unlock(&cip->i_flags_lock);
+			continue;
+		}
+
+		if (__xfs_iflags_test(cip, XFS_ISTALE)) {
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			spin_unlock(&cip->i_flags_lock);
+			continue;
+		}
+
+		/* Lookup can find inodes outside the cluster being flushed. */
 		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
 			spin_unlock(&cip->i_flags_lock);
 			break;
 		}
 		spin_unlock(&cip->i_flags_lock);
 
 		/*
-		 * Do an un-protected check to see if the inode is dirty and
-		 * is a candidate for flushing.  These checks will be repeated
-		 * later after the appropriate locks are acquired.
-		 */
-		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
-			continue;
-
-		/*
-		 * Try to get locks.  If any are unavailable or it is pinned,
+		 * If we can't get the flush lock now or the inode is pinned,
 		 * then this inode cannot be flushed and is skipped.
 		 */
-
-		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
-			continue;
 		if (!xfs_iflock_nowait(cip)) {
 			xfs_iunlock(cip, XFS_ILOCK_SHARED);
 			continue;
@@ -3567,22 +3566,9 @@ xfs_iflush_cluster(
 			continue;
 		}
 
-
 		/*
-		 * Check the inode number again, just to be certain we are not
-		 * racing with freeing in xfs_reclaim_inode(). See the comments
-		 * in that function for more information as to why the initial
-		 * check is not sufficient.
-		 */
-		if (!cip->i_ino) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-		/*
-		 * arriving here means that this inode can be flushed.  First
-		 * re-check that it's dirty before flushing.
+		 * Arriving here means that this inode can be flushed. First
+		 * check that it's dirty before flushing.
 		 */
 		if (!xfs_inode_clean(cip)) {
 			int	error;
@@ -3596,6 +3582,7 @@ xfs_iflush_cluster(
 			xfs_ifunlock(cip);
 		}
 		xfs_iunlock(cip, XFS_ILOCK_SHARED);
+		/* unsafe to reference cip from here */
 	}
 
 	if (clcount) {
@@ -3634,7 +3621,11 @@ xfs_iflush_cluster(
 
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 
-	/* abort the corrupt inode, as it was not attached to the buffer */
+	/*
+	 * Abort the corrupt inode, as it was not attached to the buffer. It is
+	 * unlocked, but still pinned against reclaim by the flush lock so it is
+	 * safe to reference here until after the flush abort completes.
+	 */
 	xfs_iflush_abort(cip, false);
 	kmem_free(cilist);
 	xfs_perag_put(pag);
-- 
2.24.0.rc0
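
The lookup-side rules listed in the xfs_dispose_inode() comment can be
modelled in userspace. What follows is only a minimal sketch, not the kernel
implementation: every sk_* name is hypothetical, C11 atomics stand in for
i_flags_lock and the ILOCK, and the RCU read-side section is reduced to
comments. It shows why a lookup that wins i_flags_lock holds off reclaim,
and why a lookup that loses sees either the reclaim flag or a failed trylock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_IRECLAIM	0x1u	/* hypothetical stand-in for XFS_IRECLAIM */

struct sk_inode {
	atomic_flag	flags_lock;	/* stand-in for ip->i_flags_lock */
	atomic_int	ilock;		/* 0 free, >0 shared holders, -1 exclusive */
	unsigned long	ino;		/* 0 once the inode has been disposed of */
	unsigned int	flags;
};

void sk_inode_init(struct sk_inode *ip, unsigned long ino)
{
	assert(ino != 0);
	atomic_flag_clear(&ip->flags_lock);
	atomic_init(&ip->ilock, 0);
	ip->ino = ino;
	ip->flags = 0;
}

static void sk_flags_lock(struct sk_inode *ip)
{
	while (atomic_flag_test_and_set(&ip->flags_lock))
		;	/* spin until the flag was previously clear */
}

static bool sk_flags_trylock(struct sk_inode *ip)
{
	return !atomic_flag_test_and_set(&ip->flags_lock);
}

static void sk_flags_unlock(struct sk_inode *ip)
{
	atomic_flag_clear(&ip->flags_lock);
}

static bool sk_ilock_try_shared(struct sk_inode *ip)
{
	int v = atomic_load(&ip->ilock);

	/* Only succeeds while no exclusive holder (-1) is present. */
	while (v >= 0) {
		if (atomic_compare_exchange_weak(&ip->ilock, &v, v + 1))
			return true;
	}
	return false;
}

static bool sk_ilock_try_excl(struct sk_inode *ip)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&ip->ilock, &expected, -1);
}

void sk_iunlock_shared(struct sk_inode *ip)
{
	atomic_fetch_sub(&ip->ilock, 1);
}

/* Lookup side: pin an unreferenced inode with a shared ILOCK. */
bool sk_lookup_pin(struct sk_inode *ip, unsigned long want_ino)
{
	bool pinned = false;

	/* the kernel takes rcu_read_lock() before finding ip */
	sk_flags_lock(ip);
	if (ip->ino != 0 && ip->ino == want_ino && !(ip->flags & SK_IRECLAIM))
		pinned = sk_ilock_try_shared(ip);
	sk_flags_unlock(ip);
	/* rcu_read_unlock() */
	return pinned;
}

/* Reclaim side: on success the exclusive lock stays held until disposal. */
bool sk_reclaim_isolate(struct sk_inode *ip)
{
	bool isolated;

	if (!sk_flags_trylock(ip))
		return false;
	isolated = sk_ilock_try_excl(ip);
	if (isolated)
		ip->flags |= SK_IRECLAIM;
	sk_flags_unlock(ip);
	return isolated;
}

/* Disposal: invalidate under the flags lock, then drop the lock "from RCU". */
void sk_dispose(struct sk_inode *ip)
{
	sk_flags_lock(ip);
	ip->flags = SK_IRECLAIM;
	ip->ino = 0;
	sk_flags_unlock(ip);
	atomic_store(&ip->ilock, 0);	/* the free callback's unlock in the patch */
}
```

In this model a successful sk_lookup_pin() makes sk_reclaim_isolate() fail
until sk_iunlock_shared() is called, and once isolation has succeeded every
later sk_lookup_pin() fails, first on the flag and then on the zeroed inode
number.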



* Re: [PATCH 02/28] xfs: Throttle commits on delayed background CIL push
  2019-10-31 23:45 ` [PATCH 02/28] xfs: Throttle commits on delayed background CIL push Dave Chinner
@ 2019-11-01 12:04   ` Brian Foster
  2019-11-01 21:40     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-01 12:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:45:52AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In certain situations the background CIL push can be indefinitely
> delayed. While we have workarounds for the obvious cases now, they
> don't solve the underlying issue. The issue is that there is no
> upper limit on the CIL where we will either force or wait for
> a background push to start, hence allowing the CIL to grow without
> bound until it consumes all log space.
> 
> To fix this, add a new wait queue to the CIL which allows background
> pushes to wait for the CIL context to be switched out. This happens
> when the push starts, so it will allow us to block incoming
> transaction commit completion until the push has started. This will
> only affect processes that are running modifications, and only when
> the CIL threshold has been significantly overrun.
> 
> This has no apparent impact on performance, and doesn't even trigger
> until over 45 million inodes had been created in a 16-way fsmark
> test on a 2GB log. That was limiting at 64MB of log space used, so
> the active CIL size is only about 3% of the total log in that case.
> The concurrent removal of those files did not trigger the background
> sleep at all.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---

I don't recall posting an R-b tag for this one...

That said, I think my only outstanding feedback (side discussion aside)
was the code factoring in xlog_cil_push_background().

Brian

>  fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
>  fs/xfs/xfs_trace.h    |  1 +
>  3 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index a1204424a938..fc3f8e849dec 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -670,6 +670,11 @@ xlog_cil_push(
>  	push_seq = cil->xc_push_seq;
>  	ASSERT(push_seq <= ctx->sequence);
>  
> +	/*
> +	 * Wake up any background push waiters now this context is being pushed.
> +	 */
> +	wake_up_all(&ctx->push_wait);
> +
>  	/*
>  	 * Check if we've anything to push. If there is nothing, then we don't
>  	 * move on to a new sequence number and so we have to be able to push
> @@ -746,6 +751,7 @@ xlog_cil_push(
>  	 */
>  	INIT_LIST_HEAD(&new_ctx->committing);
>  	INIT_LIST_HEAD(&new_ctx->busy_extents);
> +	init_waitqueue_head(&new_ctx->push_wait);
>  	new_ctx->sequence = ctx->sequence + 1;
>  	new_ctx->cil = cil;
>  	cil->xc_ctx = new_ctx;
> @@ -900,7 +906,7 @@ xlog_cil_push_work(
>   */
>  static void
>  xlog_cil_push_background(
> -	struct xlog	*log)
> +	struct xlog	*log) __releases(cil->xc_ctx_lock)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
>  
> @@ -914,14 +920,36 @@ xlog_cil_push_background(
>  	 * don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +		up_read(&cil->xc_ctx_lock);
>  		return;
> +	}
>  
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
>  		cil->xc_push_seq = cil->xc_current_sequence;
>  		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
>  	}
> +
> +	/*
> +	 * Drop the context lock now, we can't hold that if we need to sleep
> +	 * because we are over the blocking threshold. The push_lock is still
> +	 * held, so blocking threshold sleep/wakeup is still correctly
> +	 * serialised here.
> +	 */
> +	up_read(&cil->xc_ctx_lock);
> +
> +	/*
> +	 * If we are well over the space limit, throttle the work that is being
> +	 * done until the push work on this context has begun.
> +	 */
> +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> +		return;
> +	}
> +
>  	spin_unlock(&cil->xc_push_lock);
>  
>  }
> @@ -1038,9 +1066,9 @@ xfs_log_commit_cil(
>  		if (lip->li_ops->iop_committing)
>  			lip->li_ops->iop_committing(lip, xc_commit_lsn);
>  	}
> -	xlog_cil_push_background(log);
>  
> -	up_read(&cil->xc_ctx_lock);
> +	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
> +	xlog_cil_push_background(log);
>  }
>  
>  /*
> @@ -1199,6 +1227,7 @@ xlog_cil_init(
>  
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
> +	init_waitqueue_head(&ctx->push_wait);
>  	ctx->sequence = 1;
>  	ctx->cil = cil;
>  	cil->xc_ctx = ctx;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index abd382cfffe3..1cc5333a3f6a 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -242,6 +242,7 @@ struct xfs_cil_ctx {
>  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
> +	wait_queue_head_t	push_wait;	/* background push throttle */
>  	struct work_struct	discard_endio_work;
>  };
>  
> @@ -339,10 +340,33 @@ struct xfs_cil {
>   *   buffer window (32MB) as measurements have shown this to be roughly the
>   *   point of diminishing performance increases under highly concurrent
>   *   modification workloads.
> + *
> + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> + * new threshold at which we block committing transactions until the background
> + * CIL commit commences and switches to a new context. While this is not a hard
> + * limit, it forces the process committing a transaction to the CIL to block and
> + * yield the CPU, giving the CIL push work a chance to be scheduled and start
> + * work. This prevents a process running lots of transactions from overfilling
> + * the CIL because it is not yielding the CPU. We set the blocking limit at
> + * twice the background push space threshold so we keep in line with the AIL
> + * push thresholds.
> + *
> + * Note: this is not a -hard- limit as blocking is applied after the transaction
> + * is inserted into the CIL and the push has been triggered. It is largely a
> + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> + * limit will be difficult to implement without introducing global serialisation
> + * in the CIL commit fast path, and it's not at all clear that we actually need
> + * such hard limits given the ~7 years we've run without a hard limit before
> + * finding the first situation where a checkpoint size overflow actually
> + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> + * we've overrun the max size.
>   */
>  #define XLOG_CIL_SPACE_LIMIT(log)	\
>  	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
>  
> +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
> +	(XLOG_CIL_SPACE_LIMIT(log) * 2)
> +
>  /*
>   * ticket grant locks, queues and accounting have their own cachlines
>   * as these are quite hot and can be operated on concurrently.
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index c13bb3655e48..d3635d1e3de6 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub);
> +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
>  
>  DECLARE_EVENT_CLASS(xfs_log_item_class,
>  	TP_PROTO(struct xfs_log_item *lip),
> -- 
> 2.24.0.rc0
> 
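
For reference, the two thresholds defined in xfs_log_priv.h can be worked
through in userspace. This is only a sketch under stated assumptions: the
sk_* names are hypothetical, and the iclog window is taken to be 8 buffers
of 256KiB each, which is consistent with the 32MB window mentioned in the
comment but is not spelled out in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MAX_ICLOGS		8	/* assumed number of iclog buffers */
#define SK_MAX_RECORD_BSHIFT	18	/* assumed 256KiB per iclog buffer */

/* Background push threshold: min(log size / 8, 16 * iclog window). */
static int64_t sk_cil_space_limit(int64_t log_size)
{
	int64_t window = (int64_t)SK_MAX_ICLOGS << SK_MAX_RECORD_BSHIFT;
	int64_t by_log = log_size >> 3;
	int64_t by_window = window << 4;	/* 32MiB under the assumptions */

	assert(log_size > 0);
	return by_log < by_window ? by_log : by_window;
}

/* Blocking threshold: twice the background push threshold. */
static int64_t sk_cil_blocking_space_limit(int64_t log_size)
{
	return sk_cil_space_limit(log_size) * 2;
}
```

Under those assumptions a 2GiB log gets a 32MiB push threshold and a 64MiB
blocking threshold, which matches the 64MB (about 3% of the log) quoted in
the commit message. Smaller logs are bounded by the log size term instead.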



* Re: [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability
  2019-10-31 23:45 ` [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability Dave Chinner
@ 2019-11-01 12:05   ` Brian Foster
  2019-11-04 23:21   ` Darrick J. Wong
  1 sibling, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-01 12:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:45:54AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
> 
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with a
> shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
> 
> We also have a lot of xfs_bufs in a busy system, so spread them around
> like we do inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

I still don't see why we wouldn't set the spread flag on the bli cache
as well, but afaict it doesn't matter in most cases unless the spread
knob is enabled. Unless I'm misunderstanding how that works, I think the
commit log could be improved to describe that since to me it implies the
flag by itself has an effect, but otherwise the change seems fine:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1e63dd3d1257..d34e5d2edacd 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -324,6 +324,9 @@ xfs_buf_free(
>  
>  			__free_page(page);
>  		}
> +		if (current->reclaim_state)
> +			current->reclaim_state->reclaimed_slab +=
> +							bp->b_page_count;
>  	} else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  	_xfs_buf_free_pages(bp);
> @@ -2061,7 +2064,8 @@ int __init
>  xfs_buf_init(void)
>  {
>  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> -						KM_ZONE_HWALIGN, NULL);
> +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
> +			NULL);
>  	if (!xfs_buf_zone)
>  		goto out;
>  
> -- 
> 2.24.0.rc0
> 



* Re: [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster
  2019-10-31 23:45 ` [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
@ 2019-11-01 12:05   ` Brian Foster
  2019-11-04 23:20   ` Darrick J. Wong
  1 sibling, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-01 12:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:45:58AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There's lots of indent in this code which makes it a bit hard to
> follow. We are also going to completely rework the inode lookup code
> as part of the inode reclaim rework, so factor out the inode lookup
> code from the inode cluster freeing code.
> 
> Based on prototype code from Christoph Hellwig.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_inode.c | 152 +++++++++++++++++++++++++--------------------
>  1 file changed, 84 insertions(+), 68 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index e9e4f444f8ce..33edb18098ca 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2516,6 +2516,88 @@ xfs_iunlink_remove(
>  	return error;
>  }
>  
> +/*
> + * Look up the inode number specified and mark it stale if it is found. If it is
> + * dirty, return the inode so it can be attached to the cluster buffer so it can
> + * be processed appropriately when the cluster free transaction completes.
> + */
> +static struct xfs_inode *
> +xfs_ifree_get_one_inode(
> +	struct xfs_perag	*pag,
> +	struct xfs_inode	*free_ip,
> +	int			inum)
> +{
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode	*ip;
> +
> +retry:
> +	rcu_read_lock();
> +	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
> +
> +	/* Inode not in memory, nothing to do */
> +	if (!ip)
> +		goto out_rcu_unlock;
> +
> +	/*
> +	 * because this is an RCU protected lookup, we could find a recently
> +	 * freed or even reallocated inode during the lookup. We need to check
> +	 * under the i_flags_lock for a valid inode here. Skip it if it is not
> +	 * valid, the wrong inode or stale.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
> +		spin_unlock(&ip->i_flags_lock);
> +		goto out_rcu_unlock;
> +	}
> +	spin_unlock(&ip->i_flags_lock);
> +
> +	/*
> +	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
> +	 * other inodes that we did not find in the list attached to the buffer
> +	 * and are not already marked stale. If we can't lock it, back off and
> +	 * retry.
> +	 */
> +	if (ip != free_ip) {
> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> +			rcu_read_unlock();
> +			delay(1);
> +			goto retry;
> +		}
> +
> +		/*
> +		 * Check the inode number again in case we're racing with
> +		 * freeing in xfs_reclaim_inode().  See the comments in that
> +		 * function for more information as to why the initial check is
> +		 * not sufficient.
> +		 */
> +		if (ip->i_ino != inum) {
> +			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +			goto out_rcu_unlock;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	xfs_iflock(ip);
> +	xfs_iflags_set(ip, XFS_ISTALE);
> +
> +	/*
> +	 * We don't need to attach clean inodes or those only with unlogged
> +	 * changes (which we throw away, anyway).
> +	 */
> +	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> +		ASSERT(ip != free_ip);
> +		xfs_ifunlock(ip);
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +		goto out_no_inode;
> +	}
> +	return ip;
> +
> +out_rcu_unlock:
> +	rcu_read_unlock();
> +out_no_inode:
> +	return NULL;
> +}
> +
>  /*
>   * A big issue when freeing the inode cluster is that we _cannot_ skip any
>   * inodes that are in memory - they all must be marked stale and attached to
> @@ -2616,77 +2698,11 @@ xfs_ifree_cluster(
>  		 * even trying to lock them.
>  		 */
>  		for (i = 0; i < igeo->inodes_per_cluster; i++) {
> -retry:
> -			rcu_read_lock();
> -			ip = radix_tree_lookup(&pag->pag_ici_root,
> -					XFS_INO_TO_AGINO(mp, (inum + i)));
> -
> -			/* Inode not in memory, nothing to do */
> -			if (!ip) {
> -				rcu_read_unlock();
> +			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
> +			if (!ip)
>  				continue;
> -			}
> -
> -			/*
> -			 * because this is an RCU protected lookup, we could
> -			 * find a recently freed or even reallocated inode
> -			 * during the lookup. We need to check under the
> -			 * i_flags_lock for a valid inode here. Skip it if it
> -			 * is not valid, the wrong inode or stale.
> -			 */
> -			spin_lock(&ip->i_flags_lock);
> -			if (ip->i_ino != inum + i ||
> -			    __xfs_iflags_test(ip, XFS_ISTALE)) {
> -				spin_unlock(&ip->i_flags_lock);
> -				rcu_read_unlock();
> -				continue;
> -			}
> -			spin_unlock(&ip->i_flags_lock);
> -
> -			/*
> -			 * Don't try to lock/unlock the current inode, but we
> -			 * _cannot_ skip the other inodes that we did not find
> -			 * in the list attached to the buffer and are not
> -			 * already marked stale. If we can't lock it, back off
> -			 * and retry.
> -			 */
> -			if (ip != free_ip) {
> -				if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> -					rcu_read_unlock();
> -					delay(1);
> -					goto retry;
> -				}
> -
> -				/*
> -				 * Check the inode number again in case we're
> -				 * racing with freeing in xfs_reclaim_inode().
> -				 * See the comments in that function for more
> -				 * information as to why the initial check is
> -				 * not sufficient.
> -				 */
> -				if (ip->i_ino != inum + i) {
> -					xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -					rcu_read_unlock();
> -					continue;
> -				}
> -			}
> -			rcu_read_unlock();
>  
> -			xfs_iflock(ip);
> -			xfs_iflags_set(ip, XFS_ISTALE);
> -
> -			/*
> -			 * we don't need to attach clean inodes or those only
> -			 * with unlogged changes (which we throw away, anyway).
> -			 */
>  			iip = ip->i_itemp;
> -			if (!iip || xfs_inode_clean(ip)) {
> -				ASSERT(ip != free_ip);
> -				xfs_ifunlock(ip);
> -				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -				continue;
> -			}
> -
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> -- 
> 2.24.0.rc0
> 



* Re: [PATCH 02/28] xfs: Throttle commits on delayed background CIL push
  2019-11-01 12:04   ` Brian Foster
@ 2019-11-01 21:40     ` Dave Chinner
  2019-11-04 22:48       ` Darrick J. Wong
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-11-01 21:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 08:04:26AM -0400, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:45:52AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > In certain situations the background CIL push can be indefinitely
> > delayed. While we have workarounds for the obvious cases now, they
> > don't solve the underlying issue. The issue is that there is no
> > upper limit on the CIL where we will either force or wait for
> > a background push to start, hence allowing the CIL to grow without
> > bound until it consumes all log space.
> > 
> > To fix this, add a new wait queue to the CIL which allows background
> > pushes to wait for the CIL context to be switched out. This happens
> > when the push starts, so it will allow us to block incoming
> > transaction commit completion until the push has started. This will
> > only affect processes that are running modifications, and only when
> > the CIL threshold has been significantly overrun.
> > 
> > This has no apparent impact on performance, and doesn't even trigger
> > until over 45 million inodes had been created in a 16-way fsmark
> > test on a 2GB log. That was limiting at 64MB of log space used, so
> > the active CIL size is only about 3% of the total log in that case.
> > The concurrent removal of those files did not trigger the background
> > sleep at all.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > ---
> 
> I don't recall posting an R-b tag for this one...

Argh, sorry. I must have screwed up transcribing them from the
mailing list.

> That said, I think my only outstanding feedback (side discussion aside)
> was the code factoring in xlog_cil_push_background().

I'll go back and look at that, 'cause clearly I was looking at the
wrong patch when I screwed up the rvb tag...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 11/28] mm: factor shrinker work calculations
  2019-10-31 23:46 ` [PATCH 11/28] mm: factor shrinker work calculations Dave Chinner
@ 2019-11-02 10:55   ` kbuild test robot
  2019-11-04 15:29   ` Brian Foster
  1 sibling, 0 replies; 72+ messages in thread
From: kbuild test robot @ 2019-11-02 10:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kbuild-all, linux-xfs, linux-fsdevel, linux-mm, linux-kernel


Hi Dave,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on v5.4-rc5 next-20191031]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Dave-Chinner/xfs-Lower-CIL-flush-limit-for-large-logs/20191102-153137
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: parisc-c3000_defconfig (attached as .config)
compiler: hppa-linux-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=parisc 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from ./arch/parisc/include/generated/asm/div64.h:1:0,
                    from include/linux/kernel.h:18,
                    from arch/parisc/include/asm/bug.h:5,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/vmscan.c:17:
   mm/vmscan.c: In function 'shrink_scan_count':
   include/asm-generic/div64.h:226:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> mm/vmscan.c:502:3: note: in expansion of macro 'do_div'
      do_div(delta, shrinker->seeks);
      ^~~~~~

vim +/do_div +502 mm/vmscan.c

   460	
   461	/*
   462	 * Calculate the number of new objects to scan this time around. Return
   463	 * the work to be done. If there are freeable objects, return that number in
   464	 * @freeable_objects.
   465	 */
   466	static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
   467				    struct shrinker *shrinker, int priority,
   468				    int64_t *freeable_objects)
   469	{
   470		int64_t delta;
   471		int64_t freeable;
   472	
   473		freeable = shrinker->count_objects(shrinker, shrinkctl);
   474		if (freeable == 0 || freeable == SHRINK_EMPTY)
   475			return freeable;
   476	
   477		if (shrinker->seeks) {
   478			/*
   479			 * shrinker->seeks is a measure of how much IO is required to
   480			 * reinstantiate the object in memory. The default value is 2
   481			 * which is typical for a cold inode requiring a directory read
   482			 * and an inode read to re-instantiate.
   483			 *
   484			 * The scan batch size is defined by the shrinker priority, but
   485			 * to be able to bias the reclaim we increase the default batch
   486			 * size by 4. Hence we end up with a scan batch multipler that
   487			 * scales like so:
   488			 *
   489			 * ->seeks	scan batch multiplier
   490			 *    1		      4.00x
   491			 *    2               2.00x
   492			 *    3               1.33x
   493			 *    4               1.00x
   494			 *    8               0.50x
   495			 *
   496			 * IOWs, the more seeks it takes to pull the item into cache,
   497			 * the smaller the reclaim scan batch. Hence we put more reclaim
   498			 * pressure on caches that are fast to repopulate and to keep a
   499			 * rough balance between caches that have different costs.
   500			 */
   501			delta = freeable >> (priority - 2);
 > 502			do_div(delta, shrinker->seeks);
   503		} else {
   504			/*
   505			 * These objects don't require any IO to create. Trim them
   506			 * aggressively under memory pressure to keep them from causing
   507			 * refetches in the IO caches.
   508			 */
   509			delta = freeable / 2;
   510		}
   511	
   512		*freeable_objects = freeable;
   513		return delta > 0 ? delta : 0;
   514	}
   515	
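
For reference, the warning comes from the type check quoted above from
include/asm-generic/div64.h:226, which compares a pointer to the
dividend's type against uint64_t *. In shrink_scan_count() delta is
declared int64_t, so the two pointer types differ. A minimal user-space
sketch of the mismatch and one way around it follows. model_do_div() and
scan_delta_div() are hypothetical names used only for illustration, and
the sketch assumes the dividend is never negative at this point.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space stand-in for do_div(): divides *n in place and returns the
 * remainder. Taking a uint64_t * makes the "unsigned 64-bit dividend only"
 * requirement explicit in the signature.
 */
static uint32_t model_do_div(uint64_t *n, uint32_t base)
{
	uint32_t rem;

	assert(base != 0);
	rem = (uint32_t)(*n % base);
	*n /= base;
	return rem;
}

/*
 * Divide a non-negative signed count by seeks without handing a signed
 * variable to the unsigned-only helper.
 */
static int64_t scan_delta_div(int64_t delta, uint32_t seeks)
{
	uint64_t tmp;

	assert(delta >= 0);
	tmp = (uint64_t)delta;
	model_do_div(&tmp, seeks);
	return (int64_t)tmp;
}
```

In the kernel itself, keeping delta unsigned or using a signed division
helper such as div_s64() would be the likely equivalents; which one fits
depends on whether negative counts have to be representable.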

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 15685 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/28] mm: directed shrinker work deferral
  2019-10-31 23:45 ` [PATCH 09/28] mm: directed shrinker work deferral Dave Chinner
@ 2019-11-04 15:25   ` Brian Foster
  2019-11-14 20:49     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-04 15:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Introduce a mechanism for ->count_objects() to indicate to the
> shrinker infrastructure that the reclaim context will not allow
> scanning work to be done and so the work it decides is necessary
> needs to be deferred.
> 
> This simplifies the code by separating out the accounting of
> deferred work from the actual doing of the work, and allows better
> decisions to be made by the shrinker control logic on what action it
> can take.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

My understanding from the previous discussion(s) is that this is not
tied directly to the gfp mask because that is not the only intended use.
While it is currently a boolean tied to the entire shrinker call,
the longer term objective is per-object granularity.

I find the argument reasonable enough, but if the above is true, why do
we move these checks from ->scan_objects() to ->count_objects() (in the
next patch) when per-object decisions will ultimately need to be made by
the former? That seems like unnecessary churn and inconsistent with the
argument against just temporarily doing something like what Christoph
suggested in the previous version, particularly since IIRC the only use
in this series was for gfp mask purposes.

>  include/linux/shrinker.h | 7 +++++++
>  mm/vmscan.c              | 8 ++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 0f80123650e2..3405c39ab92c 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -31,6 +31,13 @@ struct shrink_control {
>  
>  	/* current memcg being shrunk (for memcg aware shrinkers) */
>  	struct mem_cgroup *memcg;
> +
> +	/*
> +	 * set by ->count_objects if reclaim context prevents reclaim from
> +	 * occurring. This allows the shrinker to immediately defer all the
> +	 * work and not even attempt to scan the cache.
> +	 */
> +	bool defer_work;
>  };
>  
>  #define SHRINK_STOP (~0UL)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ee4eecc7e1c2..a215d71d9d4b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -536,6 +536,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
>  				   freeable, delta, total_scan, priority);
>  
> +	/*
> +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> +	 * defer the work to a context that can scan the cache.
> +	 */
> +	if (shrinkctl->defer_work)
> +		goto done;
> +

I still find the fact that this per-shrinker invocation field is never
reset unnecessarily fragile, and I don't see any good reason not to
reset it prior to the shrinker callback that potentially sets it.

Brian

>  	/*
>  	 * Normally, we should not scan less than batch_size objects in one
>  	 * pass to avoid too frequent shrinker calls, but if the slab has less
> @@ -570,6 +577,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		cond_resched();
>  	}
>  
> +done:
>  	if (next_deferred >= scanned)
>  		next_deferred -= scanned;
>  	else
> -- 
> 2.24.0.rc0
> 
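
The reset suggested above can be sketched in user space as follows. This
is a simplified model, not the kernel code: shrink_control_model,
model_count_objects() and model_shrink_defers() are hypothetical
stand-ins, and can_reclaim is an assumed substitute for the gfp_mask
check a real ->count_objects() would make.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct shrink_control; only the fields used here. */
struct shrink_control_model {
	bool can_reclaim;	/* assumed substitute for the gfp_mask check */
	bool defer_work;	/* set by count_objects when it cannot scan */
};

/* Hypothetical ->count_objects(): flags deferral but never clears it. */
static long model_count_objects(struct shrink_control_model *sc)
{
	if (!sc->can_reclaim)
		sc->defer_work = true;
	return 100;
}

/*
 * Returns true if this invocation would skip scanning. Clearing
 * defer_work before the callback means a stale value left behind by an
 * earlier call on the same control structure cannot leak into this one.
 */
static bool model_shrink_defers(struct shrink_control_model *sc)
{
	sc->defer_work = false;
	model_count_objects(sc);
	return sc->defer_work;
}
```

Without the explicit clear, correctness relies on every caller handing
in a freshly initialised control structure, which is the fragility being
pointed out.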


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 11/28] mm: factor shrinker work calculations
  2019-10-31 23:46 ` [PATCH 11/28] mm: factor shrinker work calculations Dave Chinner
  2019-11-02 10:55   ` kbuild test robot
@ 2019-11-04 15:29   ` Brian Foster
  2019-11-14 20:59     ` Dave Chinner
  1 sibling, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-04 15:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:01AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Start to clean up the shrinker code by factoring out the calculation
> that determines how much work to do. This separates the calculation
> from clamping and other adjustments that are done before the
> shrinker work is run. Document the scan batch size calculation
> better while we are there.
> 
> Also convert the calculation for the amount of work to be done to
> use 64 bit logic so we don't have to keep jumping through hoops to
> keep calculations within 32 bits on 32 bit systems.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

I assume the kbuild warning thing will be fixed up...

>  mm/vmscan.c | 97 ++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 70 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a215d71d9d4b..2d39ec37c04d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -459,13 +459,68 @@ EXPORT_SYMBOL(unregister_shrinker);
>  
>  #define SHRINK_BATCH 128
>  
> +/*
> + * Calculate the number of new objects to scan this time around. Return
> + * the work to be done. If there are freeable objects, return that number in
> + * @freeable_objects.
> + */
> +static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
> +			    struct shrinker *shrinker, int priority,
> +			    int64_t *freeable_objects)
> +{
> +	int64_t delta;
> +	int64_t freeable;
> +
> +	freeable = shrinker->count_objects(shrinker, shrinkctl);
> +	if (freeable == 0 || freeable == SHRINK_EMPTY)
> +		return freeable;
> +
> +	if (shrinker->seeks) {
> +		/*
> +		 * shrinker->seeks is a measure of how much IO is required to
> +		 * reinstantiate the object in memory. The default value is 2
> +		 * which is typical for a cold inode requiring a directory read
> +		 * and an inode read to re-instantiate.
> +		 *
> +		 * The scan batch size is defined by the shrinker priority, but
> +		 * to be able to bias the reclaim we increase the default batch
> +		 * size by 4. Hence we end up with a scan batch multipler that
> +		 * scales like so:
> +		 *
> +		 * ->seeks	scan batch multiplier
> +		 *    1		      4.00x
> +		 *    2               2.00x
> +		 *    3               1.33x
> +		 *    4               1.00x
> +		 *    8               0.50x
> +		 *
> +		 * IOWs, the more seeks it takes to pull the item into cache,
> +		 * the smaller the reclaim scan batch. Hence we put more reclaim
> +		 * pressure on caches that are fast to repopulate and to keep a
> +		 * rough balance between caches that have different costs.
> +		 */
> +		delta = freeable >> (priority - 2);

Does anything prevent priority < 2 here?

> +		do_div(delta, shrinker->seeks);
> +	} else {
> +		/*
> +		 * These objects don't require any IO to create. Trim them
> +		 * aggressively under memory pressure to keep them from causing
> +		 * refetches in the IO caches.
> +		 */
> +		delta = freeable / 2;
> +	}
> +
> +	*freeable_objects = freeable;
> +	return delta > 0 ? delta : 0;
> +}
> +
>  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  				    struct shrinker *shrinker, int priority)
>  {
>  	unsigned long freed = 0;
> -	unsigned long long delta;
>  	long total_scan;
> -	long freeable;
> +	int64_t freeable_objects = 0;
> +	int64_t scan_count;
>  	long nr;
>  	long new_nr;
>  	int nid = shrinkctl->nid;
...
> @@ -487,25 +543,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 */
>  	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>  
> -	total_scan = nr;
> -	if (shrinker->seeks) {
> -		delta = freeable >> priority;
> -		delta *= 4;
> -		do_div(delta, shrinker->seeks);
> -	} else {
> -		/*
> -		 * These objects don't require any IO to create. Trim
> -		 * them aggressively under memory pressure to keep
> -		 * them from causing refetches in the IO caches.
> -		 */
> -		delta = freeable / 2;
> -	}
> -
> -	total_scan += delta;
> +	total_scan = nr + scan_count;
>  	if (total_scan < 0) {
>  		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
>  		       shrinker->scan_objects, total_scan);
> -		total_scan = freeable;
> +		total_scan = scan_count;

Same question as before: why the change in assignment? freeable was the
->count_objects() return value, which is now stored in freeable_objects.

FWIW, the change seems to make sense in that it just factors out the
deferred count, but it's not clear if it's intentional...

Brian

>  		next_deferred = nr;
>  	} else
>  		next_deferred = total_scan;
> @@ -522,19 +564,20 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * Hence only allow the shrinker to scan the entire cache when
>  	 * a large delta change is calculated directly.
>  	 */
> -	if (delta < freeable / 4)
> -		total_scan = min(total_scan, freeable / 2);
> +	if (scan_count < freeable_objects / 4)
> +		total_scan = min_t(long, total_scan, freeable_objects / 2);
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
>  	 * never try to free more than twice the estimate number of
>  	 * freeable entries.
>  	 */
> -	if (total_scan > freeable * 2)
> -		total_scan = freeable * 2;
> +	if (total_scan > freeable_objects * 2)
> +		total_scan = freeable_objects * 2;
>  
>  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> -				   freeable, delta, total_scan, priority);
> +				   freeable_objects, scan_count,
> +				   total_scan, priority);
>  
>  	/*
>  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> @@ -559,7 +602,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * possible.
>  	 */
>  	while (total_scan >= batch_size ||
> -	       total_scan >= freeable) {
> +	       total_scan >= freeable_objects) {
>  		unsigned long ret;
>  		unsigned long nr_to_scan = min(batch_size, total_scan);
>  
> -- 
> 2.24.0.rc0
> 
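
On the priority < 2 question: a negative shift count is undefined
behaviour in C, so "freeable >> (priority - 2)" needs some guard if
priority can ever be 0 or 1. The sketch below is one possible guard, not
what the patch does. scan_batch() is a hypothetical name, and the left
shift assumes freeable is small enough not to overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical guarded form of "freeable >> (priority - 2)" followed by
 * the division by seeks. For priority >= 2 it matches the expression in
 * the patch; for priority < 2 it shifts left instead of using a negative
 * shift count.
 */
static int64_t scan_batch(int64_t freeable, int priority, uint32_t seeks)
{
	int64_t delta;

	assert(priority >= 0 && seeks > 0);
	/* Keep the left shift below from overflowing a signed 64-bit value. */
	assert(freeable >= 0 && freeable <= (INT64_MAX >> 2));

	if (priority >= 2)
		delta = freeable >> (priority - 2);
	else
		delta = freeable << (2 - priority);
	return delta / seeks;
}
```

With freeable = 8192 and priority = 12, freeable >> priority is 2, and
the results for seeks of 1, 2, 4 and 8 come out as 8, 4, 2 and 1, which
lines up with the 4.00x, 2.00x, 1.00x and 0.50x rows of the table in the
comment.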


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 12/28] shrinker: defer work only to kswapd
  2019-10-31 23:46 ` [PATCH 12/28] shrinker: defer work only to kswapd Dave Chinner
@ 2019-11-04 15:29   ` Brian Foster
  2019-11-14 21:11     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-04 15:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:02AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Right now deferred work is picked up by whatever GFP_KERNEL context
> reclaimer that wins the race to empty the node's deferred work
> counter. However, if there are lots of direct reclaimers, that
> work might be continually picked up by contexts that can't do any
> work and so the opportunities to do the work are missed by contexts
> that could do them.
> 
> A further problem with the current code is that the deferred work
> can be picked up by a random direct reclaimer, resulting in that
> specific process having to do all the deferred reclaim work and
> hence can take extremely long latencies if the reclaim work blocks
> regularly. This is not good for direct reclaim fairness or for
> minimising long tail latency events.
> 
> To avoid these problems, simply limit deferred work to kswapd
> contexts. We know kswapd is a context that can always do reclaim
> work, and hence deferring work to kswapd allows the deferred work to
> be done in the background and not adversely affect any specific
> process context doing direct reclaim.
> 
> The advantage of this is that the amount of work to be done in direct
> reclaim is now bound and predictable - it is entirely based on
> the cache's freeable objects and the reclaim priority. Hence all
> direct reclaimers running at the same time should be doing
> relatively equal amounts of work, thereby reducing the incidence of
> long tail latencies due to uneven reclaim workloads.
> 
> Note that we use signed integers for everything except the freed
> count as the returns from the shrinker callouts cannot be guaranteed
> untainted. Indeed, the shrinkers can return scan counts larger than
> were fed in, so we need scan counts to underflow in a detectable
> manner to terminate loops. This is necessary to avoid a misbehaving
> shrinker from triggering endless scanning loops.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/shrinker.h |   2 +-
>  mm/vmscan.c              | 100 ++++++++++++++++++++-------------------
>  2 files changed, 53 insertions(+), 49 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 3405c39ab92c..30c10f42109f 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -81,7 +81,7 @@ struct shrinker {
>  	int id;
>  #endif
>  	/* objs pending delete, per node */
> -	atomic_long_t *nr_deferred;
> +	atomic64_t *nr_deferred;
>  };
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2d39ec37c04d..c0e2bf656e3f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -517,16 +517,16 @@ static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
>  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  				    struct shrinker *shrinker, int priority)
>  {
> -	unsigned long freed = 0;
> -	long total_scan;
> +	uint64_t freed = 0;
>  	int64_t freeable_objects = 0;
>  	int64_t scan_count;
> -	long nr;
> -	long new_nr;
> +	int64_t scanned_objects = 0;
> +	int64_t next_deferred = 0;
> +	int64_t deferred_count = 0;
> +	int64_t new_nr;
>  	int nid = shrinkctl->nid;
>  	long batch_size = shrinker->batch ? shrinker->batch
>  					  : SHRINK_BATCH;
> -	long scanned = 0, next_deferred;
>  
>  	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>  		nid = 0;
> @@ -537,47 +537,51 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		return scan_count;
>  
>  	/*
> -	 * copy the current shrinker scan count into a local variable
> -	 * and zero it so that other concurrent shrinker invocations
> -	 * don't also do this scanning work.
> -	 */
> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> -
> -	total_scan = nr + scan_count;
> -	if (total_scan < 0) {
> -		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> -		       shrinker->scan_objects, total_scan);
> -		total_scan = scan_count;
> -		next_deferred = nr;
> -	} else
> -		next_deferred = total_scan;
> -
> -	/*
> -	 * We need to avoid excessive windup on filesystem shrinkers
> -	 * due to large numbers of GFP_NOFS allocations causing the
> -	 * shrinkers to return -1 all the time. This results in a large
> -	 * nr being built up so when a shrink that can do some work
> -	 * comes along it empties the entire cache due to nr >>>
> -	 * freeable. This is bad for sustaining a working set in
> -	 * memory.
> +	 * If kswapd, we take all the deferred work and do it here. We don't let
> +	 * direct reclaim do this, because then it means some poor sod is going
> +	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
> +	 * amount of reclaim work from concurrent kswapd operations. Hence we do
> +	 * the work in the wrong place, at the wrong time, and it's largely
> +	 * unpredictable.
>  	 *
> -	 * Hence only allow the shrinker to scan the entire cache when
> -	 * a large delta change is calculated directly.
> +	 * By doing the deferred work only in kswapd, we can schedule the work
> +	 * according the the reclaim priority - low priority reclaim will do
> +	 * less deferred work, hence we'll do more of the deferred work the more
> +	 * desperate we become for free memory. This avoids the need for needing
> +	 * to specifically avoid deferred work windup as low amount os memory
> +	 * pressure won't excessive trim caches anymore.

That last sentence is hard to read. ;)

>  	 */
> -	if (scan_count < freeable_objects / 4)
> -		total_scan = min_t(long, total_scan, freeable_objects / 2);
> +	if (current_is_kswapd()) {
> +		int64_t	deferred_scan;
> +
> +		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
> +
> +		/* we want to scan 5-10% of the deferred work here at minimum */
> +		deferred_scan = deferred_count;
> +		if (priority)
> +			do_div(deferred_scan, priority);
> +		scan_count += deferred_scan;
> +
> +		/*
> +		 * If there is more deferred work than the number of freeable
> +		 * items in the cache, limit the amount of work we will carry
> +		 * over to the next kswapd run on this cache. This prevents
> +		 * deferred work windup.
> +		 */
> +		deferred_count = min(deferred_count, freeable_objects * 2);
> +

Extra whitespace above.

> +	}
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
>  	 * never try to free more than twice the estimate number of
>  	 * freeable entries.
>  	 */

The comment refers to a variable that no longer exists.

I also wonder if it's a little cleaner to move the deferred_count =
min(...); statement above down here and condense the two comments.

> -	if (total_scan > freeable_objects * 2)
> -		total_scan = freeable_objects * 2;
> +	scan_count = min(scan_count, freeable_objects * 2);
>  
> -	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> +	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
>  				   freeable_objects, scan_count,
> -				   total_scan, priority);
> +				   scan_count, priority);
>  
>  	/*
>  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> @@ -601,10 +605,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * scanning at high prio and therefore should try to reclaim as much as
>  	 * possible.
>  	 */
> -	while (total_scan >= batch_size ||
> -	       total_scan >= freeable_objects) {
> +	while (scan_count >= batch_size ||
> +	       scan_count >= freeable_objects) {
>  		unsigned long ret;
> -		unsigned long nr_to_scan = min(batch_size, total_scan);
> +		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
>  
>  		shrinkctl->nr_to_scan = nr_to_scan;
>  		shrinkctl->nr_scanned = nr_to_scan;
> @@ -614,29 +618,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		freed += ret;
>  
>  		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
> -		total_scan -= shrinkctl->nr_scanned;
> -		scanned += shrinkctl->nr_scanned;
> +		scan_count -= shrinkctl->nr_scanned;
> +		scanned_objects += shrinkctl->nr_scanned;
>  
>  		cond_resched();
>  	}
> -
>  done:
> -	if (next_deferred >= scanned)
> -		next_deferred -= scanned;
> +	if (deferred_count)
> +		next_deferred = deferred_count - scanned_objects;
>  	else
> -		next_deferred = 0;
> +		next_deferred = scan_count;

Hmm.. so if there was no deferred count on this cycle, we set
next_deferred to whatever is left from scan_count and add that back into
the shrinker struct below. If there was a pending deferred count on this
cycle, we subtract what we scanned from that and add that value back.
But what happens to the remaining scan_count in the latter case? Is it
lost, or am I missing something?

For example, suppose we start this cycle with a large scan_count and
->scan_objects() returned SHRINK_STOP before doing much work. In that
scenario, it looks like whether ->nr_deferred is 0 or not is the only
thing that determines whether we defer the entire remaining scan_count
or just what is left from the previous ->nr_deferred. The existing code
appears to consistently factor in what is left from the current scan
with the previous deferred count. Hm?

>  	/*
>  	 * move the unused scan count back into the shrinker in a
>  	 * manner that handles concurrent updates. If we exhausted the
>  	 * scan, there is no need to do an update.
>  	 */
>  	if (next_deferred > 0)
> -		new_nr = atomic_long_add_return(next_deferred,
> +		new_nr = atomic64_add_return(next_deferred,
>  						&shrinker->nr_deferred[nid]);
>  	else
> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> +		new_nr = atomic64_read(&shrinker->nr_deferred[nid]);

It looks like we could kill new_nr and just reuse next_deferred here
too.

Brian

>  
> -	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
> +	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
> +					scan_count);
>  	return freed;
>  }
>  
> -- 
> 2.24.0.rc0
> 
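
The next_deferred question above can be made concrete with a small
user-space model of the two accounting schemes. The function names are
hypothetical, and the model leaves out the negative total_scan fixup and
the clamps, so it is a sketch of the arithmetic only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Accounting at "done:" as written in this patch. scan_remaining is
 * scan_count after the scan loop has subtracted what it scanned.
 */
static int64_t next_deferred_new(int64_t deferred_count,
				 int64_t scan_remaining, int64_t scanned)
{
	assert(scanned >= 0);
	if (deferred_count)
		return deferred_count - scanned;
	return scan_remaining;
}

/*
 * Accounting before this patch: the previously deferred count (nr) and
 * this pass's delta are summed, then reduced by what was scanned.
 */
static int64_t next_deferred_old(int64_t nr, int64_t delta, int64_t scanned)
{
	int64_t next = nr + delta;

	assert(scanned >= 0);
	return next >= scanned ? next - scanned : 0;
}
```

Take a pass that wants to scan 1000 new objects but stops after 128.
With nothing previously deferred, both schemes carry 872 forward. With
500 previously deferred, the old scheme carries 500 + 1000 - 128 = 1372,
while the new one carries 500 - 128 = 372, so the unscanned part of the
current pass is not carried over in that case.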


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 13/28] shrinker: clean up variable types and tracepoints
  2019-10-31 23:46 ` [PATCH 13/28] shrinker: clean up variable types and tracepoints Dave Chinner
@ 2019-11-04 15:30   ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-04 15:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:03AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The tracepoint information in the shrinker code don't make a lot of

Nit:						  doesn't

> sense anymore and contain redundant information as a result of the
> changes in the patchset. Refine the information passed to the
> tracepoints so they expose the operation of the shrinkers more
> precisely and clean up the remaining code and varibles in the

Nit:						variables

> shrinker code so it all makes sense.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/trace/events/vmscan.h | 69 ++++++++++++++++-------------------
>  mm/vmscan.c                   | 24 +++++-------
>  2 files changed, 41 insertions(+), 52 deletions(-)
> 
...
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c0e2bf656e3f..7a8256322150 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
...
> @@ -624,23 +622,21 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		cond_resched();
>  	}
>  done:
...
>  	if (next_deferred > 0)
> -		new_nr = atomic64_add_return(next_deferred,
> -						&shrinker->nr_deferred[nid]);
> -	else
> -		new_nr = atomic64_read(&shrinker->nr_deferred[nid]);
> +		atomic64_add(next_deferred, &shrinker->nr_deferred[nid]);
>  
> -	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
> -					scan_count);
> +	trace_mm_shrink_slab_end(shrinker, nid, freed, scanned_objects,
> +				 next_deferred);

I guess this invalidates my comment on the previous patch around new_nr.
Looks Ok to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  	return freed;
>  }
>  
> -- 
> 2.24.0.rc0
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/28] mm: reclaim_state records pages reclaimed, not slabs
  2019-10-31 23:46 ` [PATCH 14/28] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
@ 2019-11-04 19:58   ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-04 19:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:04AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add a wrapper to account for page freeing in shrinker reclaim so
> that the high level scanning accounts for all the memory freed
> during a shrinker scan.
> 
> No logic changes, just replacing open coded checks with a simple
> wrapper.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks straightforward:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/inode.c           |  3 +--
>  fs/xfs/xfs_buf.c     |  4 +---
>  include/linux/swap.h | 20 ++++++++++++++++++--
>  mm/slab.c            |  3 +--
>  mm/slob.c            |  4 +---
>  mm/slub.c            |  3 +--
>  mm/vmscan.c          |  4 ++--
>  7 files changed, 25 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a42882..a77caf216659 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -764,8 +764,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  				__count_vm_events(KSWAPD_INODESTEAL, reap);
>  			else
>  				__count_vm_events(PGINODESTEAL, reap);
> -			if (current->reclaim_state)
> -				current->reclaim_state->reclaimed_slab += reap;
> +			current_reclaim_account_pages(reap);
>  		}
>  		iput(inode);
>  		spin_lock(lru_lock);
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index d34e5d2edacd..55b082bc53b3 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -324,9 +324,7 @@ xfs_buf_free(
>  
>  			__free_page(page);
>  		}
> -		if (current->reclaim_state)
> -			current->reclaim_state->reclaimed_slab +=
> -							bp->b_page_count;
> +		current_reclaim_account_pages(bp->b_page_count);
>  	} else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  	_xfs_buf_free_pages(bp);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 063c0c1e112b..72b855fe20b0 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -126,12 +126,28 @@ union swap_header {
>  
>  /*
>   * current->reclaim_state points to one of these when a task is running
> - * memory reclaim
> + * memory reclaim. It is typically used by shrinkers to return reclaim
> + * information back to the main vmscan loop.
>   */
>  struct reclaim_state {
> -	unsigned long reclaimed_slab;
> +	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
>  };
>  
> +/*
> + * When code frees a page that may be run from a memory reclaim context, it
> + * needs to account for the pages it frees so memory reclaim can track them.
> + * Slab memory that is freed is accounted via this mechanism, so this is not
> + * necessary for slab or heap memory being freed. However, if the object being
> + * freed frees pages directly, then those pages should be accounted as well when
> + * in memory reclaim. This helper function takes care accounting for the pages
> + * being reclaimed when it is required.
> + */
> +static inline void current_reclaim_account_pages(int nr_pages)
> +{
> +	if (current->reclaim_state)
> +		current->reclaim_state->reclaimed_pages += nr_pages;
> +}
> +
>  #ifdef __KERNEL__
>  
>  struct address_space;
> diff --git a/mm/slab.c b/mm/slab.c
> index 66e5d8032bae..419be005f41a 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1395,8 +1395,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
>  	page_mapcount_reset(page);
>  	page->mapping = NULL;
>  
> -	if (current->reclaim_state)
> -		current->reclaim_state->reclaimed_slab += 1 << order;
> +	current_reclaim_account_pages(1 << order);
>  	uncharge_slab_page(page, order, cachep);
>  	__free_pages(page, order);
>  }
> diff --git a/mm/slob.c b/mm/slob.c
> index fa53e9f73893..c54a7eeee86d 100644
> --- a/mm/slob.c
> +++ b/mm/slob.c
> @@ -211,9 +211,7 @@ static void slob_free_pages(void *b, int order)
>  {
>  	struct page *sp = virt_to_page(b);
>  
> -	if (current->reclaim_state)
> -		current->reclaim_state->reclaimed_slab += 1 << order;
> -
> +	current_reclaim_account_pages(1 << order);
>  	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
>  			    -(1 << order));
>  	__free_pages(sp, order);
> diff --git a/mm/slub.c b/mm/slub.c
> index b25c807a111f..478554082079 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1746,8 +1746,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
>  	__ClearPageSlab(page);
>  
>  	page->mapping = NULL;
> -	if (current->reclaim_state)
> -		current->reclaim_state->reclaimed_slab += pages;
> +	current_reclaim_account_pages(pages);
>  	uncharge_slab_page(page, order, s);
>  	__free_pages(page, order);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7a8256322150..967e3d3c7748 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2870,8 +2870,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
>  
>  		if (reclaim_state) {
> -			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -			reclaim_state->reclaimed_slab = 0;
> +			sc->nr_reclaimed += reclaim_state->reclaimed_pages;
> +			reclaim_state->reclaimed_pages = 0;
>  		}
>  
>  		/* Record the subtree's reclaim efficiency */
> -- 
> 2.24.0.rc0
> 
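The accounting change quoted above can be modelled outside the kernel. This is a minimal userspace sketch, not the kernel code: `struct reclaim_state_model`, `current_reclaim_state`, `model_reclaim_account_pages()` and `model_reclaim_collect()` are hypothetical stand-ins for `current->reclaim_state`, the new helper and the shrink_node() hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct reclaim_state. */
struct reclaim_state_model {
	unsigned long reclaimed_pages;	/* pages freed by shrinkers */
};

/* Stand-in for current->reclaim_state; NULL when the task is not in reclaim. */
static struct reclaim_state_model *current_reclaim_state;

/* One helper replaces the open-coded NULL check at every free site. */
static void model_reclaim_account_pages(int nr_pages)
{
	assert(nr_pages >= 0);
	if (current_reclaim_state)
		current_reclaim_state->reclaimed_pages += (unsigned long)nr_pages;
}

/* Models the shrink_node() hunk: fold the count into the total, then reset. */
static unsigned long model_reclaim_collect(unsigned long nr_reclaimed)
{
	if (current_reclaim_state) {
		nr_reclaimed += current_reclaim_state->reclaimed_pages;
		current_reclaim_state->reclaimed_pages = 0;
	}
	return nr_reclaimed;
}
```

Outside reclaim context the helper is a no-op, which is why the slab, slob and slub call sites no longer need their own `if (current->reclaim_state)` test.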



* Re: [PATCH 15/28] mm: back off direct reclaim on excessive shrinker deferral
  2019-10-31 23:46 ` [PATCH 15/28] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
@ 2019-11-04 19:58   ` Brian Foster
  2019-11-14 21:28     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-04 19:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:05AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When the majority of possible shrinker reclaim work is deferred by
> the shrinkers (e.g. due to GFP_NOFS context), and there is more work
> defered than LRU pages were scanned, back off reclaim if there are

  deferred

> large amounts of IO in progress.
> 
> This tends to occur when there are inode cache heavy workloads that
> have little page cache or application memory pressure on filesystems
> like XFS. Inode cache heavy workloads involve lots of IO, so if we
> are getting device congestion it is indicative of memory reclaim
> running up against an IO throughput limitation. In this situation
> we need to throttle direct reclaim as we nee dto wait for kswapd to

					   need to

> get some of the deferred work done.
> 
> However, if there is no device congestion, then the system is
> keeping up with both the workload and memory reclaim and so there's
> no need to throttle.
> 
> Hence we should only back off scanning for a bit if we see this
> condition and there is block device congestion present.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/swap.h |  2 ++
>  mm/vmscan.c          | 30 +++++++++++++++++++++++++++++-
>  2 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 72b855fe20b0..da0913e14bb9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -131,6 +131,8 @@ union swap_header {
>   */
>  struct reclaim_state {
>  	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
> +	unsigned long	scanned_objects;	/* quantity of work done */ 

Trailing whitespace at the end of the above line.

> +	unsigned long	deferred_objects;	/* work that wasn't done */
>  };
>  
>  /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 967e3d3c7748..13c11e10c9c5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -570,6 +570,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		deferred_count = min(deferred_count, freeable_objects * 2);
>  
>  	}
> +	if (current->reclaim_state)
> +		current->reclaim_state->scanned_objects += scanned_objects;

Looks like scanned_objects is always zero here.

>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
> @@ -585,8 +587,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
>  	 * defer the work to a context that can scan the cache.
>  	 */
> -	if (shrinkctl->defer_work)
> +	if (shrinkctl->defer_work) {
> +		if (current->reclaim_state)
> +			current->reclaim_state->deferred_objects += scan_count;
>  		goto done;
> +	}
>  
>  	/*
>  	 * Normally, we should not scan less than batch_size objects in one
> @@ -2871,7 +2876,30 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  		if (reclaim_state) {
>  			sc->nr_reclaimed += reclaim_state->reclaimed_pages;
> +
> +			/*
> +			 * If we are deferring more work than we are actually
> +			 * doing in the shrinkers, and we are scanning more
> +			 * objects than we are pages, then we have a large amount
> +			 * of slab caches we are deferring work to kswapd for.
> +			 * We better back off here for a while, otherwise
> +			 * we risk priority windup, swap storms and OOM kills
> +			 * once we empty the page lists but still can't make
> +			 * progress on the shrinker memory.
> +			 *
> +			 * kswapd won't ever defer work as it's run under a
> +			 * GFP_KERNEL context and can always do work.
> +			 */
> +			if ((reclaim_state->deferred_objects >
> +					sc->nr_scanned - nr_scanned) &&

Out of curiosity, what's the reasoning behind the direct comparison
between ->deferred_objects and pages? Shouldn't we generally expect more
slab objects to exist than pages by the nature of slab?

Also, the comment says "if we are scanning more objects than we are
pages," yet the code is checking whether we defer more objects than
scanned pages. Which is more accurate?

Brian

> +			    (reclaim_state->deferred_objects >
> +					reclaim_state->scanned_objects)) {
> +				wait_iff_congested(BLK_RW_ASYNC, HZ/50);
> +			}
> +
>  			reclaim_state->reclaimed_pages = 0;
> +			reclaim_state->deferred_objects = 0;
> +			reclaim_state->scanned_objects = 0;
>  		}
>  
>  		/* Record the subtree's reclaim efficiency */
> -- 
> 2.24.0.rc0
> 
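To make the comparison questioned above concrete, here is a small userspace model of the backoff test in the quoted hunk. It is a sketch only: `struct reclaim_state_model` and `should_back_off()` are hypothetical names, and `pages_scanned_this_pass` stands in for `sc->nr_scanned - nr_scanned`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct reclaim_state. */
struct reclaim_state_model {
	unsigned long scanned_objects;	/* shrinker work actually done */
	unsigned long deferred_objects;	/* shrinker work deferred to kswapd */
};

/*
 * Returns true when the quoted hunk would call wait_iff_congested():
 * more objects were deferred than LRU pages were scanned in this pass,
 * and more objects were deferred than the shrinkers actually scanned.
 */
static bool should_back_off(const struct reclaim_state_model *rs,
			    unsigned long pages_scanned_this_pass)
{
	assert(rs != NULL);
	return rs->deferred_objects > pages_scanned_this_pass &&
	       rs->deferred_objects > rs->scanned_objects;
}
```

Note that the code compares deferred objects against scanned pages, which is the mismatch with the comment raised in the review. If scanned_objects really were always zero, the second clause would add nothing: any deferred_objects value that exceeds a page count is already greater than zero.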



* Re: [PATCH 16/28] mm: kswapd backoff for shrinkers
  2019-10-31 23:46 ` [PATCH 16/28] mm: kswapd backoff for shrinkers Dave Chinner
@ 2019-11-04 19:58   ` Brian Foster
  2019-11-14 21:41     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-04 19:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:06AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When kswapd reaches the end of the page LRU and starts hitting dirty
> pages, the logic in shrink_node() allows it to back off and wait for
> IO to complete, thereby preventing kswapd from scanning excessively
> and driving the system into swap thrashing and OOM conditions.
> 
> When we have inode cache heavy workloads on XFS, we have exactly the
> same problem with reclaim inodes. The non-blocking kswapd reclaim
> will keep putting pressure onto the inode cache which is unable to
> make progress. When the system gets to the point where there are no
> pages in the LRU to free, there is no swap left and there are no
> clean inodes that can be freed, it will OOM. This has a specific
> signature in OOM:
> 
> [  110.841987] Mem-Info:
> [  110.842816] active_anon:241 inactive_anon:82 isolated_anon:1
>                 active_file:168 inactive_file:143 isolated_file:0
>                 unevictable:2621523 dirty:1 writeback:8 unstable:0
>                 slab_reclaimable:564445 slab_unreclaimable:420046
>                 mapped:1042 shmem:11 pagetables:6509 bounce:0
>                 free:77626 free_pcp:2 free_cma:0
> 
> In this case, we have about 500-600 pages left in the LRUs, but we
> have ~565000 reclaimable slab pages still available for reclaim.
> Unfortunately, they are mostly dirty inodes, and so we really need
> to be able to throttle kswapd when shrinker progress is limited due
> to reaching the dirty end of the LRU...
> 
> So, add a flag into the reclaim_state so if the shrinker decides it
> needs kswapd to back off and wait for a while (for whatever reason)
> it can do so.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/swap.h |  1 +
>  mm/vmscan.c          | 10 +++++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index da0913e14bb9..76fc28f0e483 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -133,6 +133,7 @@ struct reclaim_state {
>  	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
>  	unsigned long	scanned_objects;	/* quantity of work done */ 
>  	unsigned long	deferred_objects;	/* work that wasn't done */
> +	bool		need_backoff;		/* tell kswapd to slow down */
>  };
>  
>  /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 13c11e10c9c5..0f7d35820057 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2949,8 +2949,16 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  			 * implies that pages are cycling through the LRU
>  			 * faster than they are written so also forcibly stall.
>  			 */
> -			if (sc->nr.immediate)
> +			if (sc->nr.immediate) {
>  				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			} else if (reclaim_state && reclaim_state->need_backoff) {
> +				/*
> +				 * Ditto, but it's a slab cache that is cycling
> +				 * through the LRU faster than they are written
> +				 */
> +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +				reclaim_state->need_backoff = false;
> +			}

Seems reasonable from a functional standpoint, but why not plug in to
the existing stall instead of duplicating it? E.g., add a corresponding
->nr_immediate field to reclaim_state rather than a bool, then transfer
that to the scan_control earlier in the function where we already check
for reclaim_state and handle transferring fields (or alternatively just
leave the bool and use it to bump the scan_control field). That seems a
bit more consistent with the page processing code, keeps the
reclaim_state resets in one place and also wouldn't leave us with an
if/else here for the same stall. Hm?

Brian

>  		}
>  
>  		/*
> -- 
> 2.24.0.rc0
> 
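The alternative suggested in the review (keep the bool, but use it to bump the scan_control field) can be sketched in userspace as follows. This is an illustration under assumptions, not a proposed patch: `struct reclaim_state_model`, `struct scan_control_model`, `transfer_backoff()` and `should_stall()` are hypothetical names, and `nr_immediate` models `sc->nr.immediate`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct reclaim_state and struct scan_control. */
struct reclaim_state_model {
	bool need_backoff;		/* set by a shrinker that wants a stall */
};

struct scan_control_model {
	unsigned long nr_immediate;	/* models sc->nr.immediate */
};

/*
 * Fold the shrinker's request into the scan control where the other
 * reclaim_state fields are transferred, and reset it in the same place.
 * rs may be NULL when the task has no reclaim state.
 */
static void transfer_backoff(struct reclaim_state_model *rs,
			     struct scan_control_model *sc)
{
	assert(sc != NULL);
	if (rs && rs->need_backoff) {
		sc->nr_immediate++;
		rs->need_backoff = false;
	}
}

/* A single stall decision then covers both page and slab cache cases. */
static bool should_stall(const struct scan_control_model *sc)
{
	return sc->nr_immediate != 0;
}
```

With the request transferred early, the later stall site needs only the existing `if (sc->nr.immediate)` test rather than an if/else with two identical congestion_wait() calls.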



* Re: [PATCH 02/28] xfs: Throttle commits on delayed background CIL push
  2019-11-01 21:40     ` Dave Chinner
@ 2019-11-04 22:48       ` Darrick J. Wong
  0 siblings, 0 replies; 72+ messages in thread
From: Darrick J. Wong @ 2019-11-04 22:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Brian Foster, linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Sat, Nov 02, 2019 at 08:40:40AM +1100, Dave Chinner wrote:
> On Fri, Nov 01, 2019 at 08:04:26AM -0400, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:45:52AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > In certain situations the background CIL push can be indefinitely
> > > delayed. While we have workarounds from the obvious cases now, it
> > > doesn't solve the underlying issue. This issue is that there is no
> > > upper limit on the CIL where we will either force or wait for
> > > a background push to start, hence allowing the CIL to grow without
> > > bound until it consumes all log space.
> > > 
> > > To fix this, add a new wait queue to the CIL which allows background
> > > pushes to wait for the CIL context to be switched out. This happens
> > > when the push starts, so it will allow us to block incoming
> > > transaction commit completion until the push has started. This will
> > > only affect processes that are running modifications, and only when
> > > the CIL threshold has been significantly overrun.
> > > 
> > > This has no apparent impact on performance, and doesn't even trigger
> > > until over 45 million inodes had been created in a 16-way fsmark
> > > test on a 2GB log. That was limiting at 64MB of log space used, so
> > > the active CIL size is only about 3% of the total log in that case.
> > > The concurrent removal of those files did not trigger the background
> > > sleep at all.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > 
> > I don't recall posting an R-b tag for this one...
> 
> Argh, sorry. I must have screwed up transcribing them from the
> mailing list.
> 
> > That said, I think my only outstanding feedback (side discussion aside)
> > was the code factoring in xlog_cil_push_background().
> 
> I'll go back and look at that, 'cause clearly I was looking at the
> wrong patch when I screwed up the rvb tag...

I'll keep an eye on the list for a revised series.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 06/28] xfs: factor common AIL item deletion code
  2019-10-31 23:45 ` [PATCH 06/28] xfs: factor common AIL item deletion code Dave Chinner
@ 2019-11-04 23:16   ` Darrick J. Wong
  0 siblings, 0 replies; 72+ messages in thread
From: Darrick J. Wong @ 2019-11-04 23:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:45:56AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Factor the common AIL deletion code that does all the wakeups into a
> helper so we only have one copy of this somewhat tricky code to
> interface with all the wakeups necessary when the LSN of the log
> tail changes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_inode_item.c | 12 +----------
>  fs/xfs/xfs_trans_ail.c  | 48 ++++++++++++++++++++++-------------------
>  fs/xfs/xfs_trans_priv.h |  4 +++-
>  3 files changed, 30 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index bb8f076805b9..ab12e526540a 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -743,17 +743,7 @@ xfs_iflush_done(
>  				xfs_clear_li_failed(blip);
>  			}
>  		}
> -
> -		if (mlip_changed) {
> -			if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> -				xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -			if (list_empty(&ailp->ail_head))
> -				wake_up_all(&ailp->ail_empty);
> -		}
> -		spin_unlock(&ailp->ail_lock);
> -
> -		if (mlip_changed)
> -			xfs_log_space_wake(ailp->ail_mount);
> +		xfs_ail_update_finish(ailp, mlip_changed);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 6ccfd75d3c24..656819523bbd 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -678,6 +678,27 @@ xfs_ail_push_all_sync(
>  	finish_wait(&ailp->ail_empty, &wait);
>  }
>  
> +void
> +xfs_ail_update_finish(
> +	struct xfs_ail		*ailp,
> +	bool			do_tail_update) __releases(ailp->ail_lock)
> +{
> +	struct xfs_mount	*mp = ailp->ail_mount;
> +
> +	if (!do_tail_update) {
> +		spin_unlock(&ailp->ail_lock);
> +		return;
> +	}
> +
> +	if (!XFS_FORCED_SHUTDOWN(mp))
> +		xlog_assign_tail_lsn_locked(mp);
> +
> +	if (list_empty(&ailp->ail_head))
> +		wake_up_all(&ailp->ail_empty);
> +	spin_unlock(&ailp->ail_lock);
> +	xfs_log_space_wake(mp);
> +}
> +
>  /*
>   * xfs_trans_ail_update - bulk AIL insertion operation.
>   *
> @@ -737,15 +758,7 @@ xfs_trans_ail_update_bulk(
>  	if (!list_empty(&tmp))
>  		xfs_ail_splice(ailp, cur, &tmp, lsn);
>  
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> -			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -		spin_unlock(&ailp->ail_lock);
> -
> -		xfs_log_space_wake(ailp->ail_mount);

This call site didn't have a wake_up_all and now it does; is that going
to make a difference?  I /think/ the answer is that this function
usually puts things on the AIL so we won't trigger the ail_empty wakeup;
and if the AIL was previously empty and we didn't match any log items
(such that it's still empty) then it's fine to wake up anyone who was
waiting for the ail to clear out?

--D

> -	} else {
> -		spin_unlock(&ailp->ail_lock);
> -	}
> +	xfs_ail_update_finish(ailp, mlip_changed);
>  }
>  
>  bool
> @@ -789,10 +802,10 @@ void
>  xfs_trans_ail_delete(
>  	struct xfs_ail		*ailp,
>  	struct xfs_log_item	*lip,
> -	int			shutdown_type) __releases(ailp->ail_lock)
> +	int			shutdown_type)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
> -	bool			mlip_changed;
> +	bool			need_update;
>  
>  	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
>  		spin_unlock(&ailp->ail_lock);
> @@ -805,17 +818,8 @@ xfs_trans_ail_delete(
>  		return;
>  	}
>  
> -	mlip_changed = xfs_ail_delete_one(ailp, lip);
> -	if (mlip_changed) {
> -		if (!XFS_FORCED_SHUTDOWN(mp))
> -			xlog_assign_tail_lsn_locked(mp);
> -		if (list_empty(&ailp->ail_head))
> -			wake_up_all(&ailp->ail_empty);
> -	}
> -
> -	spin_unlock(&ailp->ail_lock);
> -	if (mlip_changed)
> -		xfs_log_space_wake(ailp->ail_mount);
> +	need_update = xfs_ail_delete_one(ailp, lip);
> +	xfs_ail_update_finish(ailp, need_update);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 2e073c1c4614..64ffa746730e 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -92,8 +92,10 @@ xfs_trans_ail_update(
>  }
>  
>  bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
> +			__releases(ailp->ail_lock);
>  void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
> -		int shutdown_type) __releases(ailp->ail_lock);
> +		int shutdown_type);
>  
>  static inline void
>  xfs_trans_ail_remove(
> -- 
> 2.24.0.rc0
> 
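The wakeup question above can be checked against a small userspace model of the factored helper. This is a sketch under assumptions: `struct ail_model` and `model_ail_update_finish()` are hypothetical, the item count stands in for `list_empty(&ailp->ail_head)`, and the tail LSN assignment and shutdown check are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the AIL state the helper touches. */
struct ail_model {
	int  nr_items;		/* stand-in for the ail_head list length */
	bool locked;		/* stand-in for ail_lock being held */
	int  empty_wakeups;	/* counts wake_up_all(&ailp->ail_empty) */
	int  space_wakeups;	/* counts xfs_log_space_wake() */
};

/* Mirrors the control flow of the quoted xfs_ail_update_finish(). */
static void model_ail_update_finish(struct ail_model *ail, bool do_tail_update)
{
	assert(ail->locked);	/* caller enters with the AIL lock held */

	if (!do_tail_update) {
		ail->locked = false;
		return;
	}

	if (ail->nr_items == 0)
		ail->empty_wakeups++;
	ail->locked = false;	/* lock is dropped before the log space wakeup */
	ail->space_wakeups++;
}
```

In this model a bulk update that leaves items on the AIL never issues the ail_empty wakeup, which matches the reasoning in the question: the extra wake_up_all() is only reachable from that call site when the AIL ends up empty.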


* Re: [PATCH 07/28] xfs: tail updates only need to occur when LSN changes
  2019-10-31 23:45 ` [PATCH 07/28] xfs: tail updates only need to occur when LSN changes Dave Chinner
@ 2019-11-04 23:18   ` Darrick J. Wong
  0 siblings, 0 replies; 72+ messages in thread
From: Darrick J. Wong @ 2019-11-04 23:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:45:57AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently wake anything waiting on the log tail to move whenever
> the log item at the tail of the log is removed. Historically this
> was fine behaviour because there were very few items at any given
> LSN. But with delayed logging, there may be thousands of items at
> any given LSN, and we can't move the tail until they are all gone.
> 
> Hence if we are removing them in near tail-first order, we might be
> waking up processes waiting on the tail LSN to change (e.g. log
> space waiters) repeatedly without them being able to make progress.
> This also occurs with the new sync push waiters, and can result in
> thousands of spurious wakeups every second when under heavy direct
> reclaim pressure.
> 
> To fix this, check that the tail LSN has actually changed on the
> AIL before triggering wakeups. This will reduce the number of
> spurious wakeups when doing bulk AIL removal and make this code much
> more efficient.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode_item.c | 18 ++++++++++----
>  fs/xfs/xfs_trans_ail.c  | 52 ++++++++++++++++++++++++++++-------------
>  fs/xfs/xfs_trans_priv.h |  4 ++--
>  3 files changed, 51 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index ab12e526540a..79ffe6dff115 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -731,19 +731,27 @@ xfs_iflush_done(
>  	 * holding the lock before removing the inode from the AIL.
>  	 */
>  	if (need_ail) {
> -		bool			mlip_changed = false;
> +		xfs_lsn_t	tail_lsn = 0;
>  
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(blip, &tmp, li_bio_list) {
>  			if (INODE_ITEM(blip)->ili_logged &&
> -			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
> -				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> -			else {
> +			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
> +				/*
> +				 * xfs_ail_update_finish() only cares about the
> +				 * lsn of the first tail item removed, any
> +				 * others will be at the same or higher lsn so
> +				 * we just ignore them.
> +				 */
> +				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
> +				if (!tail_lsn && lsn)
> +					tail_lsn = lsn;
> +			} else {
>  				xfs_clear_li_failed(blip);
>  			}
>  		}
> -		xfs_ail_update_finish(ailp, mlip_changed);
> +		xfs_ail_update_finish(ailp, tail_lsn);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 656819523bbd..685a21cd24c0 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -108,17 +108,25 @@ xfs_ail_next(
>   * We need the AIL lock in order to get a coherent read of the lsn of the last
>   * item in the AIL.
>   */
> +static xfs_lsn_t
> +__xfs_ail_min_lsn(
> +	struct xfs_ail		*ailp)
> +{
> +	struct xfs_log_item	*lip = xfs_ail_min(ailp);
> +
> +	if (lip)
> +		return lip->li_lsn;
> +	return 0;
> +}
> +
>  xfs_lsn_t
>  xfs_ail_min_lsn(
>  	struct xfs_ail		*ailp)
>  {
> -	xfs_lsn_t		lsn = 0;
> -	struct xfs_log_item	*lip;
> +	xfs_lsn_t		lsn;
>  
>  	spin_lock(&ailp->ail_lock);
> -	lip = xfs_ail_min(ailp);
> -	if (lip)
> -		lsn = lip->li_lsn;
> +	lsn = __xfs_ail_min_lsn(ailp);
>  	spin_unlock(&ailp->ail_lock);
>  
>  	return lsn;
> @@ -681,11 +689,12 @@ xfs_ail_push_all_sync(
>  void
>  xfs_ail_update_finish(
>  	struct xfs_ail		*ailp,
> -	bool			do_tail_update) __releases(ailp->ail_lock)
> +	xfs_lsn_t		old_lsn) __releases(ailp->ail_lock)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
>  
> -	if (!do_tail_update) {
> +	/* if the tail lsn hasn't changed, don't do updates or wakeups. */
> +	if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) {
>  		spin_unlock(&ailp->ail_lock);
>  		return;
>  	}
> @@ -730,7 +739,7 @@ xfs_trans_ail_update_bulk(
>  	xfs_lsn_t		lsn) __releases(ailp->ail_lock)
>  {
>  	struct xfs_log_item	*mlip;
> -	int			mlip_changed = 0;
> +	xfs_lsn_t		tail_lsn = 0;
>  	int			i;
>  	LIST_HEAD(tmp);
>  
> @@ -745,9 +754,10 @@ xfs_trans_ail_update_bulk(
>  				continue;
>  
>  			trace_xfs_ail_move(lip, lip->li_lsn, lsn);
> +			if (mlip == lip && !tail_lsn)
> +				tail_lsn = lip->li_lsn;
> +
>  			xfs_ail_delete(ailp, lip);
> -			if (mlip == lip)
> -				mlip_changed = 1;
>  		} else {
>  			trace_xfs_ail_insert(lip, 0, lsn);
>  		}
> @@ -758,15 +768,23 @@ xfs_trans_ail_update_bulk(
>  	if (!list_empty(&tmp))
>  		xfs_ail_splice(ailp, cur, &tmp, lsn);
>  
> -	xfs_ail_update_finish(ailp, mlip_changed);
> +	xfs_ail_update_finish(ailp, tail_lsn);
>  }
>  
> -bool
> +/*
> + * Delete one log item from the AIL.
> + *
> + * If this item was at the tail of the AIL, return the LSN of the log item so
> + * that we can use it to check if the LSN of the tail of the log has moved
> + * when finishing up the AIL delete process in xfs_ail_update_finish().
> + */
> +xfs_lsn_t
>  xfs_ail_delete_one(
>  	struct xfs_ail		*ailp,
>  	struct xfs_log_item	*lip)
>  {
>  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
> +	xfs_lsn_t		lsn = lip->li_lsn;
>  
>  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
>  	xfs_ail_delete(ailp, lip);
> @@ -774,7 +792,9 @@ xfs_ail_delete_one(
>  	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
>  	lip->li_lsn = 0;
>  
> -	return mlip == lip;
> +	if (mlip == lip)
> +		return lsn;
> +	return 0;
>  }
>  
>  /**
> @@ -805,7 +825,7 @@ xfs_trans_ail_delete(
>  	int			shutdown_type)
>  {
>  	struct xfs_mount	*mp = ailp->ail_mount;
> -	bool			need_update;
> +	xfs_lsn_t		tail_lsn;
>  
>  	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
>  		spin_unlock(&ailp->ail_lock);
> @@ -818,8 +838,8 @@ xfs_trans_ail_delete(
>  		return;
>  	}
>  
> -	need_update = xfs_ail_delete_one(ailp, lip);
> -	xfs_ail_update_finish(ailp, need_update);
> +	tail_lsn = xfs_ail_delete_one(ailp, lip);
> +	xfs_ail_update_finish(ailp, tail_lsn);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 64ffa746730e..35655eac01a6 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -91,8 +91,8 @@ xfs_trans_ail_update(
>  	xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn);
>  }
>  
> -bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> -void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update)
> +xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
> +void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn)
>  			__releases(ailp->ail_lock);
>  void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
>  		int shutdown_type);
> -- 
> 2.24.0.rc0
> 
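The "only wake when the tail LSN actually changed" rule from this patch can be modelled in a few lines of userspace C. This is a sketch, not the kernel implementation: `lsn_model_t`, `model_ail_min_lsn()` and `model_tail_moved()` are hypothetical, and the AIL is represented as an array of LSNs sorted in ascending order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t lsn_model_t;	/* hypothetical stand-in for xfs_lsn_t */

/* LSN of the first (lowest) item in a sorted AIL model, or 0 if empty. */
static lsn_model_t model_ail_min_lsn(const lsn_model_t *lsns, size_t n)
{
	return n ? lsns[0] : 0;
}

/*
 * old_lsn is the LSN of the tail item that was removed, or 0 if no tail
 * item was removed. Wakeups are only needed when the tail has moved to a
 * different LSN, mirroring the early return in xfs_ail_update_finish().
 */
static bool model_tail_moved(lsn_model_t old_lsn,
			     const lsn_model_t *remaining, size_t n)
{
	assert(remaining != NULL || n == 0);
	if (!old_lsn || old_lsn == model_ail_min_lsn(remaining, n))
		return false;
	return true;
}
```

When thousands of items share the tail LSN, removing one of them leaves the minimum unchanged, so the model reports no movement and the waiters stay asleep until the last item at that LSN is gone.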


* Re: [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster
  2019-10-31 23:45 ` [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
  2019-11-01 12:05   ` Brian Foster
@ 2019-11-04 23:20   ` Darrick J. Wong
  1 sibling, 0 replies; 72+ messages in thread
From: Darrick J. Wong @ 2019-11-04 23:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:45:58AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There's lots of indent in this code which makes it a bit hard to
> follow. We are also going to completely rework the inode lookup code
> as part of the inode reclaim rework, so factor out the inode lookup
> code from the inode cluster freeing code.
> 
> Based on prototype code from Christoph Hellwig.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Seems pretty straightforward,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode.c | 152 +++++++++++++++++++++++++--------------------
>  1 file changed, 84 insertions(+), 68 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index e9e4f444f8ce..33edb18098ca 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2516,6 +2516,88 @@ xfs_iunlink_remove(
>  	return error;
>  }
>  
> +/*
> + * Look up the inode number specified and mark it stale if it is found. If it is
> + * dirty, return the inode so it can be attached to the cluster buffer so it can
> + * be processed appropriately when the cluster free transaction completes.
> + */
> +static struct xfs_inode *
> +xfs_ifree_get_one_inode(
> +	struct xfs_perag	*pag,
> +	struct xfs_inode	*free_ip,
> +	int			inum)
> +{
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode	*ip;
> +
> +retry:
> +	rcu_read_lock();
> +	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
> +
> +	/* Inode not in memory, nothing to do */
> +	if (!ip)
> +		goto out_rcu_unlock;
> +
> +	/*
> +	 * because this is an RCU protected lookup, we could find a recently
> +	 * freed or even reallocated inode during the lookup. We need to check
> +	 * under the i_flags_lock for a valid inode here. Skip it if it is not
> +	 * valid, the wrong inode or stale.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
> +		spin_unlock(&ip->i_flags_lock);
> +		goto out_rcu_unlock;
> +	}
> +	spin_unlock(&ip->i_flags_lock);
> +
> +	/*
> +	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
> +	 * other inodes that we did not find in the list attached to the buffer
> +	 * and are not already marked stale. If we can't lock it, back off and
> +	 * retry.
> +	 */
> +	if (ip != free_ip) {
> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> +			rcu_read_unlock();
> +			delay(1);
> +			goto retry;
> +		}
> +
> +		/*
> +		 * Check the inode number again in case we're racing with
> +		 * freeing in xfs_reclaim_inode().  See the comments in that
> +		 * function for more information as to why the initial check is
> +		 * not sufficient.
> +		 */
> +		if (ip->i_ino != inum) {
> +			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +			goto out_rcu_unlock;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	xfs_iflock(ip);
> +	xfs_iflags_set(ip, XFS_ISTALE);
> +
> +	/*
> +	 * We don't need to attach clean inodes or those only with unlogged
> +	 * changes (which we throw away, anyway).
> +	 */
> +	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> +		ASSERT(ip != free_ip);
> +		xfs_ifunlock(ip);
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +		goto out_no_inode;
> +	}
> +	return ip;
> +
> +out_rcu_unlock:
> +	rcu_read_unlock();
> +out_no_inode:
> +	return NULL;
> +}
> +
>  /*
>   * A big issue when freeing the inode cluster is that we _cannot_ skip any
>   * inodes that are in memory - they all must be marked stale and attached to
> @@ -2616,77 +2698,11 @@ xfs_ifree_cluster(
>  		 * even trying to lock them.
>  		 */
>  		for (i = 0; i < igeo->inodes_per_cluster; i++) {
> -retry:
> -			rcu_read_lock();
> -			ip = radix_tree_lookup(&pag->pag_ici_root,
> -					XFS_INO_TO_AGINO(mp, (inum + i)));
> -
> -			/* Inode not in memory, nothing to do */
> -			if (!ip) {
> -				rcu_read_unlock();
> +			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
> +			if (!ip)
>  				continue;
> -			}
> -
> -			/*
> -			 * because this is an RCU protected lookup, we could
> -			 * find a recently freed or even reallocated inode
> -			 * during the lookup. We need to check under the
> -			 * i_flags_lock for a valid inode here. Skip it if it
> -			 * is not valid, the wrong inode or stale.
> -			 */
> -			spin_lock(&ip->i_flags_lock);
> -			if (ip->i_ino != inum + i ||
> -			    __xfs_iflags_test(ip, XFS_ISTALE)) {
> -				spin_unlock(&ip->i_flags_lock);
> -				rcu_read_unlock();
> -				continue;
> -			}
> -			spin_unlock(&ip->i_flags_lock);
> -
> -			/*
> -			 * Don't try to lock/unlock the current inode, but we
> -			 * _cannot_ skip the other inodes that we did not find
> -			 * in the list attached to the buffer and are not
> -			 * already marked stale. If we can't lock it, back off
> -			 * and retry.
> -			 */
> -			if (ip != free_ip) {
> -				if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> -					rcu_read_unlock();
> -					delay(1);
> -					goto retry;
> -				}
> -
> -				/*
> -				 * Check the inode number again in case we're
> -				 * racing with freeing in xfs_reclaim_inode().
> -				 * See the comments in that function for more
> -				 * information as to why the initial check is
> -				 * not sufficient.
> -				 */
> -				if (ip->i_ino != inum + i) {
> -					xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -					rcu_read_unlock();
> -					continue;
> -				}
> -			}
> -			rcu_read_unlock();
>  
> -			xfs_iflock(ip);
> -			xfs_iflags_set(ip, XFS_ISTALE);
> -
> -			/*
> -			 * we don't need to attach clean inodes or those only
> -			 * with unlogged changes (which we throw away, anyway).
> -			 */
>  			iip = ip->i_itemp;
> -			if (!iip || xfs_inode_clean(ip)) {
> -				ASSERT(ip != free_ip);
> -				xfs_ifunlock(ip);
> -				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -				continue;
> -			}
> -
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> -- 
> 2.24.0.rc0
> 


* Re: [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability
  2019-10-31 23:45 ` [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability Dave Chinner
  2019-11-01 12:05   ` Brian Foster
@ 2019-11-04 23:21   ` Darrick J. Wong
  1 sibling, 0 replies; 72+ messages in thread
From: Darrick J. Wong @ 2019-11-04 23:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:45:54AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
> 
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with a
> shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
> 
> We also have a lot of xfs_bufs in a busy system, so spread them around
> like we do inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1e63dd3d1257..d34e5d2edacd 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -324,6 +324,9 @@ xfs_buf_free(
>  
>  			__free_page(page);
>  		}
> +		if (current->reclaim_state)
> +			current->reclaim_state->reclaimed_slab +=
> +							bp->b_page_count;
>  	} else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  	_xfs_buf_free_pages(bp);
> @@ -2061,7 +2064,8 @@ int __init
>  xfs_buf_init(void)
>  {
>  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> -						KM_ZONE_HWALIGN, NULL);
> +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,

As discussed on the previous iteration of this series, I'd like to
capture the reasons for adding KM_ZONE_SPREAD as a separate patch.

--D

> +			NULL);
>  	if (!xfs_buf_zone)
>  		goto out;
>  
> -- 
> 2.24.0.rc0
> 


* Re: [PATCH 17/28] xfs: synchronous AIL pushing
  2019-10-31 23:46 ` [PATCH 17/28] xfs: synchronous AIL pushing Dave Chinner
@ 2019-11-05 17:05   ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-05 17:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:07AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Provide an interface to push the AIL to a target LSN and wait for
> the tail of the log to move past that LSN. This is used to wait for
> all items older than a specific LSN to either be cleaned (written
> back) or relogged to a higher LSN in the AIL. The primary use for
> this is to allow IO free inode reclaim throttling.
> 
> Factor the common AIL deletion code that does all the wakeups into a
> helper so we only have one copy of this somewhat tricky code to
> interface with all the wakeups necessary when the LSN of the log
> tail changes.
> 

The above paragraph doesn't seem applicable to this patch. With that
fixed:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> xfs_ail_push_sync() is temporary infrastructure to facilitate
> non-blocking, IO-less inode reclaim throttling that allows further
> structural changes to be made. Once those structural changes are
> made, the need for this function goes away and it is removed. In
> essence, it is only provided to ensure git bisects don't break while
> the changes to the reclaim algorithms are in progress.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_trans_ail.c  | 32 ++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans_priv.h |  2 ++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 685a21cd24c0..3e1d0e1439e2 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -662,6 +662,36 @@ xfs_ail_push_all(
>  		xfs_ail_push(ailp, threshold_lsn);
>  }
>  
> +/*
> + * Push the AIL to a specific lsn and wait for it to complete.
> + */
> +void
> +xfs_ail_push_sync(
> +	struct xfs_ail		*ailp,
> +	xfs_lsn_t		threshold_lsn)
> +{
> +	struct xfs_log_item	*lip;
> +	DEFINE_WAIT(wait);
> +
> +	spin_lock(&ailp->ail_lock);
> +	while ((lip = xfs_ail_min(ailp)) != NULL) {
> +		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> +		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> +		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
> +			break;
> +		if (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
> +			ailp->ail_target = threshold_lsn;
> +		wake_up_process(ailp->ail_task);
> +		spin_unlock(&ailp->ail_lock);
> +		schedule();
> +		spin_lock(&ailp->ail_lock);
> +	}
> +	spin_unlock(&ailp->ail_lock);
> +
> +	finish_wait(&ailp->ail_push, &wait);
> +}
> +
> +
>  /*
>   * Push out all items in the AIL immediately and wait until the AIL is empty.
>   */
> @@ -702,6 +732,7 @@ xfs_ail_update_finish(
>  	if (!XFS_FORCED_SHUTDOWN(mp))
>  		xlog_assign_tail_lsn_locked(mp);
>  
> +	wake_up_all(&ailp->ail_push);
>  	if (list_empty(&ailp->ail_head))
>  		wake_up_all(&ailp->ail_empty);
>  	spin_unlock(&ailp->ail_lock);
> @@ -858,6 +889,7 @@ xfs_trans_ail_init(
>  	spin_lock_init(&ailp->ail_lock);
>  	INIT_LIST_HEAD(&ailp->ail_buf_list);
>  	init_waitqueue_head(&ailp->ail_empty);
> +	init_waitqueue_head(&ailp->ail_push);
>  
>  	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
>  			ailp->ail_mount->m_fsname);
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 35655eac01a6..1b6f4bbd47c0 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -61,6 +61,7 @@ struct xfs_ail {
>  	int			ail_log_flush;
>  	struct list_head	ail_buf_list;
>  	wait_queue_head_t	ail_empty;
> +	wait_queue_head_t	ail_push;
>  };
>  
>  /*
> @@ -113,6 +114,7 @@ xfs_trans_ail_remove(
>  }
>  
>  void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
> +void			xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t);
>  void			xfs_ail_push_all(struct xfs_ail *);
>  void			xfs_ail_push_all_sync(struct xfs_ail *);
>  struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
> -- 
> 2.24.0.rc0
> 
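
For readers following the wait loop in xfs_ail_push_sync() above, here
is a minimal userspace sketch of its exit condition. It is not the
kernel code: lsn_t, lsn_make(), lsn_cmp() and push_sync_done() are
hypothetical stand-ins, the comparison assumes the usual cycle-then-block
LSN ordering, and the shutdown check from the patch is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t lsn_t;	/* hypothetical stand-in for xfs_lsn_t */

/* An LSN packs the log cycle in the high 32 bits, the block in the low. */
static inline uint32_t lsn_cycle(lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn >> 32);
}

static inline uint32_t lsn_block(lsn_t lsn)
{
	return (uint32_t)lsn;
}

lsn_t lsn_make(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Returns <0, 0 or >0: compare the cycle first, then the block. */
int lsn_cmp(lsn_t a, lsn_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * The waiter stops once the AIL is empty or the oldest item left in it
 * is strictly newer than the threshold. An item sitting exactly at the
 * threshold still keeps the waiter asleep.
 */
bool push_sync_done(bool ail_empty, lsn_t min_lsn, lsn_t threshold)
{
	return ail_empty || lsn_cmp(threshold, min_lsn) < 0;
}
```

Each pass of the real loop re-evaluates this under the AIL lock after
being woken via the ail_push waitqueue, which is why
xfs_ail_update_finish() gains the wake_up_all() call.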



* Re: [PATCH 19/28] xfs: reduce kswapd blocking on inode locking.
  2019-10-31 23:46 ` [PATCH 19/28] xfs: reduce kswapd blocking on inode locking Dave Chinner
@ 2019-11-05 17:05   ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-05 17:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When doing async inode reclaim, we grab a batch of inodes that we
> are likely able to reclaim and ignore those that are already
> flushing. However, when we actually go to reclaim them, the first
> thing we do is lock the inode. If we are racing with something
> else reclaiming the inode or flushing it because it is dirty,
> we block on the inode lock. Hence we can still block kswapd here.
> 
> Further, if we flush an inode, we also cluster all the other dirty
> inodes in that cluster into the same IO, flush locking them all.
> However, if the workload is operating on sequential inodes (e.g.
> created by a tarball extraction) most of these inodes will be
> sequential in the cache and so in the same batch we've already
> grabbed for reclaim scanning.
> 
> As a result, it is common for all the inodes in the batch to be
> dirty and it is common for the first inode flushed to also flush all
> the inodes in the reclaim batch. In which case, they are now all
> going to be flush locked and we do not want to block on them.
> 
> Hence, for async reclaim (SYNC_TRYLOCK) make sure we always use
> trylock semantics and abort reclaim of an inode as quickly as we can
> without blocking kswapd. This will be necessary for the upcoming
> conversion to LRU lists for inode reclaim tracking.
> 
> Found via tracing and finding big batches of repeated lock/unlock
> runs on inodes that we just flushed by write clustering during
> reclaim.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_icache.c | 23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index edcc3f6bb3bf..189cf423fe8f 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1104,11 +1104,23 @@ xfs_reclaim_inode(
>  
>  restart:
>  	error = 0;
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	if (!xfs_iflock_nowait(ip)) {
> -		if (!(sync_mode & SYNC_WAIT))
> +	/*
> +	 * Don't try to flush the inode if another inode in this cluster has
> +	 * already flushed it after we did the initial checks in
> +	 * xfs_reclaim_inode_grab().
> +	 */
> +	if (sync_mode & SYNC_TRYLOCK) {
> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
>  			goto out;
> -		xfs_iflock(ip);
> +		if (!xfs_iflock_nowait(ip))
> +			goto out_unlock;
> +	} else {
> +		xfs_ilock(ip, XFS_ILOCK_EXCL);
> +		if (!xfs_iflock_nowait(ip)) {
> +			if (!(sync_mode & SYNC_WAIT))
> +				goto out_unlock;
> +			xfs_iflock(ip);
> +		}
>  	}
>  
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> @@ -1215,9 +1227,10 @@ xfs_reclaim_inode(
>  
>  out_ifunlock:
>  	xfs_ifunlock(ip);
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  out:
>  	xfs_iflags_clear(ip, XFS_IRECLAIM);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	/*
>  	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
>  	 * a short while. However, this just burns CPU time scanning the tree
> -- 
> 2.24.0.rc0
> 
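
The SYNC_TRYLOCK branch above can be sketched in userspace roughly as
follows. This is only an illustration of the trylock-and-unwind
ordering, assuming C11 atomic_flag as a stand-in for both the ILOCK and
the flush lock; struct demo_locked_inode and the demo_*() helpers are
hypothetical names, not XFS interfaces.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode and its two locks. */
struct demo_locked_inode {
	atomic_flag	ilock;		/* models XFS_ILOCK_EXCL */
	atomic_flag	flush_lock;	/* models the inode flush lock */
};

void demo_inode_init(struct demo_locked_inode *ip)
{
	atomic_flag_clear(&ip->ilock);
	atomic_flag_clear(&ip->flush_lock);
}

/* Returns true if the lock was free and is now held by the caller. */
bool demo_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

void demo_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Non-blocking grab of both locks. On failure nothing is left held, so
 * the caller can move on to the next inode without ever sleeping.
 */
bool demo_reclaim_trylock(struct demo_locked_inode *ip)
{
	if (!demo_trylock(&ip->ilock))
		return false;			/* like "goto out" */
	if (!demo_trylock(&ip->flush_lock)) {
		demo_unlock(&ip->ilock);	/* like "goto out_unlock" */
		return false;
	}
	return true;
}
```

The point of the new out_unlock label in the patch is the same as the
inner unwind here: the ILOCK is only dropped on paths that actually
took it.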



* Re: [PATCH 20/28] xfs: kill background reclaim work
  2019-10-31 23:46 ` [PATCH 20/28] xfs: kill background reclaim work Dave Chinner
@ 2019-11-05 17:05   ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-05 17:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:10AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> This function is now entirely done by kswapd, so we don't need the
> worker thread to do async reclaim anymore.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_icache.c | 44 --------------------------------------------
>  fs/xfs/xfs_icache.h |  2 --
>  fs/xfs/xfs_mount.c  |  2 --
>  fs/xfs/xfs_mount.h  |  2 --
>  fs/xfs/xfs_super.c  | 11 +----------
>  5 files changed, 1 insertion(+), 60 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 189cf423fe8f..7e175304e146 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -138,44 +138,6 @@ xfs_inode_free(
>  	__xfs_inode_free(ip);
>  }
>  
> -/*
> - * Queue a new inode reclaim pass if there are reclaimable inodes and there
> - * isn't a reclaim pass already in progress. By default it runs every 5s based
> - * on the xfs periodic sync default of 30s. Perhaps this should have it's own
> - * tunable, but that can be done if this method proves to be ineffective or too
> - * aggressive.
> - */
> -static void
> -xfs_reclaim_work_queue(
> -	struct xfs_mount        *mp)
> -{
> -
> -	rcu_read_lock();
> -	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
> -		queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work,
> -			msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
> -	}
> -	rcu_read_unlock();
> -}
> -
> -/*
> - * This is a fast pass over the inode cache to try to get reclaim moving on as
> - * many inodes as possible in a short period of time. It kicks itself every few
> - * seconds, as well as being kicked by the inode cache shrinker when memory
> - * goes low. It scans as quickly as possible avoiding locked inodes or those
> - * already being flushed, and once done schedules a future pass.
> - */
> -void
> -xfs_reclaim_worker(
> -	struct work_struct *work)
> -{
> -	struct xfs_mount *mp = container_of(to_delayed_work(work),
> -					struct xfs_mount, m_reclaim_work);
> -
> -	xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
> -	xfs_reclaim_work_queue(mp);
> -}
> -
>  static void
>  xfs_perag_set_reclaim_tag(
>  	struct xfs_perag	*pag)
> @@ -192,9 +154,6 @@ xfs_perag_set_reclaim_tag(
>  			   XFS_ICI_RECLAIM_TAG);
>  	spin_unlock(&mp->m_perag_lock);
>  
> -	/* schedule periodic background inode reclaim */
> -	xfs_reclaim_work_queue(mp);
> -
>  	trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
>  }
>  
> @@ -1393,9 +1352,6 @@ xfs_reclaim_inodes_nr(
>  {
>  	int			sync_mode = SYNC_TRYLOCK;
>  
> -	/* kick background reclaimer */
> -	xfs_reclaim_work_queue(mp);
> -
>  	/*
>  	 * For kswapd, we kick background inode writeback. For direct
>  	 * reclaim, we issue and wait on inode writeback to throttle
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index 48f1fd2bb6ad..4c0d8920cc54 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -49,8 +49,6 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
>  struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
>  void xfs_inode_free(struct xfs_inode *ip);
>  
> -void xfs_reclaim_worker(struct work_struct *work);
> -
>  int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
>  int xfs_reclaim_inodes_count(struct xfs_mount *mp);
>  long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 3e8eedf01eb2..8f76c2add18b 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -952,7 +952,6 @@ xfs_mountfs(
>  	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
>  	 * quota inodes.
>  	 */
> -	cancel_delayed_work_sync(&mp->m_reclaim_work);
>  	xfs_reclaim_inodes(mp, SYNC_WAIT);
>  	xfs_health_unmount(mp);
>   out_log_dealloc:
> @@ -1035,7 +1034,6 @@ xfs_unmountfs(
>  	 * reclaim just to be sure. We can stop background inode reclaim
>  	 * here as well if it is still running.
>  	 */
> -	cancel_delayed_work_sync(&mp->m_reclaim_work);
>  	xfs_reclaim_inodes(mp, SYNC_WAIT);
>  	xfs_health_unmount(mp);
>  
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index a46cb3fd24b1..8c6885d3b085 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -163,7 +163,6 @@ typedef struct xfs_mount {
>  	uint			m_chsize;	/* size of next field */
>  	atomic_t		m_active_trans;	/* number trans frozen */
>  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
> -	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
>  	struct delayed_work	m_eofblocks_work; /* background eof blocks
>  						     trimming */
>  	struct delayed_work	m_cowblocks_work; /* background cow blocks
> @@ -180,7 +179,6 @@ typedef struct xfs_mount {
>  	struct workqueue_struct *m_buf_workqueue;
>  	struct workqueue_struct	*m_unwritten_workqueue;
>  	struct workqueue_struct	*m_cil_workqueue;
> -	struct workqueue_struct	*m_reclaim_workqueue;
>  	struct workqueue_struct *m_eofblocks_workqueue;
>  	struct workqueue_struct	*m_sync_workqueue;
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ebe2ccd36127..a4fe679207ef 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -794,15 +794,10 @@ xfs_init_mount_workqueues(
>  	if (!mp->m_cil_workqueue)
>  		goto out_destroy_unwritten;
>  
> -	mp->m_reclaim_workqueue = alloc_workqueue("xfs-reclaim/%s",
> -			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
> -	if (!mp->m_reclaim_workqueue)
> -		goto out_destroy_cil;
> -
>  	mp->m_eofblocks_workqueue = alloc_workqueue("xfs-eofblocks/%s",
>  			WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
>  	if (!mp->m_eofblocks_workqueue)
> -		goto out_destroy_reclaim;
> +		goto out_destroy_cil;
>  
>  	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", WQ_FREEZABLE, 0,
>  					       mp->m_fsname);
> @@ -813,8 +808,6 @@ xfs_init_mount_workqueues(
>  
>  out_destroy_eofb:
>  	destroy_workqueue(mp->m_eofblocks_workqueue);
> -out_destroy_reclaim:
> -	destroy_workqueue(mp->m_reclaim_workqueue);
>  out_destroy_cil:
>  	destroy_workqueue(mp->m_cil_workqueue);
>  out_destroy_unwritten:
> @@ -831,7 +824,6 @@ xfs_destroy_mount_workqueues(
>  {
>  	destroy_workqueue(mp->m_sync_workqueue);
>  	destroy_workqueue(mp->m_eofblocks_workqueue);
> -	destroy_workqueue(mp->m_reclaim_workqueue);
>  	destroy_workqueue(mp->m_cil_workqueue);
>  	destroy_workqueue(mp->m_unwritten_workqueue);
>  	destroy_workqueue(mp->m_buf_workqueue);
> @@ -1520,7 +1512,6 @@ xfs_mount_alloc(
>  	spin_lock_init(&mp->m_perag_lock);
>  	mutex_init(&mp->m_growlock);
>  	atomic_set(&mp->m_active_trans, 0);
> -	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
>  	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
>  	INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
>  	mp->m_kobj.kobject.kset = xfs_kset;
> -- 
> 2.24.0.rc0
> 



* Re: [PATCH 21/28] xfs: use AIL pushing for inode reclaim IO
  2019-10-31 23:46 ` [PATCH 21/28] xfs: use AIL pushing for inode reclaim IO Dave Chinner
@ 2019-11-05 17:06   ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-05 17:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:11AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode reclaim currently issues its own inode IO when it comes
> across dirty inodes. This is used to throttle direct reclaim down to
> the rate at which we can reclaim dirty inodes. Failure to throttle
> in this manner results in the OOM killer being trivial to trigger
> even when there is lots of free memory available.
> 
> However, having direct reclaimers issue IO causes an amount of
> IO thrashing to occur. We can have up to the number of AGs in the
> filesystem concurrently issuing IO, plus the AIL pushing thread as
> well. This means we can have many competing sources of IO and they
> all end up thrashing and competing for the request slots in the
> block device.
> 
> Similar to dirty page throttling and the BDI flusher thread, we can
> use the AIL pushing thread as the sole place we issue inode
> writeback from, and everything else waits for it to make progress.
> To do this, reclaim will skip over dirty inodes, but in doing so
> will record the lowest LSN of all the dirty inodes it skips. It will
> then push the AIL to this LSN and wait for it to complete that work.
> 
> In doing so, we block direct reclaim on the completion of at least
> one IO, thereby providing some level of throttling for when we
> encounter dirty inodes. However we gain the ability to scan and
> reclaim clean inodes in a non-blocking fashion.
> 
> Hence direct reclaim will be throttled directly by the rate at which
> dirty inodes are cleaned by AIL pushing, rather than by delays
> caused by competing IO submissions. This allows us to reduce the
> locking that limits direct reclaim concurrency to just protecting
> the reclaim cursor state, hence greatly simplifying the inode
> reclaim code as it now just skips dirty inodes.
> 
> Note: this patch by itself isn't completely able to throttle direct
> reclaim sufficiently to prevent OOM killer madness. We can't do that
> until we change the way we index reclaimable inodes in the next
> patch and can feed back state to the mm core sanely.  However, we
> can't change the way we index reclaimable inodes until we have
> IO-less non-blocking reclaim for both direct reclaim and kswapd
> reclaim.  Catch-22...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

A couple random nits and a question...

>  fs/xfs/xfs_icache.c | 218 ++++++++++++++++++--------------------------
>  1 file changed, 89 insertions(+), 129 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 7e175304e146..ff8ae32614a6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
...
> @@ -1262,9 +1227,13 @@ xfs_reclaim_inodes_ag(
>  			for (i = 0; i < nr_found; i++) {
>  				struct xfs_inode *ip = batch[i];
>  
> -				if (done || xfs_reclaim_inode_grab(ip, flags))
> +				if (done ||
> +				    !xfs_reclaim_inode_grab(ip, flags, &lsn))
>  					batch[i] = NULL;

Doesn't look like we can get here with done != 0.

>  
> +				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
> +					lowest_lsn = lsn;
> +

This should probably have the same NULLCOMMITLSN treatment as the
similar check below.

>  				/*
>  				 * Update the index for the next lookup. Catch
>  				 * overflows into the next AG range which can
> @@ -1289,41 +1258,34 @@ xfs_reclaim_inodes_ag(
...
>  
> -	/*
> -	 * if we skipped any AG, and we still have scan count remaining, do
> -	 * another pass this time using blocking reclaim semantics (i.e
> -	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
> -	 * ensure that when we get more reclaimers than AGs we block rather
> -	 * than spin trying to execute reclaim.
> -	 */
> -	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
> -		trylock = 0;
> -		goto restart;
> -	}
> -	return last_error;
> +	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
> +		xfs_ail_push_sync(mp->m_ail, lowest_lsn);
> +
> +	return freed;

Hm, this should have always been returning a free count instead of an
error code right? If so, I'd normally suggest to fix this as an
independent patch, but it's probably not worth splitting up at this
point.

The aforementioned nits seem harmless and the code is going away, so:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  }
>  
>  int
> @@ -1331,9 +1293,7 @@ xfs_reclaim_inodes(
>  	xfs_mount_t	*mp,
>  	int		mode)
>  {
> -	int		nr_to_scan = INT_MAX;
> -
> -	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	return xfs_reclaim_inodes_ag(mp, mode, INT_MAX);
>  }
>  
>  /*
> @@ -1350,7 +1310,7 @@ xfs_reclaim_inodes_nr(
>  	struct xfs_mount	*mp,
>  	int			nr_to_scan)
>  {
> -	int			sync_mode = SYNC_TRYLOCK;
> +	int			sync_mode = 0;
>  
>  	/*
>  	 * For kswapd, we kick background inode writeback. For direct
> @@ -1362,7 +1322,7 @@ xfs_reclaim_inodes_nr(
>  	else
>  		sync_mode |= SYNC_WAIT;
>  
> -	return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
> +	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
>  }
>  
>  /*
> -- 
> 2.24.0.rc0
> 
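
The skip-and-record scheme described in the commit message can be
sketched in userspace as below. It assumes a simplified model in which
every clean inode is freed and every dirty inode is skipped, and it uses
an explicit sentinel check (the "NULLCOMMITLSN treatment" mentioned
above) instead of relying on the comparison. All demo_* names and
DEMO_NULL_LSN are hypothetical, not XFS interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t demo_lsn_t;			/* stand-in for xfs_lsn_t */
#define DEMO_NULL_LSN	((demo_lsn_t)-1)	/* "nothing to push" sentinel */
#define DEMO_LSN(cycle, block) \
	((demo_lsn_t)(((uint64_t)(cycle) << 32) | (uint32_t)(block)))

struct demo_dirty_inode {
	demo_lsn_t	lsn;	/* 0 when clean, else its position in the AIL */
};

struct demo_scan {
	demo_lsn_t	lowest_lsn;	/* oldest dirty inode skipped, if any */
	size_t		freed;		/* clean inodes reclaimed */
};

/* Walk one batch: free clean inodes, skip dirty ones and track the oldest. */
struct demo_scan demo_scan_batch(const struct demo_dirty_inode *batch,
				 size_t nr)
{
	struct demo_scan scan = { .lowest_lsn = DEMO_NULL_LSN, .freed = 0 };
	size_t i;

	for (i = 0; i < nr; i++) {
		demo_lsn_t lsn = batch[i].lsn;

		if (!lsn) {
			scan.freed++;
			continue;
		}
		/* Cycle-then-block ordering equals unsigned 64-bit ordering. */
		if (scan.lowest_lsn == DEMO_NULL_LSN ||
		    (uint64_t)lsn < (uint64_t)scan.lowest_lsn)
			scan.lowest_lsn = lsn;
	}
	return scan;
}

/* Only blocking (SYNC_WAIT-style) callers wait on the AIL push. */
int demo_needs_push(const struct demo_scan *scan, int sync_wait)
{
	return sync_wait && scan->lowest_lsn != DEMO_NULL_LSN;
}
```

With this shape, a scan that saw only clean inodes never touches the
AIL, while a scan that skipped dirty ones waits for exactly the oldest
of them.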



* Re: [PATCH 24/28] xfs: reclaim inodes from the LRU
  2019-10-31 23:46 ` [PATCH 24/28] xfs: reclaim inodes from the LRU Dave Chinner
@ 2019-11-06 17:21   ` Brian Foster
  2019-11-14 21:51     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-06 17:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Replace the AG radix tree walking reclaim code with a list_lru
> walker, giving us both node-aware and memcg-aware inode reclaim
> at the XFS level. This requires adding an inode isolation function to
> determine if the inode can be reclaimed, and a list walker to
> dispose of the inodes that were isolated.
> 
> We want the isolation function to be non-blocking. If we can't
> grab an inode then we either skip it or rotate it. If it's clean
> then we skip it, if it's dirty then we rotate to give it time to be
> cleaned before it is scanned again.
> 
> This congregates the dirty inodes at the tail of the LRU, which
> means that if we start hitting a majority of dirty inodes either
> there are lots of unlinked inodes in the reclaim list or we've
> reclaimed all the clean inodes and we're looped back on the dirty
> inodes. Either way, this is an indication we should tell kswapd to
> back off.
> 
> The non-blocking isolation function introduces a complexity for the
> filesystem shutdown case. When the filesystem is shut down, we want
> to free the inode even if it is dirty, and this may require
> blocking. We already hold the locks needed to do this blocking, so
> what we do is that we leave inodes locked - both the ILOCK and the
> flush lock - while they are sitting on the dispose list to be freed
> after the LRU walk completes.  This allows us to process the
> shutdown state outside the LRU walk where we can block safely.
> 
> Because we now are reclaiming inodes from the context that it needs
> memory in (memcg and/or node), direct reclaim throttling within the
> high level reclaim code is now much more effective. Hence we don't
> wait on IO for either kswapd or direct reclaim. However, we have to
> tell kswapd to back off if we start hitting too many dirty inodes.
> This implies we've wrapped around the LRU and don't have many clean
> inodes left to reclaim, so it needs to wait a while for the AIL
> pushing to clean some of the remaining reclaimable inodes.
> 
> Keep in mind we don't have to care about inode lock order or
> blocking with inode locks held here because a) we are using
> trylocks, and b) once marked with XFS_IRECLAIM they can't be found
> via the LRU and inode cache lookups will abort and retry. Hence
> nobody will try to lock them in any other context that might also be
> holding other inode locks.
> 
> Also convert xfs_reclaim_all_inodes() to use a LRU walk to free all
> the reclaimable inodes in the filesystem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks fundamentally sane. Some logic quibbles..

>  fs/xfs/xfs_icache.c | 404 +++++++++++++-------------------------------
>  fs/xfs/xfs_icache.h |  18 +-
>  fs/xfs/xfs_inode.h  |  18 ++
>  fs/xfs/xfs_super.c  |  46 ++++-
>  4 files changed, 190 insertions(+), 296 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 350f42e7730b..05dd292bfdb6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -968,160 +968,110 @@ xfs_inode_ag_iterator_tag(
>  	return last_error;
>  }
>  
> -/*
> - * Grab the inode for reclaim.
> - *
> - * Return false if we aren't going to reclaim it, true if it is a reclaim
> - * candidate.
> - *
> - * If the inode is clean or unreclaimable, return 0 to tell the caller it does
> - * not require flushing. Otherwise return the log item lsn of the inode so the
> - * caller can determine it's inode flush target.  If we get the clean/dirty
> - * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks
> - * held.
> - */
> -STATIC bool
> -xfs_reclaim_inode_grab(
> -	struct xfs_inode	*ip,
> -	int			flags,
> -	xfs_lsn_t		*lsn)
> +enum lru_status
> +xfs_inode_reclaim_isolate(
> +	struct list_head	*item,
> +	struct list_lru_one	*lru,
> +	spinlock_t		*lru_lock,

Did we ever establish whether we should cycle the lru_lock during long
running scans?

> +	void			*arg)
>  {
> -	ASSERT(rcu_read_lock_held());
> -	*lsn = 0;
> +        struct xfs_ireclaim_args *ra = arg;
> +        struct inode		*inode = container_of(item, struct inode,
> +						      i_lru);
> +        struct xfs_inode	*ip = XFS_I(inode);

Whitespace damage on the above lines (space indentation vs tabs).

> +	enum lru_status		ret;
> +	xfs_lsn_t		lsn = 0;
> +
> +	/* Careful: inversion of iflags_lock and everything else here */
> +	if (!spin_trylock(&ip->i_flags_lock))
> +		return LRU_SKIP;
> +
> +	/* if we are in shutdown, we'll reclaim it even if dirty */
> +	ret = LRU_ROTATE;
> +	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE) &&
> +	    !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;
> +		ra->dirty_skipped++;
> +		goto out_unlock_flags;
> +	}
>  
> -	/* quick check for stale RCU freed inode */
> -	if (!ip->i_ino)
> -		return false;
> +	ret = LRU_SKIP;
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> +		goto out_unlock_flags;
>  
> -	/*
> -	 * Do unlocked checks to see if the inode already is being flushed or in
> -	 * reclaim to avoid lock traffic. If the inode is not clean, return the
> -	 * position in the AIL for the caller to push to.
> -	 */
> -	if (!xfs_inode_clean(ip)) {
> -		*lsn = ip->i_itemp->ili_item.li_lsn;
> -		return false;
> +	if (!__xfs_iflock_nowait(ip)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;

This looks like a potential crash vector if we ever got here with a
clean inode.

> +		ra->dirty_skipped++;
> +		goto out_unlock_inode;
>  	}
>  
> -	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> -		return false;
> +	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> +		goto reclaim;
>  
>  	/*
> -	 * The radix tree lock here protects a thread in xfs_iget from racing
> -	 * with us starting reclaim on the inode.  Once we have the
> -	 * XFS_IRECLAIM flag set it will not touch us.
> -	 *
> -	 * Due to RCU lookup, we may find inodes that have been freed and only
> -	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
> -	 * aren't candidates for reclaim at all, so we must check the
> -	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> +	 * Now the inode is locked, we can actually determine if it is dirty
> +	 * without racing with anything.
>  	 */
> -	spin_lock(&ip->i_flags_lock);
> -	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
> -	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
> -		/* not a reclaim candidate. */
> -		spin_unlock(&ip->i_flags_lock);
> -		return false;
> +	ret = LRU_ROTATE;
> +	if (xfs_ipincount(ip)) {
> +		ra->dirty_skipped++;

Hmm.. didn't we have an LSN check here?

Altogether, I think the logic in this function would be a lot more
simple if we had something like the following:

	...
	/* ret == LRU_SKIP */
	if (!xfs_inode_clean(ip)) {
		ret = LRU_ROTATE;
		lsn = ip->i_itemp->ili_item.li_lsn;
		ra->dirty_skipped++;
	}
	if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
		ra->lowest_lsn = lsn;
	return ret;

... as the non-reclaim exit path. Then the earlier logic simply dictates
how we process the inode instead of conflating lru processing with
lsn/dirty checks. Otherwise for example (based on the current logic),
it's not really clear to me whether ->dirty_skipped cares about dirty
inodes or just the fact that we skipped an inode.

> +		goto out_ifunlock;
> +	}
> +	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;
> +		ra->dirty_skipped++;
> +		goto out_ifunlock;
>  	}
> +
...
> @@ -1165,167 +1108,52 @@ xfs_reclaim_inode(
...
>  void
>  xfs_reclaim_all_inodes(
>  	struct xfs_mount	*mp)
>  {
...
> +	while (list_lru_count(&mp->m_inode_lru)) {

It seems unnecessary to call this twice per-iter:

	while ((to_free = list_lru_count(&mp->m_inode_lru))) {
		...
	}

Hm?

Brian

> +		struct xfs_ireclaim_args ra;
> +		long freed, to_free;
> +
> +		xfs_ireclaim_args_init(&ra);
> +
> +		to_free = list_lru_count(&mp->m_inode_lru);
> +		freed = list_lru_walk(&mp->m_inode_lru,
> +				      xfs_inode_reclaim_isolate, &ra, to_free);
> +		xfs_dispose_inodes(&ra.freeable);
> +
> +		if (freed == 0) {
> +			xfs_log_force(mp, XFS_LOG_SYNC);
> +			xfs_ail_push_all(mp->m_ail);
> +		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
> +			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
> +		}
> +		cond_resched();
> +	}
>  }
>  
>  STATIC int
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index afd692b06c13..86e858e4a281 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -49,8 +49,24 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
>  struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
>  void xfs_inode_free(struct xfs_inode *ip);
>  
> +struct xfs_ireclaim_args {
> +	struct list_head	freeable;
> +	xfs_lsn_t		lowest_lsn;
> +	unsigned long		dirty_skipped;
> +};
> +
> +static inline void
> +xfs_ireclaim_args_init(struct xfs_ireclaim_args *ra)
> +{
> +	INIT_LIST_HEAD(&ra->freeable);
> +	ra->lowest_lsn = NULLCOMMITLSN;
> +	ra->dirty_skipped = 0;
> +}
> +
> +enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
> +		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg);
> +void xfs_dispose_inodes(struct list_head *freeable);
>  void xfs_reclaim_all_inodes(struct xfs_mount *mp);
> -long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
>  
>  void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index bcfb35a9c5ca..00145debf820 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -270,6 +270,15 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
>  
>  extern void __xfs_iflock(struct xfs_inode *ip);
>  
> +static inline int __xfs_iflock_nowait(struct xfs_inode *ip)
> +{
> +	lockdep_assert_held(&ip->i_flags_lock);
> +	if (ip->i_flags & XFS_IFLOCK)
> +		return false;
> +	ip->i_flags |= XFS_IFLOCK;
> +	return true;
> +}
> +
>  static inline int xfs_iflock_nowait(struct xfs_inode *ip)
>  {
>  	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
> @@ -281,6 +290,15 @@ static inline void xfs_iflock(struct xfs_inode *ip)
>  		__xfs_iflock(ip);
>  }
>  
> +static inline void __xfs_ifunlock(struct xfs_inode *ip)
> +{
> +	lockdep_assert_held(&ip->i_flags_lock);
> +	ASSERT(ip->i_flags & XFS_IFLOCK);
> +	ip->i_flags &= ~XFS_IFLOCK;
> +	smp_mb();
> +	wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
> +}
> +
>  static inline void xfs_ifunlock(struct xfs_inode *ip)
>  {
>  	ASSERT(xfs_isiflocked(ip));
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 98ffbe42f8ae..096ae31b5436 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -17,6 +17,7 @@
>  #include "xfs_alloc.h"
>  #include "xfs_fsops.h"
>  #include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_log.h"
>  #include "xfs_log_priv.h"
> @@ -1772,23 +1773,54 @@ xfs_fs_mount(
>  }
>  
>  static long
> -xfs_fs_nr_cached_objects(
> +xfs_fs_free_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> -	/* Paranoia: catch incorrect calls during mount setup or teardown */
> -	if (WARN_ON_ONCE(!sb->s_fs_info))
> -		return 0;
> +	struct xfs_mount	*mp = XFS_M(sb);
> +	struct xfs_ireclaim_args ra;
> +	long			freed;
>  
> -	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> +	xfs_ireclaim_args_init(&ra);
> +
> +	freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> +					xfs_inode_reclaim_isolate, &ra);
> +	xfs_dispose_inodes(&ra.freeable);
> +
> +	/*
> +	 * Deal with dirty inodes. We will have the LSN of
> +	 * the oldest dirty inode in our reclaim args if we skipped any.
> +	 *
> +	 * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than
> +	 * we freed) then we need kswapd to back off once its scan has been
> +	 * completed. That way it will have some clean inodes once it comes back
> +	 * and can make progress, but make sure we have inode cleaning in
> +	 * progress.
> +	 *
> +	 * Direct reclaim will be throttled by the caller as it winds the
> +	 * priority up. All we need to do is keep pushing on dirty inodes
> +	 * in the background so when we come back progress will be made.
> +	 */
> +	if (current_is_kswapd() && ra.dirty_skipped >= freed) {
> +		if (current->reclaim_state)
> +			current->reclaim_state->need_backoff = true;
> +	}
> +	if (ra.lowest_lsn != NULLCOMMITLSN)
> +		xfs_ail_push(mp->m_ail, ra.lowest_lsn);
> +
> +	return freed;
>  }
>  
>  static long
> -xfs_fs_free_cached_objects(
> +xfs_fs_nr_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> -	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
> +	/* Paranoia: catch incorrect calls during mount setup or teardown */
> +	if (WARN_ON_ONCE(!sb->s_fs_info))
> +		return 0;
> +
> +	return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
>  }
>  
>  static const struct super_operations xfs_super_operations = {
> -- 
> 2.24.0.rc0
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 25/28] xfs: remove unusued old inode reclaim code
  2019-10-31 23:46 ` [PATCH 25/28] xfs: remove unusued old inode reclaim code Dave Chinner
@ 2019-11-06 17:21   ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-06 17:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:15AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that the custom AG radix tree walker has been replaced and
> removed, we don't need the radix tree tags anymore, nor the reclaim
> cursors or the locks that protect them. Remove all remaining traces of
> these things.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_icache.c | 82 +--------------------------------------------
>  fs/xfs/xfs_icache.h |  7 ++--
>  fs/xfs/xfs_mount.c  |  4 ---
>  fs/xfs/xfs_mount.h  |  3 --
>  fs/xfs/xfs_super.c  |  5 +--
>  5 files changed, 6 insertions(+), 95 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 05dd292bfdb6..71a729e29260 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -139,83 +139,6 @@ xfs_inode_free(
>  	__xfs_inode_free(ip);
>  }
>  
> -static void
> -xfs_perag_set_reclaim_tag(
> -	struct xfs_perag	*pag)
> -{
> -	struct xfs_mount	*mp = pag->pag_mount;
> -
> -	lockdep_assert_held(&pag->pag_ici_lock);
> -	if (pag->pag_ici_reclaimable++)
> -		return;
> -
> -	/* propagate the reclaim tag up into the perag radix tree */
> -	spin_lock(&mp->m_perag_lock);
> -	radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno,
> -			   XFS_ICI_RECLAIM_TAG);
> -	spin_unlock(&mp->m_perag_lock);
> -
> -	trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
> -}
> -
> -static void
> -xfs_perag_clear_reclaim_tag(
> -	struct xfs_perag	*pag)
> -{
> -	struct xfs_mount	*mp = pag->pag_mount;
> -
> -	lockdep_assert_held(&pag->pag_ici_lock);
> -	if (--pag->pag_ici_reclaimable)
> -		return;
> -
> -	/* clear the reclaim tag from the perag radix tree */
> -	spin_lock(&mp->m_perag_lock);
> -	radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno,
> -			     XFS_ICI_RECLAIM_TAG);
> -	spin_unlock(&mp->m_perag_lock);
> -	trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
> -}
> -
> -
> -/*
> - * We set the inode flag atomically with the radix tree tag.
> - * Once we get tag lookups on the radix tree, this inode flag
> - * can go away.
> - */
> -void
> -xfs_inode_set_reclaim_tag(
> -	struct xfs_inode	*ip)
> -{
> -	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_perag	*pag;
> -
> -	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> -	spin_lock(&pag->pag_ici_lock);
> -	spin_lock(&ip->i_flags_lock);
> -
> -	radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			   XFS_ICI_RECLAIM_TAG);
> -	xfs_perag_set_reclaim_tag(pag);
> -	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> -
> -	list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
> -
> -	spin_unlock(&ip->i_flags_lock);
> -	spin_unlock(&pag->pag_ici_lock);
> -	xfs_perag_put(pag);
> -}
> -
> -STATIC void
> -xfs_inode_clear_reclaim_tag(
> -	struct xfs_perag	*pag,
> -	xfs_ino_t		ino)
> -{
> -	radix_tree_tag_clear(&pag->pag_ici_root,
> -			     XFS_INO_TO_AGINO(pag->pag_mount, ino),
> -			     XFS_ICI_RECLAIM_TAG);
> -	xfs_perag_clear_reclaim_tag(pag);
> -}
> -
>  static void
>  xfs_inew_wait(
>  	struct xfs_inode	*ip)
> @@ -397,18 +320,16 @@ xfs_iget_cache_hit(
>  			goto out_error;
>  		}
>  
> -		spin_lock(&pag->pag_ici_lock);
> -		spin_lock(&ip->i_flags_lock);
>  
>  		/*
>  		 * Clear the per-lifetime state in the inode as we are now
>  		 * effectively a new inode and need to return to the initial
>  		 * state before reuse occurs.
>  		 */
> +		spin_lock(&ip->i_flags_lock);
>  		ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
>  		ip->i_flags |= XFS_INEW;
>  		list_lru_del(&mp->m_inode_lru, &inode->i_lru);
> -		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
>  		inode->i_state = I_NEW;
>  		ip->i_sick = 0;
>  		ip->i_checked = 0;
> @@ -417,7 +338,6 @@ xfs_iget_cache_hit(
>  		init_rwsem(&inode->i_rwsem);
>  
>  		spin_unlock(&ip->i_flags_lock);
> -		spin_unlock(&pag->pag_ici_lock);
>  	} else {
>  		/* If the VFS inode is being torn down, pause and try again. */
>  		if (!igrab(inode)) {
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index 86e858e4a281..ec646b9e88b7 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -25,9 +25,8 @@ struct xfs_eofblocks {
>   */
>  #define XFS_ICI_NO_TAG		(-1)	/* special flag for an untagged lookup
>  					   in xfs_inode_ag_iterator */
> -#define XFS_ICI_RECLAIM_TAG	0	/* inode is to be reclaimed */
> -#define XFS_ICI_EOFBLOCKS_TAG	1	/* inode has blocks beyond EOF */
> -#define XFS_ICI_COWBLOCKS_TAG	2	/* inode can have cow blocks to gc */
> +#define XFS_ICI_EOFBLOCKS_TAG	0	/* inode has blocks beyond EOF */
> +#define XFS_ICI_COWBLOCKS_TAG	1	/* inode can have cow blocks to gc */
>  
>  /*
>   * Flags for xfs_iget()
> @@ -68,8 +67,6 @@ enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
>  void xfs_dispose_inodes(struct list_head *freeable);
>  void xfs_reclaim_all_inodes(struct xfs_mount *mp);
>  
> -void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
> -
>  void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip);
>  void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip);
>  int xfs_icache_free_eofblocks(struct xfs_mount *, struct xfs_eofblocks *);
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 5f3fd1d8f63f..9d60a4e033a0 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -148,7 +148,6 @@ xfs_free_perag(
>  		ASSERT(atomic_read(&pag->pag_ref) == 0);
>  		xfs_iunlink_destroy(pag);
>  		xfs_buf_hash_destroy(pag);
> -		mutex_destroy(&pag->pag_ici_reclaim_lock);
>  		call_rcu(&pag->rcu_head, __xfs_free_perag);
>  	}
>  }
> @@ -200,7 +199,6 @@ xfs_initialize_perag(
>  		pag->pag_agno = index;
>  		pag->pag_mount = mp;
>  		spin_lock_init(&pag->pag_ici_lock);
> -		mutex_init(&pag->pag_ici_reclaim_lock);
>  		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
>  		if (xfs_buf_hash_init(pag))
>  			goto out_free_pag;
> @@ -242,7 +240,6 @@ xfs_initialize_perag(
>  out_hash_destroy:
>  	xfs_buf_hash_destroy(pag);
>  out_free_pag:
> -	mutex_destroy(&pag->pag_ici_reclaim_lock);
>  	kmem_free(pag);
>  out_unwind_new_pags:
>  	/* unwind any prior newly initialized pags */
> @@ -252,7 +249,6 @@ xfs_initialize_perag(
>  			break;
>  		xfs_buf_hash_destroy(pag);
>  		xfs_iunlink_destroy(pag);
> -		mutex_destroy(&pag->pag_ici_reclaim_lock);
>  		kmem_free(pag);
>  	}
>  	return error;
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 4f153ee17e18..dea05cd867bf 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -343,9 +343,6 @@ typedef struct xfs_perag {
>  
>  	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
>  	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
> -	int		pag_ici_reclaimable;	/* reclaimable inodes */
> -	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
> -	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
>  
>  	/* buffer cache index */
>  	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 096ae31b5436..d2200fbce139 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -916,7 +916,6 @@ xfs_fs_destroy_inode(
>  	spin_lock(&ip->i_flags_lock);
>  	ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
>  	ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIM));
> -	spin_unlock(&ip->i_flags_lock);
>  
>  	/*
>  	 * We always use background reclaim here because even if the
> @@ -925,7 +924,9 @@ xfs_fs_destroy_inode(
>  	 * this more efficiently than we can here, so simply let background
>  	 * reclaim tear down all inodes.
>  	 */
> -	xfs_inode_set_reclaim_tag(ip);
> +	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
> +	list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
> +	spin_unlock(&ip->i_flags_lock);
>  }
>  
>  static void
> -- 
> 2.24.0.rc0
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes
  2019-10-31 23:46 ` [PATCH 26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes Dave Chinner
@ 2019-11-06 17:22   ` Brian Foster
  2019-11-14 21:53     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-06 17:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:16AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we are reclaiming all inodes, it is likely we need to flush the
> entire AIL to do that. We have mechanisms to do that without needing
> to push to a specific LSN.
> 
> Convert xfs_reclaim_all_inodes() to use the xfs_ail_push_all variant so
> we can get rid of the hacky xfs_ail_push_sync() scaffolding we used
> to support the intermediate stages of the non-blocking reclaim
> changeset.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c     | 17 +++++++++++------
>  fs/xfs/xfs_trans_ail.c  | 32 --------------------------------
>  fs/xfs/xfs_trans_priv.h |  2 --
>  3 files changed, 11 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 71a729e29260..11bf4768d491 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
...
> @@ -1066,13 +1074,10 @@ xfs_reclaim_all_inodes(
>  				      xfs_inode_reclaim_isolate, &ra, to_free);
>  		xfs_dispose_inodes(&ra.freeable);
>  
> -		if (freed == 0) {
> +		if (freed == 0)
>  			xfs_log_force(mp, XFS_LOG_SYNC);
> -			xfs_ail_push_all(mp->m_ail);
> -		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
> -			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
> -		}
> -		cond_resched();
> +		else if (ra.dirty_skipped)
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);

Why not use xfs_ail_push_all_sync() in this function and skip the direct
stall? This is only used in the unmount and quiesce paths so the big
hammer approach seems reasonable. As it is, the former already calls
xfs_ail_push_all_sync() before xfs_reclaim_all_inodes() and the latter
calls xfs_log_force(mp, XFS_LOG_SYNC).

Brian

>  	}
>  }
>  
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 3e1d0e1439e2..685a21cd24c0 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -662,36 +662,6 @@ xfs_ail_push_all(
>  		xfs_ail_push(ailp, threshold_lsn);
>  }
>  
> -/*
> - * Push the AIL to a specific lsn and wait for it to complete.
> - */
> -void
> -xfs_ail_push_sync(
> -	struct xfs_ail		*ailp,
> -	xfs_lsn_t		threshold_lsn)
> -{
> -	struct xfs_log_item	*lip;
> -	DEFINE_WAIT(wait);
> -
> -	spin_lock(&ailp->ail_lock);
> -	while ((lip = xfs_ail_min(ailp)) != NULL) {
> -		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> -		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> -		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
> -			break;
> -		if (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
> -			ailp->ail_target = threshold_lsn;
> -		wake_up_process(ailp->ail_task);
> -		spin_unlock(&ailp->ail_lock);
> -		schedule();
> -		spin_lock(&ailp->ail_lock);
> -	}
> -	spin_unlock(&ailp->ail_lock);
> -
> -	finish_wait(&ailp->ail_push, &wait);
> -}
> -
> -
>  /*
>   * Push out all items in the AIL immediately and wait until the AIL is empty.
>   */
> @@ -732,7 +702,6 @@ xfs_ail_update_finish(
>  	if (!XFS_FORCED_SHUTDOWN(mp))
>  		xlog_assign_tail_lsn_locked(mp);
>  
> -	wake_up_all(&ailp->ail_push);
>  	if (list_empty(&ailp->ail_head))
>  		wake_up_all(&ailp->ail_empty);
>  	spin_unlock(&ailp->ail_lock);
> @@ -889,7 +858,6 @@ xfs_trans_ail_init(
>  	spin_lock_init(&ailp->ail_lock);
>  	INIT_LIST_HEAD(&ailp->ail_buf_list);
>  	init_waitqueue_head(&ailp->ail_empty);
> -	init_waitqueue_head(&ailp->ail_push);
>  
>  	ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
>  			ailp->ail_mount->m_fsname);
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 1b6f4bbd47c0..35655eac01a6 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -61,7 +61,6 @@ struct xfs_ail {
>  	int			ail_log_flush;
>  	struct list_head	ail_buf_list;
>  	wait_queue_head_t	ail_empty;
> -	wait_queue_head_t	ail_push;
>  };
>  
>  /*
> @@ -114,7 +113,6 @@ xfs_trans_ail_remove(
>  }
>  
>  void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
> -void			xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t);
>  void			xfs_ail_push_all(struct xfs_ail *);
>  void			xfs_ail_push_all_sync(struct xfs_ail *);
>  struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
> -- 
> 2.24.0.rc0
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-10-31 23:46 ` [PATCH 28/28] xfs: rework unreferenced inode lookups Dave Chinner
@ 2019-11-06 22:18   ` Brian Foster
  2019-11-14 22:16     ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-06 22:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Looking up an unreferenced inode in the inode cache is a bit hairy.
> We do this for inode invalidation and writeback clustering purposes,
> which is all invisible to the VFS. Hence we can't take reference
> counts to the inode and so must be very careful how we do it.
> 
> There are several different places that all do the lookups and
> checks slightly differently. Fundamentally, though, they are all
> racy and inode reclaim has to block waiting for the inode lock if it
> loses the race. This is not very optimal given all the work we've
> already done to make reclaim non-blocking.
> 
> We can make the reclaim process nonblocking with a couple of simple
> changes. If we define the unreferenced lookup process in a way that
> will either always grab an inode in a way that reclaim will notice
> and skip, or will notice a reclaim has grabbed the inode so it can
> skip the inode, then there is no need for reclaim to cycle
> the inode ILOCK at all.
> 
> Selecting an inode for reclaim is already non-blocking, so if the
> ILOCK is held the inode will be skipped. If we ensure that reclaim
> holds the ILOCK until the inode is freed, then we can do the same
> thing in the unreferenced lookup to avoid inodes in reclaim. We can
> do this simply by holding the ILOCK until the RCU grace period
> expires and the inode freeing callback is run. As all unreferenced
> lookups have to hold the rcu_read_lock(), we are guaranteed that
> a reclaimed inode will be noticed as the trylock will fail.
> 
...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/mrlock.h     |  27 +++++++++
>  fs/xfs/xfs_icache.c |  88 +++++++++++++++++++++--------
>  fs/xfs/xfs_inode.c  | 131 +++++++++++++++++++++-----------------------
>  3 files changed, 153 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h
> index 79155eec341b..1752a2592bcc 100644
> --- a/fs/xfs/mrlock.h
> +++ b/fs/xfs/mrlock.h
...
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 11bf4768d491..45ee3b5cd873 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -106,6 +106,7 @@ xfs_inode_free_callback(
>  		ip->i_itemp = NULL;
>  	}
>  
> +	mrunlock_excl_non_owner(&ip->i_lock);
>  	kmem_zone_free(xfs_inode_zone, ip);
>  }
>  
> @@ -132,6 +133,7 @@ xfs_inode_free(
>  	 * free state. The ip->i_flags_lock provides the barrier against lookup
>  	 * races.
>  	 */
> +	mrupdate_non_owner(&ip->i_lock);

Can we tie these into the proper locking interface using flags? For
example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER)
or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps?
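
A rough userspace sketch of the flag-based variant suggested above. This is
not kernel code: the toy_* names and TOY_ILOCK_NONOWNER are hypothetical
stand-ins, the lock is a pair of booleans rather than a real rwsem, and
"owner_tracked" only models whether owner tracking would apply. The only
point is that the caller keeps using one entry point and the non-owner
behaviour is selected by a modifier flag.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ILOCK_EXCL		(1u << 0)	/* models XFS_ILOCK_EXCL */
#define TOY_ILOCK_NONOWNER	(1u << 1)	/* hypothetical modifier flag */

/* Toy lock state: no real locking, just enough to show the dispatch. */
struct toy_mrlock {
	bool	held_excl;	/* lock is held exclusively */
	bool	owner_tracked;	/* locking task recorded as the owner */
};

/* Models mrupdate(): exclusive lock, owner tracked. */
static void toy_mrupdate(struct toy_mrlock *l)
{
	l->held_excl = true;
	l->owner_tracked = true;
}

/* Models mrupdate_non_owner(): exclusive lock, no owner recorded. */
static void toy_mrupdate_non_owner(struct toy_mrlock *l)
{
	l->held_excl = true;
	l->owner_tracked = false;
}

/* Single entry point: the modifier flag picks the primitive. */
static void toy_ilock(struct toy_mrlock *l, unsigned int flags)
{
	if (!(flags & TOY_ILOCK_EXCL))
		return;
	if (flags & TOY_ILOCK_NONOWNER)
		toy_mrupdate_non_owner(l);
	else
		toy_mrupdate(l);
}
```

With this shape, xfs_inode_free() would pass the extra flag and every other
caller would stay unchanged.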

>  	spin_lock(&ip->i_flags_lock);
>  	ip->i_flags = XFS_IRECLAIM;
>  	ip->i_ino = 0;
> @@ -295,11 +297,24 @@ xfs_iget_cache_hit(
>  		}
>  
>  		/*
> -		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
> -		 * from stomping over us while we recycle the inode. Remove it
> -		 * from the LRU straight away so we can re-init the VFS inode.
> +		 * Before we reinitialise the inode, we need to make sure
> +		 * reclaim does not pull it out from underneath us. We already
> +		 * hold the i_flags_lock, and because the XFS_IRECLAIM is not
> +		 * set we know the inode is still on the LRU. However, the LRU
> +		 * code may have just selected this inode to reclaim, so we need
> +		 * to ensure we hold the i_flags_lock long enough for the
> +		 * trylock in xfs_inode_reclaim_isolate() to fail. We do this by
> +		 * removing the inode from the LRU, which will spin on the LRU
> +		 * list locks until reclaim stops walking, at which point we
> +		 * know there is no possible race between reclaim isolation and
> +		 * this lookup.
> +		 *

Somewhat related to my question about the lru_lock on the earlier patch.

> +		 * We also set the XFS_IRECLAIM flag here while trying to do the
> +		 * re-initialisation to prevent multiple racing lookups on this
> +		 * inode from all landing here at the same time.
>  		 */
>  		ip->i_flags |= XFS_IRECLAIM;
> +		list_lru_del(&mp->m_inode_lru, &inode->i_lru);
>  		spin_unlock(&ip->i_flags_lock);
>  		rcu_read_unlock();
>  
...
> @@ -1022,19 +1076,7 @@ xfs_dispose_inode(
>  	spin_unlock(&pag->pag_ici_lock);
>  	xfs_perag_put(pag);
>  
> -	/*
> -	 * Here we do an (almost) spurious inode lock in order to coordinate
> -	 * with inode cache radix tree lookups.  This is because the lookup
> -	 * can reference the inodes in the cache without taking references.
> -	 *
> -	 * We make that OK here by ensuring that we wait until the inode is
> -	 * unlocked after the lookup before we go ahead and free it.
> -	 *
> -	 * XXX: need to check this is still true. Not sure it is.
> -	 */
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_qm_dqdetach(ip);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);

Ok, so I'm staring at this a bit more and think I'm missing something.
If we put aside the change to hold ilock until the inode is freed, we
basically have the following (simplified) flow as the inode goes from
isolation to disposal:

	ilock	(isolate)
	iflock
	set XFS_IRECLAIM
	ifunlock (disposal)
	iunlock
	radix delete
	ilock cycle (drain)
	rcu free

What we're trying to eliminate is the ilock cycle to drain any
concurrent unreferenced lookups from accessing the inode once it is
freed. The free itself is still RCU protected.

Looking over at the ifree path, we now have something like this:

	rcu_read_lock()
	radix lookup
	check XFS_IRECLAIM
	ilock
	if XFS_ISTALE, skip
	set XFS_ISTALE
	rcu_read_unlock()
	iflock
	/* return locked down inode */

Given that we set XFS_IRECLAIM under ilock, would we still need either
the ilock cycle or to hold ilock through the RCU free if the ifree side
(re)checked XFS_IRECLAIM after it has the ilock (but before it drops the
rcu read lock)? ISTM we should either have a non-reclaim inode with
ilock protection or a reclaim inode with RCU protection (so we can skip
it before it frees), but I could easily be missing something here..

>  
>  	__xfs_inode_free(ip);
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 33edb18098ca..5c0be82195fc 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2538,60 +2538,63 @@ xfs_ifree_get_one_inode(
>  	if (!ip)
>  		goto out_rcu_unlock;
>  
> +

Extra whitespace here.

> +	spin_lock(&ip->i_flags_lock);
> +	if (!ip->i_ino || ip->i_ino != inum ||
> +	    __xfs_iflags_test(ip, XFS_IRECLAIM))
> +		goto out_iflags_unlock;
> +
>  	/*
> -	 * because this is an RCU protected lookup, we could find a recently
> -	 * freed or even reallocated inode during the lookup. We need to check
> -	 * under the i_flags_lock for a valid inode here. Skip it if it is not
> -	 * valid, the wrong inode or stale.
> +	 * We've got the right inode and it isn't in reclaim but it might be
> +	 * locked by someone else.  In that case, we retry the inode rather than
> +	 * skipping it completely as we have to process it with the cluster
> +	 * being freed.
>  	 */
> -	spin_lock(&ip->i_flags_lock);
> -	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
> +	if (ip != free_ip && !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
>  		spin_unlock(&ip->i_flags_lock);
> -		goto out_rcu_unlock;
> +		rcu_read_unlock();
> +		delay(1);
> +		goto retry;
>  	}
> -	spin_unlock(&ip->i_flags_lock);
>  
>  	/*
> -	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
> -	 * other inodes that we did not find in the list attached to the buffer
> -	 * and are not already marked stale. If we can't lock it, back off and
> -	 * retry.
> +	 * Inode is now pinned against reclaim until we unlock it. If the inode
> +	 * is already marked stale, then it has already been flush locked and
> +	 * attached to the buffer so we don't need to do anything more here.
>  	 */
> -	if (ip != free_ip) {
> -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> -			rcu_read_unlock();
> -			delay(1);
> -			goto retry;
> -		}
> -
> -		/*
> -		 * Check the inode number again in case we're racing with
> -		 * freeing in xfs_reclaim_inode().  See the comments in that
> -		 * function for more information as to why the initial check is
> -		 * not sufficient.
> -		 */
> -		if (ip->i_ino != inum) {
> +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {

Is there a correctness reason for why we move the stale check to under
ilock (in both iflush/ifree)?

> +		if (ip != free_ip)
>  			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -			goto out_rcu_unlock;
> -		}
> +		goto out_iflags_unlock;
>  	}
> +	__xfs_iflags_set(ip, XFS_ISTALE);
> +	spin_unlock(&ip->i_flags_lock);
>  	rcu_read_unlock();
>  
> +	/*
> +	 * The flush lock will now hold off inode reclaim until the buffer
> +	 * completion routine runs the xfs_istale_done callback and unlocks the
> +	 * flush lock. Hence the caller can safely drop the ILOCK when it is
> +	 * done attaching the inode to the cluster buffer.
> +	 */
>  	xfs_iflock(ip);
> -	xfs_iflags_set(ip, XFS_ISTALE);
>  
>  	/*
> -	 * We don't need to attach clean inodes or those only with unlogged
> -	 * changes (which we throw away, anyway).
> +	 * We don't need to attach clean inodes to the buffer - they are marked
> +	 * stale in memory now and will need to be re-initialised by inode
> +	 * allocation before they can be reused.
>  	 */
>  	if (!ip->i_itemp || xfs_inode_clean(ip)) {
>  		ASSERT(ip != free_ip);
>  		xfs_ifunlock(ip);
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +		if (ip != free_ip)
> +			xfs_iunlock(ip, XFS_ILOCK_EXCL);

There's an assert against this case just above, though I suppose there's
nothing wrong with just keeping it and making the functional code more
cautious.

Brian

>  		goto out_no_inode;
>  	}
>  	return ip;
>  
> +out_iflags_unlock:
> +	spin_unlock(&ip->i_flags_lock);
>  out_rcu_unlock:
>  	rcu_read_unlock();
>  out_no_inode:
> @@ -3519,44 +3522,40 @@ xfs_iflush_cluster(
>  			continue;
>  
>  		/*
> -		 * because this is an RCU protected lookup, we could find a
> -		 * recently freed or even reallocated inode during the lookup.
> -		 * We need to check under the i_flags_lock for a valid inode
> -		 * here. Skip it if it is not valid or the wrong inode.
> +		 * See xfs_dispose_inode() for an explanation of the
> +		 * tests here to avoid inode reclaim races.
>  		 */
>  		spin_lock(&cip->i_flags_lock);
>  		if (!cip->i_ino ||
> -		    __xfs_iflags_test(cip, XFS_ISTALE)) {
> +		    __xfs_iflags_test(cip, XFS_IRECLAIM)) {
>  			spin_unlock(&cip->i_flags_lock);
>  			continue;
>  		}
>  
> -		/*
> -		 * Once we fall off the end of the cluster, no point checking
> -		 * any more inodes in the list because they will also all be
> -		 * outside the cluster.
> -		 */
> +		/* ILOCK will pin the inode against reclaim */
> +		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) {
> +			spin_unlock(&cip->i_flags_lock);
> +			continue;
> +		}
> +
> +		if (__xfs_iflags_test(cip, XFS_ISTALE)) {
> +			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +			spin_unlock(&cip->i_flags_lock);
> +			continue;
> +		}
> +
> +		/* Lookup can find inodes outside the cluster being flushed. */
>  		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> +			xfs_iunlock(cip, XFS_ILOCK_SHARED);
>  			spin_unlock(&cip->i_flags_lock);
>  			break;
>  		}
>  		spin_unlock(&cip->i_flags_lock);
>  
>  		/*
> -		 * Do an un-protected check to see if the inode is dirty and
> -		 * is a candidate for flushing.  These checks will be repeated
> -		 * later after the appropriate locks are acquired.
> -		 */
> -		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
> -			continue;
> -
> -		/*
> -		 * Try to get locks.  If any are unavailable or it is pinned,
> +		 * If we can't get the flush lock now or the inode is pinned,
>  		 * then this inode cannot be flushed and is skipped.
>  		 */
> -
> -		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
> -			continue;
>  		if (!xfs_iflock_nowait(cip)) {
>  			xfs_iunlock(cip, XFS_ILOCK_SHARED);
>  			continue;
> @@ -3567,22 +3566,9 @@ xfs_iflush_cluster(
>  			continue;
>  		}
>  
> -
>  		/*
> -		 * Check the inode number again, just to be certain we are not
> -		 * racing with freeing in xfs_reclaim_inode(). See the comments
> -		 * in that function for more information as to why the initial
> -		 * check is not sufficient.
> -		 */
> -		if (!cip->i_ino) {
> -			xfs_ifunlock(cip);
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -			continue;
> -		}
> -
> -		/*
> -		 * arriving here means that this inode can be flushed.  First
> -		 * re-check that it's dirty before flushing.
> +		 * Arriving here means that this inode can be flushed. First
> +		 * check that it's dirty before flushing.
>  		 */
>  		if (!xfs_inode_clean(cip)) {
>  			int	error;
> @@ -3596,6 +3582,7 @@ xfs_iflush_cluster(
>  			xfs_ifunlock(cip);
>  		}
>  		xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +		/* unsafe to reference cip from here */
>  	}
>  
>  	if (clcount) {
> @@ -3634,7 +3621,11 @@ xfs_iflush_cluster(
>  
>  	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  
> -	/* abort the corrupt inode, as it was not attached to the buffer */
> +	/*
> +	 * Abort the corrupt inode, as it was not attached to the buffer. It is
> +	 * unlocked, but still pinned against reclaim by the flush lock so it is
> +	 * safe to reference here until after the flush abort completes.
> +	 */
>  	xfs_iflush_abort(cip, false);
>  	kmem_free(cilist);
>  	xfs_perag_put(pag);
> -- 
> 2.24.0.rc0
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/28] mm: directed shrinker work deferral
  2019-11-04 15:25   ` Brian Foster
@ 2019-11-14 20:49     ` Dave Chinner
  2019-11-15 17:21       ` Brian Foster
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-11-14 20:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Mon, Nov 04, 2019 at 10:25:25AM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Introduce a mechanism for ->count_objects() to indicate to the
> > shrinker infrastructure that the reclaim context will not allow
> > scanning work to be done and so the work it decides is necessary
> > needs to be deferred.
> > 
> > This simplifies the code by separating out the accounting of
> > deferred work from the actual doing of the work, and allows better
> > decisions to be made by the shrinker control logic on what action it
> > can take.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> My understanding from the previous discussion(s) is that this is not
> tied directly to the gfp mask because that is not the only intended use.
> While it is currently a boolean tied to the entire shrinker call,
> the longer term objective is per-object granularity.

Longer term, yes, but right now such things are not possible as the
shrinker needs more context to be able to make sane per-object
decisions. Shrinker policy decisions that affect the entire run
scope should be handled by the ->count operation - it's the one that
says whether the scan loop should run or not, and right now GFP_NOFS
for all filesystem shrinkers is a pure boolean policy
implementation.

The next future step is to provide a superblock context with
GFP_NOFS to indicate which filesystem we cannot recurse into. That
is also a shrinker instance wide check, so again it's something that
->count should be deciding.

i.e. ->count determines what is to be done, ->scan iterates the work
that has to be done until we are done.
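
The split can be modelled in userspace roughly as follows. This is a toy
sketch under stated assumptions, not the mm/vmscan.c code: only the
defer_work field comes from this patch, while the toy_* types, the "nofs"
boolean standing in for a GFP_NOFS context and the object-count deferral
accounting are all made up for illustration. It shows the run/no-run
decision being evaluated once in ->count, with ->scan doing nothing but
iterating.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct shrink_control; only the fields used here. */
struct toy_shrink_control {
	bool		nofs;		/* models a GFP_NOFS reclaim context */
	unsigned long	nr_to_scan;	/* batch size for one ->scan call */
	bool		defer_work;	/* set by ->count: do not scan now */
};

/* Toy cache with just enough state to observe the behaviour. */
struct toy_cache {
	unsigned long	nr_objects;	/* freeable objects in the cache */
	unsigned long	nr_deferred;	/* work carried over to a later pass */
	unsigned long	policy_checks;	/* times the context was evaluated */
};

/* ->count: decide once per invocation whether scanning may run at all. */
static unsigned long toy_count(struct toy_cache *c,
			       struct toy_shrink_control *sc)
{
	c->policy_checks++;
	if (sc->nofs)
		sc->defer_work = true;
	return c->nr_objects;
}

/* ->scan: pure iteration, no policy. Frees up to nr_to_scan objects. */
static unsigned long toy_scan(struct toy_cache *c,
			      struct toy_shrink_control *sc)
{
	unsigned long n = sc->nr_to_scan;

	if (n > c->nr_objects)
		n = c->nr_objects;
	c->nr_objects -= n;
	return n;
}

/* Control loop: count once, then either defer everything or scan. */
static unsigned long toy_do_shrink(struct toy_cache *c,
				   struct toy_shrink_control *sc,
				   unsigned long batch)
{
	unsigned long todo, freed = 0;

	sc->defer_work = false;
	todo = toy_count(c, sc);
	if (sc->defer_work) {
		c->nr_deferred += todo;	/* account the work, do none of it */
		return 0;
	}

	sc->nr_to_scan = batch;
	while (todo > 0) {
		unsigned long n = toy_scan(c, sc);

		if (n == 0)
			break;
		freed += n;
		todo -= n;
	}
	return freed;
}
```

However many batches the scan loop runs, policy_checks goes up by exactly
one per toy_do_shrink() call, which is the redundancy argument in a
nutshell.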

> I find the argument reasonable enough, but if the above is true, why do
> we move these checks from ->scan_objects() to ->count_objects() (in the
> next patch) when per-object decisions will ultimately need to be made by
> the former?

Because run/no-run policy belongs in one place, and things like
GFP_NOFS do not change across calls to the ->scan loop. i.e. after
the first ->scan call in a loop that calls it hundreds to thousands
of times, the GFP_NOFS run/no-run check is completely redundant.
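
As a rough illustration of that split (a userspace sketch, not code from
this series), the model below evaluates the context policy once in a count
step and leaves the scan step free of context checks. All model_* names are
hypothetical stand-ins, and the nofs field simply stands for a GFP_NOFS
reclaim context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct shrink_control plus some counters. */
struct model_ctl {
	bool nofs;			/* models a GFP_NOFS reclaim context */
	bool defer_work;		/* set by count when scanning must not run */
	unsigned long policy_checks;	/* times the context was evaluated */
	unsigned long scan_calls;	/* scan batches executed */
};

/* Count step: decide once whether the scan loop may run at all. */
static unsigned long model_count(struct model_ctl *ctl, unsigned long cached)
{
	ctl->policy_checks++;
	if (ctl->nofs)
		ctl->defer_work = true;
	return cached;
}

/* Scan step: process one batch, with no context checks of its own. */
static unsigned long model_scan(struct model_ctl *ctl, unsigned long batch)
{
	ctl->scan_calls++;
	return batch;	/* pretend every scanned object was freed */
}

/* Returns objects freed; *deferred receives work left for a later context. */
static unsigned long model_shrink(struct model_ctl *ctl, unsigned long cached,
				  unsigned long batch, unsigned long *deferred)
{
	unsigned long freed = 0;
	unsigned long todo;

	assert(batch > 0);
	ctl->defer_work = false;	/* reset before the callback that may set it */
	todo = model_count(ctl, cached);
	*deferred = 0;
	if (ctl->defer_work) {
		*deferred = todo;
		return 0;
	}
	/* A remainder smaller than one batch is ignored in this model. */
	while (todo >= batch) {
		freed += model_scan(ctl, batch);
		todo -= batch;
	}
	return freed;
}
```

With ten batches to scan, the policy check still runs only once. In the
NOFS case the whole count is handed back as deferred work and the scan step
never runs.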

Once we introduce a new policy that allows the fs shrinker to do
careful reclaim in GFP_NOFS conditions, we need to substantially
rework the shrinker scan loop and how it accounts the work that is
done - we now have at least 3 or 4 different return counters
(skipped because locked, skipped because referenced,
reclaimed, deferred reclaim because couldn't lock/recursion) and
the accounting and decisions to be made are a lot more complex.

In that case, the ->count function will drop the GFP_NOFS check, but
still do all the other things it needs to do. The GFP_NOFS check
will go deep in the guts of the shrinker scan implementation where
the per-object recursion problem exists. But for most shrinkers,
it's still going to be a global boolean check...

> That seems like unnecessary churn and inconsistent with the
> argument against just temporarily doing something like what Christoph
> suggested in the previous version, particularly since IIRC the only use
> in this series was for gfp mask purposes.

If people want to call avoiding repeated, unnecessary evaluation of
the same condition hundreds of times instead of once "unnecessary
churn", then I'll drop it.

> >  include/linux/shrinker.h | 7 +++++++
> >  mm/vmscan.c              | 8 ++++++++
> >  2 files changed, 15 insertions(+)
> > 
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 0f80123650e2..3405c39ab92c 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -31,6 +31,13 @@ struct shrink_control {
> >  
> >  	/* current memcg being shrunk (for memcg aware shrinkers) */
> >  	struct mem_cgroup *memcg;
> > +
> > +	/*
> > +	 * set by ->count_objects if reclaim context prevents reclaim from
> > +	 * occurring. This allows the shrinker to immediately defer all the
> > +	 * work and not even attempt to scan the cache.
> > +	 */
> > +	bool defer_work;
> >  };
> >  
> >  #define SHRINK_STOP (~0UL)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ee4eecc7e1c2..a215d71d9d4b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -536,6 +536,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> >  				   freeable, delta, total_scan, priority);
> >  
> > +	/*
> > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > +	 * defer the work to a context that can scan the cache.
> > +	 */
> > +	if (shrinkctl->defer_work)
> > +		goto done;
> > +
> 
> I still find the fact that this per-shrinker invocation field is never
> reset unnecessarily fragile, and I don't see any good reason not to
> reset it prior to the shrinker callback that potentially sets it.

I missed that when updating. I'll reset it in the next version.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 11/28] mm: factor shrinker work calculations
  2019-11-04 15:29   ` Brian Foster
@ 2019-11-14 20:59     ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-11-14 20:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Mon, Nov 04, 2019 at 10:29:39AM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:46:01AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Start to clean up the shrinker code by factoring out the calculation
> > that determines how much work to do. This separates the calculation
> > from clamping and other adjustments that are done before the
> > shrinker work is run. Document the scan batch size calculation
> > better while we are there.
> > 
> > Also convert the calculation for the amount of work to be done to
> > use 64 bit logic so we don't have to keep jumping through hoops to
> > keep calculations within 32 bits on 32 bit systems.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> I assume the kbuild warning thing will be fixed up...
> 
> >  mm/vmscan.c | 97 ++++++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 70 insertions(+), 27 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a215d71d9d4b..2d39ec37c04d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -459,13 +459,68 @@ EXPORT_SYMBOL(unregister_shrinker);
> >  
> >  #define SHRINK_BATCH 128
> >  
> > +/*
> > + * Calculate the number of new objects to scan this time around. Return
> > + * the work to be done. If there are freeable objects, return that number in
> > + * @freeable_objects.
> > + */
> > +static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
> > +			    struct shrinker *shrinker, int priority,
> > +			    int64_t *freeable_objects)
> > +{
> > +	int64_t delta;
> > +	int64_t freeable;
> > +
> > +	freeable = shrinker->count_objects(shrinker, shrinkctl);
> > +	if (freeable == 0 || freeable == SHRINK_EMPTY)
> > +		return freeable;
> > +
> > +	if (shrinker->seeks) {
> > +		/*
> > +		 * shrinker->seeks is a measure of how much IO is required to
> > +		 * reinstantiate the object in memory. The default value is 2
> > +		 * which is typical for a cold inode requiring a directory read
> > +		 * and an inode read to re-instantiate.
> > +		 *
> > +		 * The scan batch size is defined by the shrinker priority, but
> > +		 * to be able to bias the reclaim we increase the default batch
> > +		 * size by 4. Hence we end up with a scan batch multiplier that
> > +		 * scales like so:
> > +		 *
> > +		 * ->seeks	scan batch multiplier
> > +		 *    1		      4.00x
> > +		 *    2               2.00x
> > +		 *    3               1.33x
> > +		 *    4               1.00x
> > +		 *    8               0.50x
> > +		 *
> > +		 * IOWs, the more seeks it takes to pull the item into cache,
> > +		 * the smaller the reclaim scan batch. Hence we put more reclaim
> > +		 * pressure on caches that are fast to repopulate and to keep a
> > +		 * rough balance between caches that have different costs.
> > +		 */
> > +		delta = freeable >> (priority - 2);
> 
> Does anything prevent priority < 2 here?

Nope. I regularly see priority 1 here when the OOM killer is about
to strike. Doesn't appear to have caused any problems - the scan
counts have all come out correct (i.e. ends up as a >> 0) according
to the tracing, but I'll fix this up to avoid hitting this.
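
One possible shape for such a fix, as a userspace sketch only: clamp the
shift at zero so that priority 0 or 1 cannot produce a negative shift count,
which is undefined behaviour in C. The helper name is hypothetical, and the
division by seeks is an assumption carried over from the removed
"delta *= 4; do_div(delta, shrinker->seeks)" lines rather than something the
quoted hunk shows.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: compute the scan delta with the
 * shift clamped at zero, so priorities below 2 behave like priority 2.
 */
static int64_t model_scan_delta(int64_t freeable, int priority,
				unsigned int seeks)
{
	int shift = priority - 2;

	assert(freeable >= 0 && seeks > 0 && priority >= 0);
	if (shift < 0)
		shift = 0;
	assert(shift < 63);	/* keep the shift count in range for int64_t */
	return (freeable >> shift) / seeks;
}
```

In this model priority 1 and priority 2 give the same result, which matches
the ">> 0" behaviour described above without relying on what a negative
shift happens to do on a given architecture.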

> 
> > -		delta = freeable >> priority;
> > -		delta *= 4;
> > -		do_div(delta, shrinker->seeks);
> > -	} else {
> > -		/*
> > -		 * These objects don't require any IO to create. Trim
> > -		 * them aggressively under memory pressure to keep
> > -		 * them from causing refetches in the IO caches.
> > -		 */
> > -		delta = freeable / 2;
> > -	}
> > -
> > -	total_scan += delta;
> > +	total_scan = nr + scan_count;
> >  	if (total_scan < 0) {
> >  		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> >  		       shrinker->scan_objects, total_scan);
> > -		total_scan = freeable;
> > +		total_scan = scan_count;
> 
> Same question as before: why the change in assignment? freeable was the
> ->count_objects() return value, which is now stored in freeable_objects.

We don't want to try to free the entire cache on a 64-bit integer
overflow. scan_count is the work we calculated we need to do this
shrinker invocation, so if we overflow because of other factors then
we should just do the work we need to do in this scan.
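
A minimal sketch of that fallback, assuming both inputs are non-negative.
The function name is hypothetical. The kernel hunk detects the problem after
the addition by testing for a negative total; portable userspace C has to
test before adding because signed overflow is undefined there.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of "total_scan = nr + scan_count" with the overflow
 * fallback: on overflow, do only the work calculated for this invocation.
 */
static int64_t model_total_scan(int64_t nr_deferred, int64_t scan_count)
{
	assert(nr_deferred >= 0 && scan_count >= 0);
	if (nr_deferred > INT64_MAX - scan_count)
		return scan_count;	/* would overflow: skip the deferred part */
	return nr_deferred + scan_count;
}
```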

> FWIW, the change seems to make sense in that it just factors out the
> deferred count, but it's not clear if it's intentional...

It was intentional.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 12/28] shrinker: defer work only to kswapd
  2019-11-04 15:29   ` Brian Foster
@ 2019-11-14 21:11     ` Dave Chinner
  2019-11-15 17:23       ` Brian Foster
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-11-14 21:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Mon, Nov 04, 2019 at 10:29:54AM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:46:02AM +1100, Dave Chinner wrote:
> > @@ -601,10 +605,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  	 * scanning at high prio and therefore should try to reclaim as much as
> >  	 * possible.
> >  	 */
> > -	while (total_scan >= batch_size ||
> > -	       total_scan >= freeable_objects) {
> > +	while (scan_count >= batch_size ||
> > +	       scan_count >= freeable_objects) {
> >  		unsigned long ret;
> > -		unsigned long nr_to_scan = min(batch_size, total_scan);
> > +		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
> >  
> >  		shrinkctl->nr_to_scan = nr_to_scan;
> >  		shrinkctl->nr_scanned = nr_to_scan;
> > @@ -614,29 +618,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  		freed += ret;
> >  
> >  		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
> > -		total_scan -= shrinkctl->nr_scanned;
> > -		scanned += shrinkctl->nr_scanned;
> > +		scan_count -= shrinkctl->nr_scanned;
> > +		scanned_objects += shrinkctl->nr_scanned;
> >  
> >  		cond_resched();
> >  	}
> > -
> >  done:
> > -	if (next_deferred >= scanned)
> > -		next_deferred -= scanned;
> > +	if (deferred_count)
> > +		next_deferred = deferred_count - scanned_objects;
> >  	else
> > -		next_deferred = 0;
> > +		next_deferred = scan_count;
> 
> Hmm.. so if there was no deferred count on this cycle, we set
> next_deferred to whatever is left from scan_count and add that back into
> the shrinker struct below. If there was a pending deferred count on this
> cycle, we subtract what we scanned from that and add that value back.
> But what happens to the remaining scan_count in the latter case? Is it
> lost, or am I missing something?

If deferred_count is not zero, then it is kswapd that is running. It
does the deferred work, and if it doesn't make progress then adding
its scan count to the deferred work doesn't matter. That's because
it will come back with an increased priority in a short while and
try to scan more of the deferred count plus its larger scan count.

IOWs, if we defer kswapd's unused scan count, we effectively increase
the pressure as the priority goes up, potentially making the
deferred count increase out of control. i.e. kswapd can make
progress and free items, but the result is that it increased the
deferred scan count rather than reducing it. This leads to excessive
reclaim of the slab caches and kswapd can trash the caches long
after the memory pressure has gone away...
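
The rule in the quoted hunk can be written as a small pure function. This is
a sketch that mirrors only the lines shown above; the function name is
hypothetical and the parameter names follow the variables in the diff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the "done:" logic in the quoted hunk.
 *  - deferred_count != 0: kswapd ran the deferred work, so only what is left
 *    of that deferred work carries over; its own unused scan count is dropped.
 *  - deferred_count == 0: the unused scan count for this pass is deferred.
 */
static int64_t model_next_deferred(int64_t deferred_count,
				   int64_t scanned_objects,
				   int64_t scan_count_left)
{
	assert(scanned_objects >= 0 && scan_count_left >= 0);
	if (deferred_count)
		return deferred_count - scanned_objects;
	return scan_count_left;
}
```

For example, with 1000 objects deferred and 300 scanned, 700 carry over no
matter how much of the pass's own scan count went unused, so progress by
kswapd always shrinks the deferred count.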

> For example, suppose we start this cycle with a large scan_count and
> ->scan_objects() returned SHRINK_STOP before doing much work. In that
> scenario, it looks like whether ->nr_deferred is 0 or not is the only
> thing that determines whether we defer the entire remaining scan_count
> or just what is left from the previous ->nr_deferred. The existing code
> appears to consistently factor in what is left from the current scan
> with the previous deferred count. Hm?

If kswapd doesn't have any deferred work, then it's largely no
different in behaviour to direct reclaim. If it has no deferred
work, then the shrinker is not getting stopped early in direct
reclaim, so it's unlikely that kswapd is going to get stopped early,
either....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 15/28] mm: back off direct reclaim on excessive shrinker deferral
  2019-11-04 19:58   ` Brian Foster
@ 2019-11-14 21:28     ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-11-14 21:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Mon, Nov 04, 2019 at 02:58:22PM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:46:05AM +1100, Dave Chinner wrote:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 967e3d3c7748..13c11e10c9c5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -570,6 +570,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  		deferred_count = min(deferred_count, freeable_objects * 2);
> >  
> >  	}
> > +	if (current->reclaim_state)
> > +		current->reclaim_state->scanned_objects += scanned_objects;
> 
> Looks like scanned_objects is always zero here.

Yeah, that was a rebase mis-merge. It should be after the scan loop.

> >  	/*
> >  	 * Avoid risking looping forever due to too large nr value:
> > @@ -585,8 +587,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> >  	 * defer the work to a context that can scan the cache.
> >  	 */
> > -	if (shrinkctl->defer_work)
> > +	if (shrinkctl->defer_work) {
> > +		if (current->reclaim_state)
> > +			current->reclaim_state->deferred_objects += scan_count;
> >  		goto done;
> > +	}
> >  
> >  	/*
> >  	 * Normally, we should not scan less than batch_size objects in one
> > @@ -2871,7 +2876,30 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  
> >  		if (reclaim_state) {
> >  			sc->nr_reclaimed += reclaim_state->reclaimed_pages;
> > +
> > +			/*
> > +			 * If we are deferring more work than we are actually
> > +			 * doing in the shrinkers, and we are scanning more
> > +			 * objects than we are pages, then we have a large amount
> > +			 * of slab caches we are deferring work to kswapd for.
> > +			 * We better back off here for a while, otherwise
> > +			 * we risk priority windup, swap storms and OOM kills
> > +			 * once we empty the page lists but still can't make
> > +			 * progress on the shrinker memory.
> > +			 *
> > +			 * kswapd won't ever defer work as it's run under a
> > +			 * GFP_KERNEL context and can always do work.
> > +			 */
> > +			if ((reclaim_state->deferred_objects >
> > +					sc->nr_scanned - nr_scanned) &&
> 
> Out of curiosity, what's the reasoning behind the direct comparison
> between ->deferred_objects and pages? Shouldn't we generally expect more
> slab objects to exist than pages by the nature of slab?

No, we can't make any assumptions about the amount of memory a
reclaimed object pins. e.g. the xfs buf shrinker frees objects that
might have many pages attached to them (e.g. 64k dir buffer, 16k
inode cluster), the GEM/TTM shrinkers track and free pages, the
ashmem shrinker tracks pages, etc.

What we try to do is balance the cost of reinstantiating objects in
memory against each other. Reading in a page generally takes two
IOs, instantiating a new inode generally requires 2 IOs (dir read,
inode read), etc. That's what shrinker->seeks encodes, and it's an
attempt to balance object counts of the different caches in a
predictable manner.
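
To make the ->seeks weighting concrete, the scan batch multiplier table from
the patch 11 comment quoted earlier in this thread (4.00x for seeks 1 down
to 0.50x for seeks 8) is just 4 / seeks. The sketch below reproduces it in
integer hundredths; the function name is hypothetical, and the seeks == 0
case is excluded because the kernel handles it on a separate path.

```c
#include <assert.h>

/* Scan batch multiplier scaled by 100, i.e. (4.00 / seeks) * 100. */
static unsigned int model_batch_multiplier_x100(unsigned int seeks)
{
	assert(seeks > 0);
	return 400 / seeks;
}
```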


> Also, the comment says "if we are scanning more objects than we are
> pages," yet the code is checking whether we defer more objects than
> scanned pages. Which is more accurate?

Both. :)

If reclaim_state->deferred_objects is larger than the page scan
count, then we either have a very small page cache or we are
deferring a lot of shrinker work.

If we have a small page cache and shrinker reclaim is not making
good progress (i.e. defer more than scan), then we want to back off
for a while rather than rapidly ramp up the reclaim priority to give
the shrinker owner a chance to make progress. The current XFS inode
shrinker does this internally by blocking on IO, but we're getting
rid of that backoff so we need some other way to throttle reclaim when
we have lots of deferral going on.  This reduces the pressure on the
page reclaim code, and goes some way to prevent swap storms (caused
by winding up the reclaim priority on a LRU with no file pages left
on it) when we have pure slab cache memory pressure.
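
The decision described here can be sketched as a predicate. Only the first
comparison (deferred objects against scanned pages) is visible in the quoted
hunk; the second comparison is an assumption taken from the "defer more than
scan" wording above, since the rest of the condition is cut off in the
quote. The function and parameter names are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the direct reclaim backoff check:
 *  - more shrinker work was deferred than pages were scanned, and
 *  - more shrinker work was deferred than was actually done (assumed).
 */
static bool model_should_back_off(unsigned long deferred_objects,
				  unsigned long scanned_objects,
				  unsigned long pages_scanned)
{
	return deferred_objects > pages_scanned &&
	       deferred_objects > scanned_objects;
}
```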

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 16/28] mm: kswapd backoff for shrinkers
  2019-11-04 19:58   ` Brian Foster
@ 2019-11-14 21:41     ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-11-14 21:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Mon, Nov 04, 2019 at 02:58:53PM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:46:06AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When kswapd reaches the end of the page LRU and starts hitting dirty
> > pages, the logic in shrink_node() allows it to back off and wait for
> > IO to complete, thereby preventing kswapd from scanning excessively
> > and driving the system into swap thrashing and OOM conditions.
> > 
> > When we have inode cache heavy workloads on XFS, we have exactly the
> > same problem with reclaim inodes. The non-blocking kswapd reclaim
> > will keep putting pressure onto the inode cache which is unable to
> > make progress. When the system gets to the point where there are no
> > pages in the LRU to free, there is no swap left and there are no
> > clean inodes that can be freed, it will OOM. This has a specific
> > signature in OOM:
> > 
> > [  110.841987] Mem-Info:
> > [  110.842816] active_anon:241 inactive_anon:82 isolated_anon:1
> >                 active_file:168 inactive_file:143 isolated_file:0
> >                 unevictable:2621523 dirty:1 writeback:8 unstable:0
> >                 slab_reclaimable:564445 slab_unreclaimable:420046
> >                 mapped:1042 shmem:11 pagetables:6509 bounce:0
> >                 free:77626 free_pcp:2 free_cma:0
> > 
> > In this case, we have about 500-600 pages left in the LRUs, but we
> > have ~565000 reclaimable slab pages still available for reclaim.
> > Unfortunately, they are mostly dirty inodes, and so we really need
> > to be able to throttle kswapd when shrinker progress is limited due
> > to reaching the dirty end of the LRU...
> > 
> > So, add a flag into the reclaim_state so if the shrinker decides it
> > needs kswapd to back off and wait for a while (for whatever reason)
> > it can do so.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  include/linux/swap.h |  1 +
> >  mm/vmscan.c          | 10 +++++++++-
> >  2 files changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index da0913e14bb9..76fc28f0e483 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -133,6 +133,7 @@ struct reclaim_state {
> >  	unsigned long	reclaimed_pages;	/* pages freed by shrinkers */
> >  	unsigned long	scanned_objects;	/* quantity of work done */ 
> >  	unsigned long	deferred_objects;	/* work that wasn't done */
> > +	bool		need_backoff;		/* tell kswapd to slow down */
> >  };
> >  
> >  /*
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 13c11e10c9c5..0f7d35820057 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2949,8 +2949,16 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  			 * implies that pages are cycling through the LRU
> >  			 * faster than they are written so also forcibly stall.
> >  			 */
> > -			if (sc->nr.immediate)
> > +			if (sc->nr.immediate) {
> >  				congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +			} else if (reclaim_state && reclaim_state->need_backoff) {
> > +				/*
> > +				 * Ditto, but it's a slab cache that is cycling
> > +				 * through the LRU faster than they are written
> > +				 */
> > +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +				reclaim_state->need_backoff = false;
> > +			}
> 
> Seems reasonable from a functional standpoint, but why not plug in to
> the existing stall instead of duplicate it? E.g., add a corresponding
> ->nr_immediate field to reclaim_state rather than a bool, then transfer
> that to the scan_control earlier in the function where we already check
> for reclaim_state and handle transferring fields (or alternatively just
> leave the bool and use it to bump the scan_control field). That seems a
> bit more consistent with the page processing code, keeps the
> reclaim_state resets in one place and also wouldn't leave us with an
> if/else here for the same stall. Hm?

Because I didn't want to touch the page reclaim logic. That code is a
horrible unmaintainable spaghetti nightmare of undocumented
heuristics, conditional behaviours and stuff that doesn't work
anymore (e.g. IO load driven congestion backoffs).

Hence folding new things into existing variables is likely to have
unforeseen side effects. e.g.  sc->nr.immediate only changes when the
PGDAT_WRITEBACK bit is set.  Hence the immediate reclaim behaviour
is very specific to a set of conditions in the page reclaim
algorithm and I don't want to risk perturbing this horrific mess if
I can avoid it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 24/28] xfs: reclaim inodes from the LRU
  2019-11-06 17:21   ` Brian Foster
@ 2019-11-14 21:51     ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-11-14 21:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Wed, Nov 06, 2019 at 12:21:04PM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:46:14AM +1100, Dave Chinner wrote:
> > -	struct xfs_inode	*ip,
> > -	int			flags,
> > -	xfs_lsn_t		*lsn)
> > +enum lru_status
> > +xfs_inode_reclaim_isolate(
> > +	struct list_head	*item,
> > +	struct list_lru_one	*lru,
> > +	spinlock_t		*lru_lock,
> 
> Did we ever establish whether we should cycle the lru_lock during long
> running scans?

I'm still evaluating this.

In theory, because it's non-blocking, the lock hold time isn't huge,
but OTOH I think the hold time is causing lock contention problems on
unlink workloads.  I've found a bunch of perf/blocking problems in
the last few days, and each one of them I sort out puts more
pressure on the lru list lock on unlinks.

> > -	/*
> > -	 * Do unlocked checks to see if the inode already is being flushed or in
> > -	 * reclaim to avoid lock traffic. If the inode is not clean, return the
> > -	 * position in the AIL for the caller to push to.
> > -	 */
> > -	if (!xfs_inode_clean(ip)) {
> > -		*lsn = ip->i_itemp->ili_item.li_lsn;
> > -		return false;
> > +	if (!__xfs_iflock_nowait(ip)) {
> > +		lsn = ip->i_itemp->ili_item.li_lsn;
> 
> This looks like a potential crash vector if we ever got here with a
> clean inode.

I'm not sure we can ever fail a flush lock attempt on a clean inode.
But I'll rework the lsn grabbing, I think.

> > +		ra->dirty_skipped++;
> > +		goto out_unlock_inode;
> >  	}
> >  
> > -	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> > -		return false;
> > +	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> > +		goto reclaim;
> >  
> >  	/*
> > -	 * The radix tree lock here protects a thread in xfs_iget from racing
> > -	 * with us starting reclaim on the inode.  Once we have the
> > -	 * XFS_IRECLAIM flag set it will not touch us.
> > -	 *
> > -	 * Due to RCU lookup, we may find inodes that have been freed and only
> > -	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
> > -	 * aren't candidates for reclaim at all, so we must check the
> > -	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> > +	 * Now the inode is locked, we can actually determine if it is dirty
> > +	 * without racing with anything.
> >  	 */
> > -	spin_lock(&ip->i_flags_lock);
> > -	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
> > -	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
> > -		/* not a reclaim candidate. */
> > -		spin_unlock(&ip->i_flags_lock);
> > -		return false;
> > +	ret = LRU_ROTATE;
> > +	if (xfs_ipincount(ip)) {
> > +		ra->dirty_skipped++;
> 
> Hmm.. didn't we have an LSN check here?

Yes, but if the inode was not in the AIL, it would crash, so I
removed it :P

> Altogether, I think the logic in this function would be a lot more
> simple if we had something like the following:
> 
> 	...
> 	/* ret == LRU_SKIP */
>         if (!xfs_inode_clean(ip)) {
> 		ret = LRU_ROTATE;
>                 lsn = ip->i_itemp->ili_item.li_lsn;
>                 ra->dirty_skipped++;
>         }
>         if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
>                 ra->lowest_lsn = lsn;
>         return ret;
> 
> ... as the non-reclaim exit path.

Yeah, that was what I was thinking when you pointed out the
iflock_nowait issue above. I'll end up with something like this, I
think....
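
For reference, a userspace model of the exit path suggested above. This is a
sketch under stated assumptions, not the reworked patch: all model_* names
are stand-ins, the LSN comparison orders by cycle (upper 32 bits) and then
block (lower 32 bits) as I understand XFS_LSN_CMP() to do, and lowest_lsn is
assumed to start out as NULLCOMMITLSN (modelled here as -1).

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t model_lsn_t;			/* stand-in for xfs_lsn_t */
#define MODEL_NULLCOMMITLSN ((model_lsn_t)-1)	/* stand-in for NULLCOMMITLSN */

enum model_lru_status { MODEL_LRU_SKIP, MODEL_LRU_ROTATE };

struct model_reclaim_args {
	model_lsn_t	lowest_lsn;	/* lowest AIL position seen so far */
	unsigned long	dirty_skipped;	/* dirty inodes we could not reclaim */
};

/* Compare by cycle first, then by block within the cycle. */
static int model_lsn_cmp(model_lsn_t a, model_lsn_t b)
{
	uint32_t cycle_a = (uint32_t)((uint64_t)a >> 32);
	uint32_t cycle_b = (uint32_t)((uint64_t)b >> 32);
	uint32_t block_a = (uint32_t)a;
	uint32_t block_b = (uint32_t)b;

	if (cycle_a != cycle_b)
		return cycle_a < cycle_b ? -1 : 1;
	if (block_a != block_b)
		return block_a < block_b ? -1 : 1;
	return 0;
}

/* Non-reclaim exit: rotate dirty inodes and remember the lowest LSN. */
static enum model_lru_status
model_skip_inode(struct model_reclaim_args *ra, bool clean, model_lsn_t item_lsn)
{
	enum model_lru_status ret = MODEL_LRU_SKIP;
	model_lsn_t lsn = 0;

	if (!clean) {
		ret = MODEL_LRU_ROTATE;
		lsn = item_lsn;	/* only read the log item for dirty inodes */
		ra->dirty_skipped++;
	}
	if (lsn && model_lsn_cmp(lsn, ra->lowest_lsn) < 0)
		ra->lowest_lsn = lsn;
	return ret;
}
```

Because the LSN is only read inside the dirty branch, a clean inode never
touches its log item in this model, which is the crash vector raised
earlier in the review.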

> >  void
> >  xfs_reclaim_all_inodes(
> >  	struct xfs_mount	*mp)
> >  {
> ...
> > +	while (list_lru_count(&mp->m_inode_lru)) {
> 
> It seems unnecessary to call this twice per-iter:
> 
> 	while ((to_free = list_lru_count(&mp->m_inode_lru))) {
> 		...
> 	}

*nod*.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes
  2019-11-06 17:22   ` Brian Foster
@ 2019-11-14 21:53     ` Dave Chinner
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Chinner @ 2019-11-14 21:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Wed, Nov 06, 2019 at 12:22:15PM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:46:16AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > If we are reclaiming all inodes, it is likely we need to flush the
> > entire AIL to do that. We have mechanisms to do that without needing
> > to push to a specific LSN.
> > 
> > Convert xfs_reclaim_all_inodes() to use xfs_ail_push_all variant so
> > we can get rid of the hacky xfs_ail_push_sync() scaffolding we used
> > to support the intermediate stages of the non-blocking reclaim
> > changeset.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_icache.c     | 17 +++++++++++------
> >  fs/xfs/xfs_trans_ail.c  | 32 --------------------------------
> >  fs/xfs/xfs_trans_priv.h |  2 --
> >  3 files changed, 11 insertions(+), 40 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 71a729e29260..11bf4768d491 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> ...
> > @@ -1066,13 +1074,10 @@ xfs_reclaim_all_inodes(
> >  				      xfs_inode_reclaim_isolate, &ra, to_free);
> >  		xfs_dispose_inodes(&ra.freeable);
> >  
> > -		if (freed == 0) {
> > +		if (freed == 0)
> >  			xfs_log_force(mp, XFS_LOG_SYNC);
> > -			xfs_ail_push_all(mp->m_ail);
> > -		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
> > -			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
> > -		}
> > -		cond_resched();
> > +		else if (ra.dirty_skipped)
> > +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> Why not use xfs_ail_push_all_sync() in this function and skip the direct
> stall? This is only used in the unmount and quiesce paths so the big
> hammer approach seems reasonable.

Ok, that's a good simplification :)

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-11-06 22:18   ` Brian Foster
@ 2019-11-14 22:16     ` Dave Chinner
  2019-11-15 13:13       ` Christoph Hellwig
  2019-11-15 17:26       ` Brian Foster
  0 siblings, 2 replies; 72+ messages in thread
From: Dave Chinner @ 2019-11-14 22:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Looking up an unreferenced inode in the inode cache is a bit hairy.
> > We do this for inode invalidation and writeback clustering purposes,
> > which is all invisible to the VFS. Hence we can't take reference
> > counts to the inode and so must be very careful how we do it.
> > 
> > There are several different places that all do the lookups and
> > checks slightly differently. Fundamentally, though, they are all
> > racy and inode reclaim has to block waiting for the inode lock if it
> > loses the race. This is not very optimal given all the work we've
> > already done to make reclaim non-blocking.
> > 
> > We can make the reclaim process nonblocking with a couple of simple
> > changes. If we define the unreferenced lookup process in a way that
> > will either always grab an inode in a way that reclaim will notice
> > and skip, or will notice a reclaim has grabbed the inode so it can
> > skip the inode, then there is no need for reclaim to need to cycle
> > the inode ILOCK at all.
> > 
> > Selecting an inode for reclaim is already non-blocking, so if the
> > ILOCK is held the inode will be skipped. If we ensure that reclaim
> > holds the ILOCK until the inode is freed, then we can do the same
> > thing in the unreferenced lookup to avoid inodes in reclaim. We can
> > do this simply by holding the ILOCK until the RCU grace period
> > expires and the inode freeing callback is run. As all unreferenced
> > lookups have to hold the rcu_read_lock(), we are guaranteed that
> > a reclaimed inode will be noticed as the trylock will fail.
> > 
> ...
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/mrlock.h     |  27 +++++++++
> >  fs/xfs/xfs_icache.c |  88 +++++++++++++++++++++--------
> >  fs/xfs/xfs_inode.c  | 131 +++++++++++++++++++++-----------------------
> >  3 files changed, 153 insertions(+), 93 deletions(-)
> > 
> > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h
> > index 79155eec341b..1752a2592bcc 100644
> > --- a/fs/xfs/mrlock.h
> > +++ b/fs/xfs/mrlock.h
> ...
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 11bf4768d491..45ee3b5cd873 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -106,6 +106,7 @@ xfs_inode_free_callback(
> >  		ip->i_itemp = NULL;
> >  	}
> >  
> > +	mrunlock_excl_non_owner(&ip->i_lock);
> >  	kmem_zone_free(xfs_inode_zone, ip);
> >  }
> >  
> > @@ -132,6 +133,7 @@ xfs_inode_free(
> >  	 * free state. The ip->i_flags_lock provides the barrier against lookup
> >  	 * races.
> >  	 */
> > +	mrupdate_non_owner(&ip->i_lock);
> 
> Can we tie these into the proper locking interface using flags? For
> example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER)
> or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps?

I'd prefer not to make this part of the common locking interface -
it's a one off special use case, not something we want to propagate
elsewhere into the code.

Now that I think it over, I probably should have tagged this
patch with [RFC]. I think we should just get rid of the mrlock
wrappers rather than add more, and that would simplify this a lot.


> >  	spin_lock(&ip->i_flags_lock);
> >  	ip->i_flags = XFS_IRECLAIM;
> >  	ip->i_ino = 0;
> > @@ -295,11 +297,24 @@ xfs_iget_cache_hit(
> >  		}
> >  
> >  		/*
> > -		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
> > -		 * from stomping over us while we recycle the inode. Remove it
> > -		 * from the LRU straight away so we can re-init the VFS inode.
> > +		 * Before we reinitialise the inode, we need to make sure
> > +		 * reclaim does not pull it out from underneath us. We already
> > +		 * hold the i_flags_lock, and because the XFS_IRECLAIM is not
> > +		 * set we know the inode is still on the LRU. However, the LRU
> > +		 * code may have just selected this inode to reclaim, so we need
> > +		 * to ensure we hold the i_flags_lock long enough for the
> > +		 * trylock in xfs_inode_reclaim_isolate() to fail. We do this by
> > +		 * removing the inode from the LRU, which will spin on the LRU
> > +		 * list locks until reclaim stops walking, at which point we
> > +		 * know there is no possible race between reclaim isolation and
> > +		 * this lookup.
> > +		 *
> 
> Somewhat related to my question about the lru_lock on the earlier patch.

*nod*

The caveat here is that this is the slow path so spinning for a
while doesn't really matter.

> > @@ -1022,19 +1076,7 @@ xfs_dispose_inode(
> >  	spin_unlock(&pag->pag_ici_lock);
> >  	xfs_perag_put(pag);
> >  
> > -	/*
> > -	 * Here we do an (almost) spurious inode lock in order to coordinate
> > -	 * with inode cache radix tree lookups.  This is because the lookup
> > -	 * can reference the inodes in the cache without taking references.
> > -	 *
> > -	 * We make that OK here by ensuring that we wait until the inode is
> > -	 * unlocked after the lookup before we go ahead and free it.
> > -	 *
> > -	 * XXX: need to check this is still true. Not sure it is.
> > -	 */
> > -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	xfs_qm_dqdetach(ip);
> > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 
> Ok, so I'm staring at this a bit more and think I'm missing something.
> If we put aside the change to hold ilock until the inode is freed, we
> basically have the following (simplified) flow as the inode goes from
> isolation to disposal:
> 
> 	ilock	(isolate)
> 	iflock
> 	set XFS_IRECLAIM
> 	ifunlock (disposal)
> 	iunlock
> 	radix delete
> 	ilock cycle (drain)
> 	rcu free
> 
> What we're trying to eliminate is the ilock cycle to drain any
> concurrent unreferenced lookups from accessing the inode once it is
> freed. The free itself is still RCU protected.
> 
> Looking over at the ifree path, we now have something like this:
> 
> 	rcu_read_lock()
> 	radix lookup
> 	check XFS_IRECLAIM
> 	ilock
> 	if XFS_ISTALE, skip
> 	set XFS_ISTALE
> 	rcu_read_unlock()
> 	iflock
> 	/* return locked down inode */

You missed a lock.

	rcu_read_lock()
	radix lookup
>>>	i_flags_lock
	check XFS_IRECLAIM
	ilock
	if XFS_ISTALE, skip
	set XFS_ISTALE
>>>	i_flags_unlock
	rcu_read_unlock()
	iflock

> Given that we set XFS_IRECLAIM under ilock, would we still need either
> the ilock cycle or to hold ilock through the RCU free if the ifree side
> (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the
> rcu read lock)?

We set XFS_IRECLAIM under the i_flags_lock.

It is the combination of rcu_read_lock() and i_flags_lock() that
provides the RCU lookup state barriers - the ILOCK is not part of
that at all.

The key point here is that once we've validated the inode we found
in the radix tree under the i_flags_lock, we then take the ILOCK,
thereby serialising the taking of the ILOCK here with the taking of
the ILOCK in the reclaim isolation code.

i.e. all the reclaim state serialisation is actually based around
holding the i_flags_lock, not the ILOCK. 

Once we have grabbed the ILOCK under the i_flags_lock, we can
drop the i_flags_lock knowing that reclaim will not be able to isolate
this inode and set XFS_IRECLAIM.
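
A minimal userspace sketch of that ordering, for illustration only. It
is not the kernel code: the model_* names are hypothetical, C11
atomic_flag stands in for both the i_flags_lock spinlock and the ILOCK,
rcu_read_lock() and the iflock are left out, and it assumes the lookup
side takes the ILOCK with a trylock because a spinlock is held at that
point.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_IRECLAIM 0x1u	/* stands in for XFS_IRECLAIM */
#define MODEL_ISTALE   0x2u	/* stands in for XFS_ISTALE */

struct model_inode {
	atomic_flag	flags_lock;	/* stands in for ip->i_flags_lock */
	atomic_flag	ilock;		/* stands in for the ILOCK */
	unsigned int	flags;		/* only touched under flags_lock */
};

#define MODEL_INODE_INIT { ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, 0 }

static bool model_trylock(atomic_flag *l)
{
	return !atomic_flag_test_and_set(l);
}

static void model_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set(l))
		;	/* spin until the holder clears the flag */
}

static void model_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/* Unreferenced lookup side (ifree/iflush): returns true with ilock held. */
bool model_lookup_grab(struct model_inode *ip)
{
	bool grabbed = false;

	model_spin_lock(&ip->flags_lock);
	if (ip->flags & MODEL_IRECLAIM)
		goto out;		/* reclaim already owns this inode */
	if (!model_trylock(&ip->ilock))
		goto out;		/* someone else holds the ILOCK */
	if (ip->flags & MODEL_ISTALE) {
		model_unlock(&ip->ilock);
		goto out;		/* already stale, skip it */
	}
	ip->flags |= MODEL_ISTALE;
	grabbed = true;
out:
	model_unlock(&ip->flags_lock);
	return grabbed;
}

/* Reclaim isolation side: never blocks, returns true with ilock held. */
bool model_reclaim_isolate(struct model_inode *ip)
{
	if (!model_trylock(&ip->flags_lock))
		return false;
	if (!model_trylock(&ip->ilock)) {
		model_unlock(&ip->flags_lock);
		return false;		/* a lookup got the ILOCK first */
	}
	ip->flags |= MODEL_IRECLAIM;
	model_unlock(&ip->flags_lock);
	return true;
}
```

Whichever side gets the ILOCK while holding the flags lock wins: after
a successful model_lookup_grab(), model_reclaim_isolate() fails on the
ILOCK trylock, and after a successful model_reclaim_isolate() the
lookup sees MODEL_IRECLAIM and skips the inode.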

> ISTM we should either have a non-reclaim inode with
> ilock protection or a reclaim inode with RCU protection (so we can skip
> it before it frees), but I could easily be missing something here..

Heh. Yeah, it's a complex dance, and it's all based around how
RCU lookups and the i_flags_lock interact to provide coherent
detection of freed inodes.

I have a nagging feeling that this whole ILOCK-held-to-rcu-free game
can be avoided. I need to walk myself through the lookup state
machine again and determine if ordering the XFS_IRECLAIM flag check
after grabbing the ILOCK is sufficient to prevent ifree/iflush
lookups from accessing the inode outside the rcu_read_lock()
context.

If so, most of this patch will go away....

> > +	 * attached to the buffer so we don't need to do anything more here.
> >  	 */
> > -	if (ip != free_ip) {
> > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > -			rcu_read_unlock();
> > -			delay(1);
> > -			goto retry;
> > -		}
> > -
> > -		/*
> > -		 * Check the inode number again in case we're racing with
> > -		 * freeing in xfs_reclaim_inode().  See the comments in that
> > -		 * function for more information as to why the initial check is
> > -		 * not sufficient.
> > -		 */
> > -		if (ip->i_ino != inum) {
> > +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> 
> Is there a correctness reason for why we move the stale check to under
> ilock (in both iflush/ifree)?

It's under the i_flags_lock, and so I moved it up under the lookup
hold of the i_flags_lock so we don't need to cycle it again.

> >  	/*
> > -	 * We don't need to attach clean inodes or those only with unlogged
> > -	 * changes (which we throw away, anyway).
> > +	 * We don't need to attach clean inodes to the buffer - they are marked
> > +	 * stale in memory now and will need to be re-initialised by inode
> > +	 * allocation before they can be reused.
> >  	 */
> >  	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> >  		ASSERT(ip != free_ip);
> >  		xfs_ifunlock(ip);
> > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +		if (ip != free_ip)
> > +			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 
> There's an assert against this case just above, though I suppose there's
> nothing wrong with just keeping it and making the functional code more
> cautious.

*nod*

It follows Darrick's lead of making sure that production kernels
don't do something stupid because of some whacky corruption we
didn't expect to ever see.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-11-14 22:16     ` Dave Chinner
@ 2019-11-15 13:13       ` Christoph Hellwig
  2019-11-15 17:26       ` Brian Foster
  1 sibling, 0 replies; 72+ messages in thread
From: Christoph Hellwig @ 2019-11-15 13:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Brian Foster, linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> > Can we tie these into the proper locking interface using flags? For
> > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER)
> > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps?
> 
> I'd prefer not to make this part of the common locking interface -
> it's a one off special use case, not something we want to propagate
> elsewhere into the code.
> 
> Now that I think it over, I probably should have tagged this
> patch with [RFC]. I think we should just get rid of the mrlock
> wrappers rather than add more, and that would simplify this a lot.

Yes, killing off the mrlock wrappers would be very helpful.  The only
thing we use them for is asserts on the locking state.  We could either
switch to lockdep_assert_held*, or just open code the write locked bit.
While it is a little more ugly I'd tend towards the latter given that
the locking asserts are too useful to require lockdep builds with their
performance impact.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/28] mm: directed shrinker work deferral
  2019-11-14 20:49     ` Dave Chinner
@ 2019-11-15 17:21       ` Brian Foster
  2019-11-18  0:49         ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-15 17:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 15, 2019 at 07:49:26AM +1100, Dave Chinner wrote:
> On Mon, Nov 04, 2019 at 10:25:25AM -0500, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Introduce a mechanism for ->count_objects() to indicate to the
> > > shrinker infrastructure that the reclaim context will not allow
> > > scanning work to be done and so the work it decides is necessary
> > > needs to be deferred.
> > > 
> > > This simplifies the code by separating out the accounting of
> > > deferred work from the actual doing of the work, and allows better
> > > decisions to be made by the shrinker control logic on what action it
> > > can take.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > My understanding from the previous discussion(s) is that this is not
> > tied directly to the gfp mask because that is not the only intended use.
> > While it is currently a boolean tied to the entire shrinker call,
> > the longer term objective is per-object granularity.
> 
> Longer term, yes, but right now such things are not possible as the
> shrinker needs more context to be able to make sane per-object
> decisions. shrinker policy decisions that affect the entire run
> scope should be handled by the ->count operation - it's the one that
> says whether the scan loop should run or not, and right now GFP_NOFS
> for all filesystem shrinkers is a pure boolean policy
> implementation.
> 
> The next future step is to provide a superblock context with
> GFP_NOFS to indicate which filesystem we cannot recurse into. That
> is also a shrinker instance wide check, so again it's something that
> ->count should be deciding.
> 
> i.e. ->count determines what is to be done, ->scan iterates the work
> that has to be done until we are done.
> 

Sure, makes sense in general.

> > I find the argument reasonable enough, but if the above is true, why do
> > we move these checks from ->scan_objects() to ->count_objects() (in the
> > next patch) when per-object decisions will ultimately need to be made by
> > the former?
> 
> Because run/no-run policy belongs in one place, and things like
> GFP_NOFS do not change across calls to the ->scan loop. i.e. after
> the first ->scan call in a loop that calls it hundreds to thousands
> of times, the GFP_NOFS run/no-run check is completely redundant.
> 

What loop is currently called hundreds to thousands of times that this
change prevents? AFAICT the current nofs checks in the ->scan calls
explicitly terminate the scan loop. So we're effectively saving a
function call by doing this earlier in the ->count call. (Nothing wrong
with that, I'm just not following the numbers used in this reasoning..).

> Once we introduce a new policy that allows the fs shrinker to do
> careful reclaim in GFP_NOFS conditions, we need to substantially
> rework the shrinker scan loop and how it accounts the work that is
> done - we now have at least 3 or 4 different return counters
> (skipped because locked, skipped because referenced,
> reclaimed, deferred reclaim because couldn't lock/recursion) and
> the accounting and decisions to be made are a lot more complex.
> 

Yeah, that's generally what I expected from your previous description.

> In that case, the ->count function will drop the GFP_NOFS check, but
> still do all the other things it needs to do. The GFP_NOFS check
> will go deep in the guts of the shrinker scan implementation where
> the per-object recursion problem exists. But for most shrinkers,
> it's still going to be a global boolean check...
> 

So once the nofs checks are lifted out of the ->count callback and into
the core shrinker, is there still a use case to defer an entire ->count
instance from the callback?

> > That seems like unnecessary churn and inconsistent with the
> > argument against just temporarily doing something like what Christoph
> > suggested in the previous version, particularly since IIRC the only use
> > in this series was for gfp mask purposes.
> 
> If people want to call avoiding repeated, unnecessary evaluation of
> the same condition hundreds of times instead of once "unnecessary
> churn", then I'll drop it.
> 

I'm not referring to the functional change as churn. What I was
referring to is that we're shuffling around the boilerplate gfp checking
code between the different shrinker callbacks, knowing that it's
eventually going to be lifted out, when we could potentially just lift
that code up a level now.
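
For reference, a hedged userspace model of the deferral mechanism being
discussed, not the actual mm/vmscan.c code. All model_* names, the
MODEL_GFP_FS value and the fixed count of 1024 freeable objects are
made up for illustration. It shows the count callback deciding
run/no-run once, the core skipping the scan loop when defer_work is
set, and the field being reset before the callback as requested in the
review of this patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_GFP_FS 0x1u	/* stand-in for __GFP_FS, value is arbitrary */

struct model_shrink_control {
	unsigned int	gfp_mask;
	bool		defer_work;	/* set by the count callback */
	unsigned long	scan_calls;	/* bookkeeping for this model only */
};

/* Count callback: makes the run/no-run decision once per invocation. */
static unsigned long model_count(struct model_shrink_control *sc)
{
	if (!(sc->gfp_mask & MODEL_GFP_FS))
		sc->defer_work = true;
	return 1024;	/* pretend 1024 objects are freeable */
}

/* Scan callback: no context check needed, it only does the work. */
static unsigned long model_scan(struct model_shrink_control *sc,
				unsigned long nr_to_scan)
{
	sc->scan_calls++;
	return nr_to_scan;	/* pretend everything scanned was freed */
}

/* Core loop: returns objects freed, adds skipped work to *deferred. */
unsigned long model_do_shrink(struct model_shrink_control *sc,
			      unsigned long batch, unsigned long *deferred)
{
	unsigned long freed = 0;
	unsigned long todo;

	if (batch == 0)
		return 0;

	sc->defer_work = false;	/* never inherit a stale value */
	todo = model_count(sc);

	if (sc->defer_work) {
		*deferred += todo;	/* leave it for a context that can scan */
		return 0;
	}

	while (todo >= batch) {
		freed += model_scan(sc, batch);
		todo -= batch;
	}
	return freed;
}
```

With gfp_mask lacking MODEL_GFP_FS the scan callback is never entered
and the whole count is deferred. Otherwise the loop runs without any
per-call context check.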

Brian

> > >  include/linux/shrinker.h | 7 +++++++
> > >  mm/vmscan.c              | 8 ++++++++
> > >  2 files changed, 15 insertions(+)
> > > 
> > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > index 0f80123650e2..3405c39ab92c 100644
> > > --- a/include/linux/shrinker.h
> > > +++ b/include/linux/shrinker.h
> > > @@ -31,6 +31,13 @@ struct shrink_control {
> > >  
> > >  	/* current memcg being shrunk (for memcg aware shrinkers) */
> > >  	struct mem_cgroup *memcg;
> > > +
> > > +	/*
> > > +	 * set by ->count_objects if reclaim context prevents reclaim from
> > > +	 * occurring. This allows the shrinker to immediately defer all the
> > > +	 * work and not even attempt to scan the cache.
> > > +	 */
> > > +	bool defer_work;
> > >  };
> > >  
> > >  #define SHRINK_STOP (~0UL)
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index ee4eecc7e1c2..a215d71d9d4b 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -536,6 +536,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > >  				   freeable, delta, total_scan, priority);
> > >  
> > > +	/*
> > > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > > +	 * defer the work to a context that can scan the cache.
> > > +	 */
> > > +	if (shrinkctl->defer_work)
> > > +		goto done;
> > > +
> > 
> > I still find the fact that this per-shrinker invocation field is never
> > reset unnecessarily fragile, and I don't see any good reason not to
> > reset it prior to the shrinker callback that potentially sets it.
> 
> I missed that when updating. I'll reset it in the next version.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 12/28] shrinker: defer work only to kswapd
  2019-11-14 21:11     ` Dave Chinner
@ 2019-11-15 17:23       ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-15 17:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 15, 2019 at 08:11:50AM +1100, Dave Chinner wrote:
> On Mon, Nov 04, 2019 at 10:29:54AM -0500, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:46:02AM +1100, Dave Chinner wrote:
> > > @@ -601,10 +605,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  	 * scanning at high prio and therefore should try to reclaim as much as
> > >  	 * possible.
> > >  	 */
> > > -	while (total_scan >= batch_size ||
> > > -	       total_scan >= freeable_objects) {
> > > +	while (scan_count >= batch_size ||
> > > +	       scan_count >= freeable_objects) {
> > >  		unsigned long ret;
> > > -		unsigned long nr_to_scan = min(batch_size, total_scan);
> > > +		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
> > >  
> > >  		shrinkctl->nr_to_scan = nr_to_scan;
> > >  		shrinkctl->nr_scanned = nr_to_scan;
> > > @@ -614,29 +618,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  		freed += ret;
> > >  
> > >  		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
> > > -		total_scan -= shrinkctl->nr_scanned;
> > > -		scanned += shrinkctl->nr_scanned;
> > > +		scan_count -= shrinkctl->nr_scanned;
> > > +		scanned_objects += shrinkctl->nr_scanned;
> > >  
> > >  		cond_resched();
> > >  	}
> > > -
> > >  done:
> > > -	if (next_deferred >= scanned)
> > > -		next_deferred -= scanned;
> > > +	if (deferred_count)
> > > +		next_deferred = deferred_count - scanned_objects;
> > >  	else
> > > -		next_deferred = 0;
> > > +		next_deferred = scan_count;
> > 
> > Hmm.. so if there was no deferred count on this cycle, we set
> > next_deferred to whatever is left from scan_count and add that back into
> > the shrinker struct below. If there was a pending deferred count on this
> > cycle, we subtract what we scanned from that and add that value back.
> > But what happens to the remaining scan_count in the latter case? Is it
> > lost, or am I missing something?
> 
> if deferred_count is not zero, then it is kswapd that is running. It
> does the deferred work, and if it doesn't make progress then adding
> its scan count to the deferred work doesn't matter. That's because
> it will come back with an increased priority in a short while and
> try to scan more of the deferred count plus its larger scan count.
> 

Ok, so perhaps there is no functional reason to defer remaining scan
count from a context (i.e. kswapd) that attempts to process deferred
work...

> IOWs, if we defer kswapd unused scan count, we effectively increase
> the pressure as the priority goes up, potentially making the
> deferred count increase out of control. i.e. kswapd can make
> progress and free items, but the result is that it increased the
> deferred scan count rather than reducing it. This leads to excessive
> reclaim of the slab caches and kswapd can trash the caches long
> after the memory pressure has gone away...
> 

... yet if kswapd runs without pre-existing deferred work, that's
precisely what it does. next_deferred is set to remaining scan_count and
that is added back to the shrinker struct. So should kswapd generally
defer work or not? If the answer is sometimes, then please add a comment
to the next_deferred assignment to explain when/why.
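
To make the two cases concrete, here is a small standalone model of the
assignment in the hunk above. It is only a sketch: the function name is
hypothetical, signed long is assumed for all three values, and
scan_count means whatever is left over after the scan loop.

```c
#include <assert.h>

/*
 * Model of the "done:" logic in the quoted hunk:
 * - with pre-existing deferred work, only that work minus what was
 *   scanned is carried forward and the leftover scan_count is dropped;
 * - without it, the leftover scan_count itself is deferred.
 */
long model_next_deferred(long deferred_count, long scanned_objects,
			 long scan_count)
{
	if (deferred_count)
		return deferred_count - scanned_objects;
	return scan_count;
}
```

So model_next_deferred(0, 128, 900) carries 900 forward, while
model_next_deferred(500, 128, 900) carries 372 forward and the
remaining 900 is not accounted anywhere.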

> > For example, suppose we start this cycle with a large scan_count and
> > ->scan_objects() returned SHRINK_STOP before doing much work. In that
> > scenario, it looks like whether ->nr_deferred is 0 or not is the only
> > thing that determines whether we defer the entire remaining scan_count
> > or just what is left from the previous ->nr_deferred. The existing code
> > appears to consistently factor in what is left from the current scan
> > with the previous deferred count. Hm?
> 
> If kswapd doesn't have any deferred work, then it's largely no
> different in behaviour to direct reclaim. If it has no deferred
> work, then the shrinker is not getting stopped early in direct
> reclaim, so it's unlikely that kswapd is going to get stopped early,
> either....
> 

Then perhaps the logic could be simplified to explicitly not defer from
kswapd..?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-11-14 22:16     ` Dave Chinner
  2019-11-15 13:13       ` Christoph Hellwig
@ 2019-11-15 17:26       ` Brian Foster
  2019-11-18  1:00         ` Dave Chinner
  1 sibling, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-15 17:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Looking up an unreferenced inode in the inode cache is a bit hairy.
> > > We do this for inode invalidation and writeback clustering purposes,
> > > which is all invisible to the VFS. Hence we can't take reference
> > > counts to the inode and so must be very careful how we do it.
> > > 
> > > There are several different places that all do the lookups and
> > > checks slightly differently. Fundamentally, though, they are all
> > > racy and inode reclaim has to block waiting for the inode lock if it
> > > > loses the race. This is not very optimal given all the work we've
> > > already done to make reclaim non-blocking.
> > > 
> > > We can make the reclaim process nonblocking with a couple of simple
> > > changes. If we define the unreferenced lookup process in a way that
> > > will either always grab an inode in a way that reclaim will notice
> > > and skip, or will notice a reclaim has grabbed the inode so it can
> > > skip the inode, then there is no need for reclaim to need to cycle
> > > the inode ILOCK at all.
> > > 
> > > Selecting an inode for reclaim is already non-blocking, so if the
> > > ILOCK is held the inode will be skipped. If we ensure that reclaim
> > > holds the ILOCK until the inode is freed, then we can do the same
> > > thing in the unreferenced lookup to avoid inodes in reclaim. We can
> > > do this simply by holding the ILOCK until the RCU grace period
> > > expires and the inode freeing callback is run. As all unreferenced
> > > lookups have to hold the rcu_read_lock(), we are guaranteed that
> > > a reclaimed inode will be noticed as the trylock will fail.
> > > 
> > ...
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/mrlock.h     |  27 +++++++++
> > >  fs/xfs/xfs_icache.c |  88 +++++++++++++++++++++--------
> > >  fs/xfs/xfs_inode.c  | 131 +++++++++++++++++++++-----------------------
> > >  3 files changed, 153 insertions(+), 93 deletions(-)
> > > 
> > > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h
> > > index 79155eec341b..1752a2592bcc 100644
> > > --- a/fs/xfs/mrlock.h
> > > +++ b/fs/xfs/mrlock.h
> > ...
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 11bf4768d491..45ee3b5cd873 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -106,6 +106,7 @@ xfs_inode_free_callback(
> > >  		ip->i_itemp = NULL;
> > >  	}
> > >  
> > > +	mrunlock_excl_non_owner(&ip->i_lock);
> > >  	kmem_zone_free(xfs_inode_zone, ip);
> > >  }
> > >  
> > > @@ -132,6 +133,7 @@ xfs_inode_free(
> > >  	 * free state. The ip->i_flags_lock provides the barrier against lookup
> > >  	 * races.
> > >  	 */
> > > +	mrupdate_non_owner(&ip->i_lock);
> > 
> > Can we tie these into the proper locking interface using flags? For
> > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER)
> > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps?
> 
> I'd prefer not to make this part of the common locking interface -
> it's a one off special use case, not something we want to progate
> elsewhere into the code.
> 

What irks me about this is that it obfuscates rather than highlights
that fact because I have no idea what mrtryupdate_non_owner() actually
does without looking it up. We could easily name a flag
XFS_ILOCK_PENDING_RECLAIM or something similarly ridiculous to make it
blindingly obvious it should only be used in a special context.

> Now that I think it over, I probably should have tagged this
> patch with [RFC]. I think we should just get rid of the mrlock
> wrappers rather than add more, and that would simplify this a lot.
> 

Yeah, FWIW I've been reviewing this patch as a WIP on top of all of the
nonblocking bits as opposed to being some fundamental part of that work.
That aside, I also agree that cleaning up these wrappers might address
that concern because something like:

	/* open code ilock because ... */
	down_write_trylock_non_owner(&ip->i_lock);

... is at least readable code.

> 
> > >  	spin_lock(&ip->i_flags_lock);
> > >  	ip->i_flags = XFS_IRECLAIM;
> > >  	ip->i_ino = 0;
> > > @@ -295,11 +297,24 @@ xfs_iget_cache_hit(
> > >  		}
> > >  
> > >  		/*
> > > -		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
> > > -		 * from stomping over us while we recycle the inode. Remove it
> > > -		 * from the LRU straight away so we can re-init the VFS inode.
> > > +		 * Before we reinitialise the inode, we need to make sure
> > > +		 * reclaim does not pull it out from underneath us. We already
> > > +		 * hold the i_flags_lock, and because the XFS_IRECLAIM is not
> > > +		 * set we know the inode is still on the LRU. However, the LRU
> > > +		 * code may have just selected this inode to reclaim, so we need
> > > +		 * to ensure we hold the i_flags_lock long enough for the
> > > +		 * trylock in xfs_inode_reclaim_isolate() to fail. We do this by
> > > +		 * removing the inode from the LRU, which will spin on the LRU
> > > +		 * list locks until reclaim stops walking, at which point we
> > > +		 * know there is no possible race between reclaim isolation and
> > > +		 * this lookup.
> > > +		 *
> > 
> > Somewhat related to my question about the lru_lock on the earlier patch.
> 
> *nod*
> 
> The caveat here is that this is the slow path so spinning for a
> while doesn't really matter.
> 
> > > @@ -1022,19 +1076,7 @@ xfs_dispose_inode(
> > >  	spin_unlock(&pag->pag_ici_lock);
> > >  	xfs_perag_put(pag);
> > >  
> > > -	/*
> > > -	 * Here we do an (almost) spurious inode lock in order to coordinate
> > > -	 * with inode cache radix tree lookups.  This is because the lookup
> > > -	 * can reference the inodes in the cache without taking references.
> > > -	 *
> > > -	 * We make that OK here by ensuring that we wait until the inode is
> > > -	 * unlocked after the lookup before we go ahead and free it.
> > > -	 *
> > > -	 * XXX: need to check this is still true. Not sure it is.
> > > -	 */
> > > -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > >  	xfs_qm_dqdetach(ip);
> > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > Ok, so I'm staring at this a bit more and think I'm missing something.
> > If we put aside the change to hold ilock until the inode is freed, we
> > basically have the following (simplified) flow as the inode goes from
> > isolation to disposal:
> > 
> > 	ilock	(isolate)
> > 	iflock
> > 	set XFS_IRECLAIM
> > 	ifunlock (disposal)
> > 	iunlock
> > 	radix delete
> > 	ilock cycle (drain)
> > 	rcu free
> > 
> > What we're trying to eliminate is the ilock cycle to drain any
> > concurrent unreferenced lookups from accessing the inode once it is
> > freed. The free itself is still RCU protected.
> > 
> > Looking over at the ifree path, we now have something like this:
> > 
> > 	rcu_read_lock()
> > 	radix lookup
> > 	check XFS_IRECLAIM
> > 	ilock
> > 	if XFS_ISTALE, skip
> > 	set XFS_ISTALE
> > 	rcu_read_unlock()
> > 	iflock
> > 	/* return locked down inode */
> 
> You missed a lock.
> 
> 	rcu_read_lock()
> 	radix lookup
> >>>	i_flags_lock
> 	check XFS_IRECLAIM
> 	ilock
> 	if XFS_ISTALE, skip
> 	set XFS_ISTALE
> >>>	i_flags_unlock
> 	rcu_read_unlock()
> 	iflock
> 
> > Given that we set XFS_IRECLAIM under ilock, would we still need either
> > the ilock cycle or to hold ilock through the RCU free if the ifree side
> > (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the
> > rcu read lock)?
> 
> We set XFS_IRECLAIM under the i_flags_lock.
> 
> It is the combination of rcu_read_lock() and i_flags_lock() that
> provides the RCU lookup state barriers - the ILOCK is not part of
> that at all.
> 
> The key point here is that once we've validated the inode we found
> in the radix tree under the i_flags_lock, we then take the ILOCK,
> thereby serialising the taking of the ILOCK here with the taking of
> the ILOCK in the reclaim isolation code.
> 
> i.e. all the reclaim state serialisation is actually based around
> holding the i_flags_lock, not the ILOCK. 
> 
> Once we have grabbed the ILOCK under the i_flags_lock, we can
> drop the i_flags_lock knowing that reclaim will not be able to isolate
> this inode and set XFS_IRECLAIM.
> 

Hmm, Ok. I knew i_flags_lock was in there when I wrote this up. I
intentionally left it out as a simplification because it wasn't clear to
me that it was a critical part of the lookup. I'll keep this in mind the
next time I walk through this code.

> > ISTM we should either have a non-reclaim inode with
> > ilock protection or a reclaim inode with RCU protection (so we can skip
> > it before it frees), but I could easily be missing something here..
> 
> Heh. Yeah, it's a complex dance, and it's all based around how
> RCU lookups and the i_flags_lock interact to provide coherent
> detection of freed inodes.
> 
> I have a nagging feeling that this whole ILOCK-held-to-rcu-free game
> can be avoided. I need to walk myself through the lookup state
> machine again and determine if ordering the XFS_IRECLAIM flag check
> after grabbing the ILOCK is sufficient to prevent ifree/iflush
> lookups from accessing the inode outside the rcu_read_lock()
> context.
> 

That's pretty much what I was wondering...

> If so, most of this patch will go away....
> 
> > > +	 * attached to the buffer so we don't need to do anything more here.
> > >  	 */
> > > -	if (ip != free_ip) {
> > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > > -			rcu_read_unlock();
> > > -			delay(1);
> > > -			goto retry;
> > > -		}
> > > -
> > > -		/*
> > > -		 * Check the inode number again in case we're racing with
> > > -		 * freeing in xfs_reclaim_inode().  See the comments in that
> > > -		 * function for more information as to why the initial check is
> > > -		 * not sufficient.
> > > -		 */
> > > -		if (ip->i_ino != inum) {
> > > +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> > 
> > Is there a correctness reason for why we move the stale check to under
> > ilock (in both iflush/ifree)?
> 
> It's under the i_flags_lock, and so I moved it up under the lookup
> hold of the i_flags_lock so we don't need to cycle it again.
> 

Yeah, but in both cases it looks like it moved to under the ilock as
well, which comes after i_flags_lock. IOW, why grab ilock for stale
inodes when we're just going to skip them?

Brian

> > >  	/*
> > > -	 * We don't need to attach clean inodes or those only with unlogged
> > > -	 * changes (which we throw away, anyway).
> > > +	 * We don't need to attach clean inodes to the buffer - they are marked
> > > +	 * stale in memory now and will need to be re-initialised by inode
> > > +	 * allocation before they can be reused.
> > >  	 */
> > >  	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> > >  		ASSERT(ip != free_ip);
> > >  		xfs_ifunlock(ip);
> > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +		if (ip != free_ip)
> > > +			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > There's an assert against this case just above, though I suppose there's
> > nothing wrong with just keeping it and making the functional code more
> > cautious.
> 
> *nod*
> 
> It follows Darrick's lead of making sure that production kernels
> don't do something stupid because of some whacky corruption we
> didn't expect to ever see.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/28] mm: directed shrinker work deferral
  2019-11-15 17:21       ` Brian Foster
@ 2019-11-18  0:49         ` Dave Chinner
  2019-11-19 15:12           ` Brian Foster
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-11-18  0:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 15, 2019 at 12:21:40PM -0500, Brian Foster wrote:
> On Fri, Nov 15, 2019 at 07:49:26AM +1100, Dave Chinner wrote:
> > On Mon, Nov 04, 2019 at 10:25:25AM -0500, Brian Foster wrote:
> > > On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Introduce a mechanism for ->count_objects() to indicate to the
> > > > shrinker infrastructure that the reclaim context will not allow
> > > > scanning work to be done and so the work it decides is necessary
> > > > needs to be deferred.
> > > > 
> > > > This simplifies the code by separating out the accounting of
> > > > deferred work from the actual doing of the work, and allows better
> > > > decisions to be made by the shrinker control logic on what action it
> > > > can take.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > 
> > > My understanding from the previous discussion(s) is that this is not
> > > tied directly to the gfp mask because that is not the only intended use.
> > > While it is currently a boolean tied to the entire shrinker call,
> > > the longer term objective is per-object granularity.
> > 
> > Longer term, yes, but right now such things are not possible as the
> > shrinker needs more context to be able to make sane per-object
> > decisions. shrinker policy decisions that affect the entire run
> > scope should be handled by the ->count operation - it's the one that
> > says whether the scan loop should run or not, and right now GFP_NOFS
> > for all filesystem shrinkers is a pure boolean policy
> > implementation.
> > 
> > The next future step is to provide a superblock context with
> > GFP_NOFS to indicate which filesystem we cannot recurse into. That
> > is also a shrinker instance wide check, so again it's something that
> > ->count should be deciding.
> > 
> > i.e. ->count determines what is to be done, ->scan iterates the work
> > that has to be done until we are done.
> > 
> 
> Sure, makes sense in general.
> 
> > > I find the argument reasonable enough, but if the above is true, why do
> > > we move these checks from ->scan_objects() to ->count_objects() (in the
> > > next patch) when per-object decisions will ultimately need to be made by
> > > the former?
> > 
> > Because run/no-run policy belongs in one place, and things like
> > GFP_NOFS do not change across calls to the ->scan loop. i.e. after
> > the first ->scan call in a loop that calls it hundreds to thousands
> > of times, the GFP_NOFS run/no-run check is completely redundant.
> > 
> 
> What loop is currently called hundreds to thousands of times that this
> change prevents? AFAICT the current nofs checks in the ->scan calls
> explicitly terminate the scan loop.

Right, but when we are in GFP_KERNEL context, every call to ->scan()
checks it and says "ok". If we are scanning tens of thousands of
objects in a scan, and we are using a default batch size of 128
objects per scan, then we have hundreds of calls in a single scan
loop that check the GFP context and say "ok"....
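
To put rough numbers on that, here is a userspace sketch (not kernel
code) that counts how often the context test runs when it lives in the
per-batch ->scan path versus once in ->count. Every name in it
(sketch_ctl, SKETCH_GFP_FS, sketch_check_in_scan(), ...) is made up for
illustration, and the "reclaim" is just a counter:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins only; none of these are real kernel symbols. */
#define SKETCH_GFP_FS	0x1u
#define SKETCH_BATCH	128UL	/* the default batch size mentioned above */

struct sketch_ctl {
	unsigned int	gfp_mask;	/* allocation context of the caller */
	unsigned long	gfp_checks;	/* times the context has been tested */
};

/* Test the context and count the test. */
static bool sketch_fs_allowed(struct sketch_ctl *sc)
{
	assert(sc);
	sc->gfp_checks++;
	return (sc->gfp_mask & SKETCH_GFP_FS) != 0;
}

/* Policy check inside ->scan: repeated for every batch. */
unsigned long sketch_check_in_scan(struct sketch_ctl *sc, unsigned long total)
{
	unsigned long scanned = 0;

	while (scanned < total) {
		unsigned long nr = total - scanned;

		if (nr > SKETCH_BATCH)
			nr = SKETCH_BATCH;
		if (!sketch_fs_allowed(sc))
			break;		/* the "terminate" case */
		scanned += nr;		/* stand-in for reclaiming a batch */
	}
	return scanned;
}

/* Policy check inside ->count: made once, before the loop starts. */
unsigned long sketch_check_in_count(struct sketch_ctl *sc, unsigned long total)
{
	unsigned long scanned = 0;

	if (!sketch_fs_allowed(sc))
		return 0;		/* the work would be deferred instead */

	while (scanned < total) {
		unsigned long nr = total - scanned;

		if (nr > SKETCH_BATCH)
			nr = SKETCH_BATCH;
		scanned += nr;
	}
	return scanned;
}
```

With 20,000 objects and a batch of 128 there are 157 batches, so the
first variant tests the context 157 times in the "ok" case and the
second tests it once.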

> So we're effectively saving a
> function call by doing this earlier in the ->count call. (Nothing wrong
> with that, I'm just not following the numbers used in this reasoning..).

It's the don't terminate case. :)

> > Once we introduce a new policy that allows the fs shrinker to do
> > careful reclaim in GFP_NOFS conditions, we need to do substantial
> > rework of the shrinker scan loop and how it accounts the work that is
> > done - we now have at least 3 or 4 different return counters
> > (skipped because locked, skipped because referenced,
> > reclaimed, deferred reclaim because couldn't lock/recursion) and
> > the accounting and decisions to be made are a lot more complex.
> > 
> 
> Yeah, that's generally what I expected from your previous description.
> 
> > In that case, the ->count function will drop the GFP_NOFS check, but
> > still do all the other things it needs to do. The GFP_NOFS check
> > will go deep in the guts of the shrinker scan implementation where
> > the per-object recursion problem exists. But for most shrinkers,
> > it's still going to be a global boolean check...
> > 
> 
> So once the nofs checks are lifted out of the ->count callback and into
> the core shrinker, is there still a use case to defer an entire ->count
> instance from the callback?

Not right now. There may be in future, but I don't want to make
things more complex than they need to be by trying to support
functionality that isn't used.

> > If people want to call avoiding repeated, unnecessary evaluation of
> > the same condition hundreds of times instead of once "unnecessary
> > churn", then I'll drop it.
> > 
> 
> I'm not referring to the functional change as churn. What I was
> referring to is that we're shuffling around the boilerplate gfp checking
> code between the different shrinker callbacks, knowing that it's
> eventually going to be lifted out, when we could potentially just lift
> that code up a level now.

I don't think that lifting it up will save much code at all, once we
add all the gfp mask initialisation to all the shrinkers, etc. It
just means we can't look at the shrinker implementation and know
that it can't run in GFP_NOFS context - we have to go look up
where it is instantiated instead to see if there are gfp context
constraints.

I think it's better where it is, documenting the constraints the
shrinker implementation runs under in the implementation itself...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-11-15 17:26       ` Brian Foster
@ 2019-11-18  1:00         ` Dave Chinner
  2019-11-19 15:13           ` Brian Foster
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-11-18  1:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Fri, Nov 15, 2019 at 12:26:00PM -0500, Brian Foster wrote:
> On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> > If so, most of this patch will go away....
> > 
> > > > +	 * attached to the buffer so we don't need to do anything more here.
> > > >  	 */
> > > > -	if (ip != free_ip) {
> > > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > > > -			rcu_read_unlock();
> > > > -			delay(1);
> > > > -			goto retry;
> > > > -		}
> > > > -
> > > > -		/*
> > > > -		 * Check the inode number again in case we're racing with
> > > > -		 * freeing in xfs_reclaim_inode().  See the comments in that
> > > > -		 * function for more information as to why the initial check is
> > > > -		 * not sufficient.
> > > > -		 */
> > > > -		if (ip->i_ino != inum) {
> > > > +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> > > 
> > > Is there a correctness reason for why we move the stale check to under
> > > ilock (in both iflush/ifree)?
> > 
> > It's under the i_flags_lock, and so I moved it up under the lookup
> > hold of the i_flags_lock so we don't need to cycle it again.
> > 
> 
> Yeah, but in both cases it looks like it moved to under the ilock as
> well, which comes after i_flags_lock. IOW, why grab ilock for stale
> inodes when we're just going to skip them?

Because I was worrying about serialising against reclaim before
changing the state of the inode. i.e. if the inode has already been
isolated but not yet disposed of, we shouldn't touch the inode state
at all. Serialisation against reclaim in this patch is via the
ILOCK, hence we need to do that before setting ISTALE....

IOWs, ISTALE is not protected by ILOCK, we just can't modify the
inode state until after we've gained the ILOCK to protect against
reclaim....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 09/28] mm: directed shrinker work deferral
  2019-11-18  0:49         ` Dave Chinner
@ 2019-11-19 15:12           ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-19 15:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Mon, Nov 18, 2019 at 11:49:56AM +1100, Dave Chinner wrote:
> On Fri, Nov 15, 2019 at 12:21:40PM -0500, Brian Foster wrote:
> > On Fri, Nov 15, 2019 at 07:49:26AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 04, 2019 at 10:25:25AM -0500, Brian Foster wrote:
> > > > On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > Introduce a mechanism for ->count_objects() to indicate to the
> > > > > shrinker infrastructure that the reclaim context will not allow
> > > > > scanning work to be done and so the work it decides is necessary
> > > > > needs to be deferred.
> > > > > 
> > > > > This simplifies the code by separating out the accounting of
> > > > > deferred work from the actual doing of the work, and allows better
> > > > > decisions to be made by the shrinker control logic on what action it
> > > > > can take.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > 
> > > > My understanding from the previous discussion(s) is that this is not
> > > > tied directly to the gfp mask because that is not the only intended use.
> > > > While it is currently a boolean tied to the entire shrinker call,
> > > > the longer term objective is per-object granularity.
> > > 
> > > Longer term, yes, but right now such things are not possible as the
> > > shrinker needs more context to be able to make sane per-object
> > > decisions. shrinker policy decisions that affect the entire run
> > > scope should be handled by the ->count operation - it's the one that
> > > says whether the scan loop should run or not, and right now GFP_NOFS
> > > for all filesystem shrinkers is a pure boolean policy
> > > implementation.
> > > 
> > > The next future step is to provide a superblock context with
> > > GFP_NOFS to indicate which filesystem we cannot recurse into. That
> > > is also a shrinker instance wide check, so again it's something that
> > > ->count should be deciding.
> > > 
> > > i.e. ->count determines what is to be done, ->scan iterates the work
> > > that has to be done until we are done.
> > > 
> > 
> > Sure, makes sense in general.
> > 
> > > > I find the argument reasonable enough, but if the above is true, why do
> > > > we move these checks from ->scan_objects() to ->count_objects() (in the
> > > > next patch) when per-object decisions will ultimately need to be made by
> > > > the former?
> > > 
> > > Because run/no-run policy belongs in one place, and things like
> > > GFP_NOFS do not change across calls to the ->scan loop. i.e. after
> > > the first ->scan call in a loop that calls it hundreds to thousands
> > > of times, the GFP_NOFS run/no-run check is completely redundant.
> > > 
> > 
> > What loop is currently called hundreds to thousands of times that this
> > change prevents? AFAICT the current nofs checks in the ->scan calls
> > explicitly terminate the scan loop.
> 
> Right, but when we are in GFP_KERNEL context, every call to ->scan()
> checks it and says "ok". If we are scanning tens of thousands of
> objects in a scan, and we are using a default batch size of 128
> objects per scan, then we have hundreds of calls in a single scan
> loop that check the GFP context and say "ok"....
> 
> > So we're effectively saving a
> > function call by doing this earlier in the ->count call. (Nothing wrong
> > with that, I'm just not following the numbers used in this reasoning..).
> 
> It's the don't terminate case. :)
> 

Oh, I see. You're talking about the number of executions of the gfp
check itself. That makes sense, though my understanding is that we'll
ultimately have a similar check anyways if we want per-object
granularity based on the allocation constraints of the current context.
OTOH, the check would still occur only once with an alloc flags field in
the shrinker structure too, FWIW.
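
For concreteness, a userspace sketch of what a single core-side check
against a per-shrinker flags field could look like. This is purely
hypothetical: the required_gfp field, sk_shrinker and sk_shrink() do
not exist in the kernel or in this series, and the amount deferred is
simplified to "everything cached":

```c
#include <assert.h>

/* Hypothetical names throughout; this is not the kernel shrinker API. */
#define SK_GFP_FS	0x1u

struct sk_shrinker {
	unsigned int	required_gfp;	/* context bits the callbacks need */
	unsigned long	nr_cached;	/* objects currently freeable */
	unsigned long	nr_deferred;	/* work postponed for a later caller */
};

/* Core loop: the context test happens once, outside the callbacks. */
unsigned long sk_shrink(struct sk_shrinker *s, unsigned int ctx_gfp,
			unsigned long batch)
{
	unsigned long freed = 0;

	assert(s);
	if (batch == 0)
		return 0;

	if ((ctx_gfp & s->required_gfp) != s->required_gfp) {
		/* Cannot run in this context: account the work as deferred. */
		s->nr_deferred += s->nr_cached;
		return 0;
	}

	while (s->nr_cached) {
		unsigned long nr = s->nr_cached < batch ? s->nr_cached : batch;

		s->nr_cached -= nr;	/* stand-in for ->scan freeing a batch */
		freed += nr;
	}
	return freed;
}
```

The trade-off is that the constraint then lives where the shrinker is
instantiated rather than in the callback implementation itself.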

> > > Once we introduce a new policy that allows the fs shrinker to do
> > > careful reclaim in GFP_NOFS conditions, we need to do substantial
> > > rework of the shrinker scan loop and how it accounts the work that is
> > > done - we now have at least 3 or 4 different return counters
> > > (skipped because locked, skipped because referenced,
> > > reclaimed, deferred reclaim because couldn't lock/recursion) and
> > > the accounting and decisions to be made are a lot more complex.
> > > 
> > 
> > Yeah, that's generally what I expected from your previous description.
> > 
> > > In that case, the ->count function will drop the GFP_NOFS check, but
> > > still do all the other things it needs to do. The GFP_NOFS check
> > > will go deep in the guts of the shrinker scan implementation where
> > > the per-object recursion problem exists. But for most shrinkers,
> > > it's still going to be a global boolean check...
> > > 
> > 
> > So once the nofs checks are lifted out of the ->count callback and into
> > the core shrinker, is there still a use case to defer an entire ->count
> > instance from the callback?
> 
> Not right now. There may be in future, but I don't want to make
> things more complex than they need to be by trying to support
> functionality that isn't used.
> 

Ok, but do note that the reason I ask is to touch on simply whether it's
worth putting this in the ->scan callback at all. It's not like _not_
doing that is some big complexity adjustment. ;)

> > > If people want to call avoiding repeated, unnecessary evaluation of
> > > the same condition hundreds of times instead of once "unnecessary
> > > churn", then I'll drop it.
> > > 
> > 
> > I'm not referring to the functional change as churn. What I was
> > referring to is that we're shuffling around the boilerplate gfp checking
> > code between the different shrinker callbacks, knowing that it's
> > eventually going to be lifted out, when we could potentially just lift
> > that code up a level now.
> 
> I don't think that lifting it up will save much code at all, once we
> add all the gfp mask initialisation to all the shrinkers, etc. It
> just means we can't look at the shrinker implementation and know
> that it can't run in GFP_NOFS context - we have to go look up
> where it is instantiated instead to see if there are gfp context
> constraints.
> 
> I think it's better where it is, documenting the constraints the
> shrinker implementation runs under in the implementation itself...
> 

Fair enough.. I don't necessarily agree that this is the best approach,
but the implementation is reasonable enough that I certainly don't
object to it (provided the fragility nits are addressed) and I don't
feel particularly tied to the suggested alternative. At the end of the
day this isn't a lot of code and it's not difficult to change (which it
probably will). I just wanted to make sure the alternative was fairly
considered and to test the reasoning for the approach a bit. I'll
move along from this topic on review of the next version...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-11-18  1:00         ` Dave Chinner
@ 2019-11-19 15:13           ` Brian Foster
  2019-11-19 21:18             ` Dave Chinner
  0 siblings, 1 reply; 72+ messages in thread
From: Brian Foster @ 2019-11-19 15:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Mon, Nov 18, 2019 at 12:00:47PM +1100, Dave Chinner wrote:
> On Fri, Nov 15, 2019 at 12:26:00PM -0500, Brian Foster wrote:
> > On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> > > If so, most of this patch will go away....
> > > 
> > > > > +	 * attached to the buffer so we don't need to do anything more here.
> > > > >  	 */
> > > > > -	if (ip != free_ip) {
> > > > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > > > > -			rcu_read_unlock();
> > > > > -			delay(1);
> > > > > -			goto retry;
> > > > > -		}
> > > > > -
> > > > > -		/*
> > > > > -		 * Check the inode number again in case we're racing with
> > > > > -		 * freeing in xfs_reclaim_inode().  See the comments in that
> > > > > -		 * function for more information as to why the initial check is
> > > > > -		 * not sufficient.
> > > > > -		 */
> > > > > -		if (ip->i_ino != inum) {
> > > > > +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> > > > 
> > > > Is there a correctness reason for why we move the stale check to under
> > > > ilock (in both iflush/ifree)?
> > > 
> > > It's under the i_flags_lock, and so I moved it up under the lookup
> > > hold of the i_flags_lock so we don't need to cycle it again.
> > > 
> > 
> > Yeah, but in both cases it looks like it moved to under the ilock as
> > well, which comes after i_flags_lock. IOW, why grab ilock for stale
> > inodes when we're just going to skip them?
> 
> Because I was worrying about serialising against reclaim before
> changing the state of the inode. i.e. if the inode has already been
> isolated but not yet disposed of, we shouldn't touch the inode state
> at all. Serialisation against reclaim in this patch is via the
> ILOCK, hence we need to do that before setting ISTALE....
> 

Yeah, I think my question still isn't clear... I'm not talking about
setting ISTALE. The code I referenced above is where we test for it and
skip the inode if it is already set. For example, the code referenced
above in xfs_ifree_get_one_inode() currently does the following with
respect to i_flags_lock, ILOCK and XFS_ISTALE:

	...
	spin_lock(i_flags_lock)
	xfs_ilock_nowait(XFS_ILOCK_EXCL)
	if !XFS_ISTALE
		skip
	set XFS_ISTALE
	...

The reclaim isolate code does this, however:

	spin_trylock(i_flags_lock)
	if !XFS_ISTALE
		skip
	xfs_ilock(XFS_ILOCK_EXCL)
	...	

So my question is why not do something like the following in the
_get_one_inode() case?

	...
	spin_lock(i_flags_lock)
	if !XFS_ISTALE
		skip
	xfs_ilock_nowait(XFS_ILOCK_EXCL)
	set XFS_ISTALE
	...

IOW, what is the need, if any, to acquire ilock in the iflush/ifree
paths before testing for XFS_ISTALE? Is there some specific intermediate
state I'm missing or is this just unintentional? The reason I ask is
ilock failure triggers that ugly delay(1) and retry thing, so it seems
slightly weird to allow that for a stale inode we're ultimately going to
skip (regardless of whether that would actually ever occur).
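
To make the difference concrete, a single-threaded userspace sketch of
the two orderings. It models only the ILOCK trylock and the stale flag;
i_flags_lock and RCU are left out, the stale test is written as "skip
if already set" to match the prose description above, and all alt_*
names are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented names; a toy model of the ordering, not the XFS code. */
#define ALT_ISTALE	0x1u

struct alt_inode {
	unsigned int	flags;
	bool		ilock_held;	/* stand-in for the ILOCK */
};

enum alt_result { ALT_SKIP_STALE, ALT_RETRY, ALT_MARKED };

static bool alt_ilock_nowait(struct alt_inode *ip)
{
	assert(ip);
	if (ip->ilock_held)
		return false;
	ip->ilock_held = true;
	return true;
}

/* Trylock first, then test the flag. */
enum alt_result alt_lock_then_test(struct alt_inode *ip)
{
	if (!alt_ilock_nowait(ip))
		return ALT_RETRY;	/* caller would delay(1) and retry */
	if (ip->flags & ALT_ISTALE) {
		ip->ilock_held = false;	/* drop the lock we just took */
		return ALT_SKIP_STALE;
	}
	ip->flags |= ALT_ISTALE;	/* lock stays held for the caller */
	return ALT_MARKED;
}

/* Test the flag first, trylock only if the inode is not stale. */
enum alt_result alt_test_then_lock(struct alt_inode *ip)
{
	if (ip->flags & ALT_ISTALE)
		return ALT_SKIP_STALE;
	if (!alt_ilock_nowait(ip))
		return ALT_RETRY;
	ip->flags |= ALT_ISTALE;	/* lock stays held for the caller */
	return ALT_MARKED;
}
```

For an inode that is both stale and locked by someone else, the first
ordering asks for a retry while the second skips straight away.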

Brian

> IOWs, ISTALE is not protected by ILOCK, we just can't modify the
> inode state until after we've gained the ILOCK to protect against
> reclaim....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-11-19 15:13           ` Brian Foster
@ 2019-11-19 21:18             ` Dave Chinner
  2019-11-20 12:42               ` Brian Foster
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Chinner @ 2019-11-19 21:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Tue, Nov 19, 2019 at 10:13:44AM -0500, Brian Foster wrote:
> On Mon, Nov 18, 2019 at 12:00:47PM +1100, Dave Chinner wrote:
> > On Fri, Nov 15, 2019 at 12:26:00PM -0500, Brian Foster wrote:
> > > On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> > > > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> > > > If so, most of this patch will go away....
> > > > 
> > > > > > +	 * attached to the buffer so we don't need to do anything more here.
> > > > > >  	 */
> > > > > > -	if (ip != free_ip) {
> > > > > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > > > > > -			rcu_read_unlock();
> > > > > > -			delay(1);
> > > > > > -			goto retry;
> > > > > > -		}
> > > > > > -
> > > > > > -		/*
> > > > > > -		 * Check the inode number again in case we're racing with
> > > > > > -		 * freeing in xfs_reclaim_inode().  See the comments in that
> > > > > > -		 * function for more information as to why the initial check is
> > > > > > -		 * not sufficient.
> > > > > > -		 */
> > > > > > -		if (ip->i_ino != inum) {
> > > > > > +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> > > > > 
> > > > > Is there a correctness reason for why we move the stale check to under
> > > > > ilock (in both iflush/ifree)?
> > > > 
> > > > It's under the i_flags_lock, and so I moved it up under the lookup
> > > > hold of the i_flags_lock so we don't need to cycle it again.
> > > > 
> > > 
> > > Yeah, but in both cases it looks like it moved to under the ilock as
> > > well, which comes after i_flags_lock. IOW, why grab ilock for stale
> > > inodes when we're just going to skip them?
> > 
> > Because I was worrying about serialising against reclaim before
> > changing the state of the inode. i.e. if the inode has already been
> > isolated but not yet disposed of, we shouldn't touch the inode state
> > at all. Serialisation against reclaim in this patch is via the
> > ILOCK, hence we need to do that before setting ISTALE....
> > 
> 
> Yeah, I think my question still isn't clear... I'm not talking about
> setting ISTALE. The code I referenced above is where we test for it and
> skip the inode if it is already set. For example, the code referenced
> above in xfs_ifree_get_one_inode() currently does the following with
> respect to i_flags_lock, ILOCK and XFS_ISTALE:
> 
> 	...
> 	spin_lock(i_flags_lock)
> 	xfs_ilock_nowait(XFS_ILOCK_EXCL)
> 	if !XFS_ISTALE
> 		skip
> 	set XFS_ISTALE
> 	...

There is another place in xfs_ifree_cluster that sets ISTALE without
the ILOCK held, so the ILOCK is being used here for a different
purpose...

> The reclaim isolate code does this, however:
> 
> 	spin_trylock(i_flags_lock)
> 	if !XFS_ISTALE
> 		skip
> 	xfs_ilock(XFS_ILOCK_EXCL)
> 	...	

Which is fine, because we're not trying to avoid racing with reclaim
here. :) i.e. all we need is the i_flags lock to check the ISTALE
flag safely.

> So my question is why not do something like the following in the
> _get_one_inode() case?
> 
> 	...
> 	spin_lock(i_flags_lock)
> 	if !XFS_ISTALE
> 		skip
> 	xfs_ilock_nowait(XFS_ILOCK_EXCL)
> 	set XFS_ISTALE
> 	...

Because, like I said, I focussed on the lookup racing with reclaim
first. The above code could be used, but it puts object internal
state checks before we really know whether the object is safe to
access and whether we can trust it.

I'm just following a basic RCU/lockless lookup principle here:
don't try to use object state before you've fully validated that the
object is live and guaranteed that it can be safely referenced.

> IOW, what is the need, if any, to acquire ilock in the iflush/ifree
> paths before testing for XFS_ISTALE? Is there some specific intermediate
> state I'm missing or is this just unintentional?

It's entirely intentional - validate and claim the object we've
found in the lockless lookup, then run the code that checks/changes
the object state. Smashing state checks and lockless lookup
validation together is a nasty landmine to leave behind...
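
As a rough illustration of that ordering, a single-threaded userspace
sketch with three distinct phases: validate identity, claim the object,
and only then look at its state. It assumes the inode number is zeroed
when an inode is freed, stands a plain bool in for the ILOCK, and omits
i_flags_lock and RCU entirely; the sk_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; a model of the ordering, not the XFS code. */
#define SK_ISTALE	0x1u

struct sk_inode {
	unsigned long	ino;		/* assumed zeroed once freed */
	unsigned int	flags;
	bool		ilock_held;	/* stand-in for the ILOCK */
};

enum sk_result { SK_SKIP_REUSED, SK_SKIP_BUSY, SK_SKIP_STALE, SK_CLAIMED };

enum sk_result sk_claim_inode(struct sk_inode *ip, unsigned long inum)
{
	assert(ip);

	/* Phase 1: is this still the object the lookup was after? */
	if (ip->ino != inum)
		return SK_SKIP_REUSED;

	/* Phase 2: claim it so reclaim cannot change it under us. */
	if (ip->ilock_held)
		return SK_SKIP_BUSY;
	ip->ilock_held = true;

	/* Phase 3: only now inspect and modify internal state. */
	if (ip->flags & SK_ISTALE) {
		ip->ilock_held = false;
		return SK_SKIP_STALE;
	}
	ip->flags |= SK_ISTALE;
	return SK_CLAIMED;		/* returned with the lock still held */
}
```

Keeping phase 3 strictly after phases 1 and 2 costs a lock/unlock on
already-stale inodes, which is the trade-off being questioned.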

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 28/28] xfs: rework unreferenced inode lookups
  2019-11-19 21:18             ` Dave Chinner
@ 2019-11-20 12:42               ` Brian Foster
  0 siblings, 0 replies; 72+ messages in thread
From: Brian Foster @ 2019-11-20 12:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm, linux-kernel

On Wed, Nov 20, 2019 at 08:18:34AM +1100, Dave Chinner wrote:
> On Tue, Nov 19, 2019 at 10:13:44AM -0500, Brian Foster wrote:
> > On Mon, Nov 18, 2019 at 12:00:47PM +1100, Dave Chinner wrote:
> > > On Fri, Nov 15, 2019 at 12:26:00PM -0500, Brian Foster wrote:
> > > > On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> > > > > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> > > > > If so, most of this patch will go away....
> > > > > 
> > > > > > > +	 * attached to the buffer so we don't need to do anything more here.
> > > > > > >  	 */
> > > > > > > -	if (ip != free_ip) {
> > > > > > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > > > > > > -			rcu_read_unlock();
> > > > > > > -			delay(1);
> > > > > > > -			goto retry;
> > > > > > > -		}
> > > > > > > -
> > > > > > > -		/*
> > > > > > > -		 * Check the inode number again in case we're racing with
> > > > > > > -		 * freeing in xfs_reclaim_inode().  See the comments in that
> > > > > > > -		 * function for more information as to why the initial check is
> > > > > > > -		 * not sufficient.
> > > > > > > -		 */
> > > > > > > -		if (ip->i_ino != inum) {
> > > > > > > +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> > > > > > 
> > > > > > Is there a correctness reason for why we move the stale check to under
> > > > > > ilock (in both iflush/ifree)?
> > > > > 
> > > > > It's under the i_flags_lock, and so I moved it up under the lookup
> > > > > hold of the i_flags_lock so we don't need to cycle it again.
> > > > > 
> > > > 
> > > > Yeah, but in both cases it looks like it moved to under the ilock as
> > > > well, which comes after i_flags_lock. IOW, why grab ilock for stale
> > > > inodes when we're just going to skip them?
> > > 
> > > Because I was worrying about serialising against reclaim before
> > > changing the state of the inode. i.e. if the inode has already been
> > > isolated but not yet disposed of, we shouldn't touch the inode state
> > > at all. Serialisation against reclaim in this patch is via the
> > > ILOCK, hence we need to do that before setting ISTALE....
> > > 
> > 
> > Yeah, I think my question still isn't clear... I'm not talking about
> > setting ISTALE. The code I referenced above is where we test for it and
> > skip the inode if it is already set. For example, the code referenced
> > above in xfs_ifree_get_one_inode() currently does the following with
> > respect to i_flags_lock, ILOCK and XFS_ISTALE:
> > 
> > 	...
> > 	spin_lock(i_flags_lock)
> > 	xfs_ilock_nowait(XFS_ILOCK_EXCL)
> > 	if !XFS_ISTALE
> > 		skip
> > 	set XFS_ISTALE
> > 	...
> 
> There is another place in xfs_ifree_cluster that sets ISTALE without
> the ILOCK held, so the ILOCK is being used here for a different
> purpose...
> 
> > The reclaim isolate code does this, however:
> > 
> > 	spin_trylock(i_flags_lock)
> > 	if !XFS_ISTALE
> > 		skip
> > 	xfs_ilock(XFS_ILOCK_EXCL)
> > 	...	
> 
> Which is fine, because we're not trying to avoid racing with reclaim
> here. :) i.e. all we need is the i_flags lock to check the ISTALE
> flag safely.
> 
> > So my question is why not do something like the following in the
> > _get_one_inode() case?
> > 
> > 	...
> > 	spin_lock(i_flags_lock)
> > 	if !XFS_ISTALE
> > 		skip
> > 	xfs_ilock_nowait(XFS_ILOCK_EXCL)
> > 	set XFS_ISTALE
> > 	...
> 
> Because, like I said, I focussed on the lookup racing with reclaim
> first. The above code could be used, but it puts object internal
> state checks before we really know whether the object is safe to
> access and whether we can trust it.
> 
> I'm just following a basic RCU/lockless lookup principle here:
> don't try to use object state before you've fully validated that the
> object is live and guaranteed that it can be safely referenced.
> 
> > IOW, what is the need, if any, to acquire ilock in the iflush/ifree
> > paths before testing for XFS_ISTALE? Is there some specific intermediate
> > state I'm missing or is this just unintentional?
> 
> It's entirely intentional - validate and claim the object we've
> found in the lockless lookup, then run the code that checks/changes
> the object state. Smashing state checks and lockless lookup
> validation together is a nasty landmine to leave behind...
> 

Ok, so this is intentional, but the purpose is simplification vs.
technically being part of the lookup dance. I'm not sure I see the
advantage given that IMO this trades off one landmine for another, but
I'm not worried that much about it as long as the code is correct.

I guess we'll see how things change after reevaluation of the whole
holding ilock across contexts behavior, but if we do end up with a
similar pattern in the iflush/ifree paths please document that
explicitly in the comments. Otherwise in a patch that swizzles this code
around and explicitly plays games with ilock, the intent of this
particular change is not clear to somebody reading the code IMO. In
fact, I think it might be interesting to see if we could define a couple
helpers (located closer to the reclaim code) to perform an unreferenced
lookup/release of an inode, but that is secondary to nailing down the
fundamental rules.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



Thread overview: 72+ messages
2019-10-31 23:45 [PATCH 00/28] mm, xfs: non-blocking inode reclaim Dave Chinner
2019-10-31 23:45 ` [PATCH 01/28] xfs: Lower CIL flush limit for large logs Dave Chinner
2019-10-31 23:45 ` [PATCH 02/28] xfs: Throttle commits on delayed background CIL push Dave Chinner
2019-11-01 12:04   ` Brian Foster
2019-11-01 21:40     ` Dave Chinner
2019-11-04 22:48       ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 03/28] xfs: don't allow log IO to be throttled Dave Chinner
2019-10-31 23:45 ` [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability Dave Chinner
2019-11-01 12:05   ` Brian Foster
2019-11-04 23:21   ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 05/28] xfs: correctly acount for reclaimable slabs Dave Chinner
2019-10-31 23:45 ` [PATCH 06/28] xfs: factor common AIL item deletion code Dave Chinner
2019-11-04 23:16   ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 07/28] xfs: tail updates only need to occur when LSN changes Dave Chinner
2019-11-04 23:18   ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 08/28] xfs: factor inode lookup from xfs_ifree_cluster Dave Chinner
2019-11-01 12:05   ` Brian Foster
2019-11-04 23:20   ` Darrick J. Wong
2019-10-31 23:45 ` [PATCH 09/28] mm: directed shrinker work deferral Dave Chinner
2019-11-04 15:25   ` Brian Foster
2019-11-14 20:49     ` Dave Chinner
2019-11-15 17:21       ` Brian Foster
2019-11-18  0:49         ` Dave Chinner
2019-11-19 15:12           ` Brian Foster
2019-10-31 23:46 ` [PATCH 10/28] shrinkers: use defer_work for GFP_NOFS sensitive shrinkers Dave Chinner
2019-10-31 23:46 ` [PATCH 11/28] mm: factor shrinker work calculations Dave Chinner
2019-11-02 10:55   ` kbuild test robot
2019-11-04 15:29   ` Brian Foster
2019-11-14 20:59     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 12/28] shrinker: defer work only to kswapd Dave Chinner
2019-11-04 15:29   ` Brian Foster
2019-11-14 21:11     ` Dave Chinner
2019-11-15 17:23       ` Brian Foster
2019-10-31 23:46 ` [PATCH 13/28] shrinker: clean up variable types and tracepoints Dave Chinner
2019-11-04 15:30   ` Brian Foster
2019-10-31 23:46 ` [PATCH 14/28] mm: reclaim_state records pages reclaimed, not slabs Dave Chinner
2019-11-04 19:58   ` Brian Foster
2019-10-31 23:46 ` [PATCH 15/28] mm: back off direct reclaim on excessive shrinker deferral Dave Chinner
2019-11-04 19:58   ` Brian Foster
2019-11-14 21:28     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 16/28] mm: kswapd backoff for shrinkers Dave Chinner
2019-11-04 19:58   ` Brian Foster
2019-11-14 21:41     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 17/28] xfs: synchronous AIL pushing Dave Chinner
2019-11-05 17:05   ` Brian Foster
2019-10-31 23:46 ` [PATCH 18/28] xfs: don't block kswapd in inode reclaim Dave Chinner
2019-10-31 23:46 ` [PATCH 19/28] xfs: reduce kswapd blocking on inode locking Dave Chinner
2019-11-05 17:05   ` Brian Foster
2019-10-31 23:46 ` [PATCH 20/28] xfs: kill background reclaim work Dave Chinner
2019-11-05 17:05   ` Brian Foster
2019-10-31 23:46 ` [PATCH 21/28] xfs: use AIL pushing for inode reclaim IO Dave Chinner
2019-11-05 17:06   ` Brian Foster
2019-10-31 23:46 ` [PATCH 22/28] xfs: remove mode from xfs_reclaim_inodes() Dave Chinner
2019-10-31 23:46 ` [PATCH 23/28] xfs: track reclaimable inodes using a LRU list Dave Chinner
2019-10-31 23:46 ` [PATCH 24/28] xfs: reclaim inodes from the LRU Dave Chinner
2019-11-06 17:21   ` Brian Foster
2019-11-14 21:51     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 25/28] xfs: remove unusued old inode reclaim code Dave Chinner
2019-11-06 17:21   ` Brian Foster
2019-10-31 23:46 ` [PATCH 26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes Dave Chinner
2019-11-06 17:22   ` Brian Foster
2019-11-14 21:53     ` Dave Chinner
2019-10-31 23:46 ` [PATCH 27/28] rwsem: introduce down/up_write_non_owner Dave Chinner
2019-10-31 23:46 ` [PATCH 28/28] xfs: rework unreferenced inode lookups Dave Chinner
2019-11-06 22:18   ` Brian Foster
2019-11-14 22:16     ` Dave Chinner
2019-11-15 13:13       ` Christoph Hellwig
2019-11-15 17:26       ` Brian Foster
2019-11-18  1:00         ` Dave Chinner
2019-11-19 15:13           ` Brian Foster
2019-11-19 21:18             ` Dave Chinner
2019-11-20 12:42               ` Brian Foster
