All of lore.kernel.org
 help / color / mirror / Atom feed
* + vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch added to -mm tree
@ 2021-06-14 22:00 akpm
  0 siblings, 0 replies; 2+ messages in thread
From: akpm @ 2021-06-14 22:00 UTC (permalink / raw)
  To: david, guro, hannes, mm-commits, tj


The patch titled
     Subject: vfs: keep inodes with page cache off the inode shrinker LRU
has been added to the -mm tree.  Its filename is
     vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch

This patch should soon appear at
    https://ozlabs.org/~akpm/mmots/broken-out/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch
and later at
    https://ozlabs.org/~akpm/mmotm/broken-out/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: vfs: keep inodes with page cache off the inode shrinker LRU

Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache.  This caused
problems on highmem hosts: struct inode could put fill lowmem zones before
the cache was getting reclaimed in the highmem zones.

To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem.  However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because the
inodes are not currently open or dirty - think working with a large git
tree.  It further doesn't respect cgroup memory protection settings and
can cause priority inversions between containers.

Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults.  We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior.  We also use it to quantify and report workload health through
psi.  The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.

The consequences of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned scenarios,
with the workload generally entering unexpected thrashing states while
losing the ability to reliably detect it.

To fix this on non-highmem systems at least, going back to rotating inodes
on the LRU isn't feasible.  We've tried (commit a76cf1a474d7 ("mm: don't
reclaim inodes with many attached pages")) and failed (commit 69056ee6a8a3
("Revert "mm: don't reclaim inodes with many attached pages"")).  The
issue is mostly that shrinker pools attract pressure based on their size,
and when objects get skipped the shrinkers remember this as deferred
reclaim work.  This accumulates excessive pressure on the remaining
inodes, and we can quickly eat into heavily used ones, or dirty ones that
require IO to reclaim, when there potentially is plenty of cold, clean
cache around still.

Instead, this patch keeps populated inodes off the inode LRU in the first
place - just like an open file or dirty state would.  An otherwise clean
and unused inode then gets queued when the last cache entry disappears. 
This solves the problem without reintroducing the reclaim issues, and
generally is a bit more scalable than having to wade through potentially
hundreds of thousands of busy inodes.

Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the irq-safe
page cache lock (i_pages.xa_lock).  Page cache deletions are serialized
through i_lock, taken before the i_pages lock, to make sure depopulated
inodes are queued reliably.  Additions may race with deletions, but we'll
check again in the shrinker.  If additions race with the shrinker itself,
we're protected by the i_lock: if find_inode() or iput() win, the shrinker
will bail on the elevated i_count or I_REFERENCED; if the shrinker wins
and goes ahead with the inode, it will set I_FREEING and inhibit further
igets(), which will cause the other side to create a new instance of the
inode instead.

Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/inode.c              |   46 ++++++++++++++++++++--------------
 fs/internal.h           |    1 
 include/linux/fs.h      |    1 
 include/linux/pagemap.h |   50 ++++++++++++++++++++++++++++++++++++++
 mm/filemap.c            |    8 ++++++
 mm/truncate.c           |   19 ++++++++++++--
 mm/vmscan.c             |    7 +++++
 mm/workingset.c         |   10 +++++++
 8 files changed, 120 insertions(+), 22 deletions(-)

--- a/fs/inode.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/fs/inode.c
@@ -424,11 +424,20 @@ void ihold(struct inode *inode)
 }
 EXPORT_SYMBOL(ihold);
 
-static void inode_lru_list_add(struct inode *inode)
+static void __inode_add_lru(struct inode *inode, bool rotate)
 {
+	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
+		return;
+	if (atomic_read(&inode->i_count))
+		return;
+	if (!(inode->i_sb->s_flags & SB_ACTIVE))
+		return;
+	if (!mapping_shrinkable(&inode->i_data))
+		return;
+
 	if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_inc(nr_unused);
-	else
+	else if (rotate)
 		inode->i_state |= I_REFERENCED;
 }
 
@@ -439,16 +448,11 @@ static void inode_lru_list_add(struct in
  */
 void inode_add_lru(struct inode *inode)
 {
-	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
-				I_FREEING | I_WILL_FREE)) &&
-	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
-		inode_lru_list_add(inode);
+	__inode_add_lru(inode, false);
 }
 
-
 static void inode_lru_list_del(struct inode *inode)
 {
-
 	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_dec(nr_unused);
 }
@@ -724,10 +728,6 @@ again:
 /*
  * Isolate the inode from the LRU in preparation for freeing it.
  *
- * Any inodes which are pinned purely because of attached pagecache have their
- * pagecache removed.  If the inode has metadata buffers attached to
- * mapping->private_list then try to remove them.
- *
  * If the inode has the I_REFERENCED flag set, then it means that it has been
  * used recently - the flag is set in iput_final(). When we encounter such an
  * inode, clear the flag and move it to the back of the LRU so it gets another
@@ -743,31 +743,39 @@ static enum lru_status inode_lru_isolate
 	struct inode	*inode = container_of(item, struct inode, i_lru);
 
 	/*
-	 * we are inverting the lru lock/inode->i_lock here, so use a trylock.
-	 * If we fail to get the lock, just skip it.
+	 * We are inverting the lru lock/inode->i_lock here, so use a
+	 * trylock. If we fail to get the lock, just skip it.
 	 */
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
 	/*
-	 * Referenced or dirty inodes are still in use. Give them another pass
-	 * through the LRU as we canot reclaim them now.
+	 * Inodes can get referenced, redirtied, or repopulated while
+	 * they're already on the LRU, and this can make them
+	 * unreclaimable for a while. Remove them lazily here; iput,
+	 * sync, or the last page cache deletion will requeue them.
 	 */
 	if (atomic_read(&inode->i_count) ||
-	    (inode->i_state & ~I_REFERENCED)) {
+	    (inode->i_state & ~I_REFERENCED) ||
+	    !mapping_shrinkable(&inode->i_data)) {
 		list_lru_isolate(lru, &inode->i_lru);
 		spin_unlock(&inode->i_lock);
 		this_cpu_dec(nr_unused);
 		return LRU_REMOVED;
 	}
 
-	/* recently referenced inodes get one more pass */
+	/* Recently referenced inodes get one more pass */
 	if (inode->i_state & I_REFERENCED) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;
 	}
 
+	/*
+	 * On highmem systems, mapping_shrinkable() permits dropping
+	 * page cache in order to free up struct inodes: lowmem might
+	 * be under pressure before the cache inside the highmem zone.
+	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
@@ -1634,7 +1642,7 @@ static void iput_final(struct inode *ino
 	if (!drop &&
 	    !(inode->i_state & I_DONTCACHE) &&
 	    (sb->s_flags & SB_ACTIVE)) {
-		inode_add_lru(inode);
+		__inode_add_lru(inode, true);
 		spin_unlock(&inode->i_lock);
 		return;
 	}
--- a/fs/internal.h~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/fs/internal.h
@@ -146,7 +146,6 @@ extern int vfs_open(const struct path *,
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
-extern void inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
--- a/include/linux/fs.h~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/include/linux/fs.h
@@ -3214,6 +3214,7 @@ static inline void remove_inode_hash(str
 }
 
 extern void inode_sb_list_add(struct inode *inode);
+extern void inode_add_lru(struct inode *inode);
 
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
--- a/include/linux/pagemap.h~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/include/linux/pagemap.h
@@ -24,6 +24,56 @@ static inline bool mapping_empty(struct
 }
 
 /*
+ * mapping_shrinkable - test if page cache state allows inode reclaim
+ * @mapping: the page cache mapping
+ *
+ * This checks the mapping's cache state for the pupose of inode
+ * reclaim and LRU management.
+ *
+ * The caller is expected to hold the i_lock, but is not required to
+ * hold the i_pages lock, which usually protects cache state. That's
+ * because the i_lock and the list_lru lock that protect the inode and
+ * its LRU state don't nest inside the irq-safe i_pages lock.
+ *
+ * Cache deletions are performed under the i_lock, which ensures that
+ * when an inode goes empty, it will reliably get queued on the LRU.
+ *
+ * Cache additions do not acquire the i_lock and may race with this
+ * check, in which case we'll report the inode as shrinkable when it
+ * has cache pages. This is okay: the shrinker also checks the
+ * refcount and the referenced bit, which will be elevated or set in
+ * the process of adding new cache pages to an inode.
+ */
+static inline bool mapping_shrinkable(struct address_space *mapping)
+{
+	void *head;
+
+	/*
+	 * On highmem systems, there could be lowmem pressure from the
+	 * inodes before there is highmem pressure from the page
+	 * cache. Make inodes shrinkable regardless of cache state.
+	 */
+	if (IS_ENABLED(CONFIG_HIGHMEM))
+		return true;
+
+	/* Cache completely empty? Shrink away. */
+	head = rcu_access_pointer(mapping->i_pages.xa_head);
+	if (!head)
+		return true;
+
+	/*
+	 * The xarray stores single offset-0 entries directly in the
+	 * head pointer, which allows non-resident page cache entries
+	 * to escape the shadow shrinker's list of xarray nodes. The
+	 * inode shrinker needs to pick them up under memory pressure.
+	 */
+	if (!xa_is_node(head) && xa_is_value(head))
+		return true;
+
+	return false;
+}
+
+/*
  * Bits in mapping->flags.
  */
 enum mapping_flags {
--- a/mm/filemap.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/filemap.c
@@ -260,9 +260,13 @@ void delete_from_page_cache(struct page
 	struct address_space *mapping = page_mapping(page);
 
 	BUG_ON(!PageLocked(page));
+	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	__delete_from_page_cache(page, NULL);
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 
 	page_cache_free_page(mapping, page);
 }
@@ -338,6 +342,7 @@ void delete_from_page_cache_batch(struct
 	if (!pagevec_count(pvec))
 		return;
 
+	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		trace_mm_filemap_delete_from_page_cache(pvec->pages[i]);
@@ -346,6 +351,9 @@ void delete_from_page_cache_batch(struct
 	}
 	page_cache_delete_batch(mapping, pvec);
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 
 	for (i = 0; i < pagevec_count(pvec); i++)
 		page_cache_free_page(mapping, pvec->pages[i]);
--- a/mm/truncate.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/truncate.c
@@ -45,9 +45,13 @@ static inline void __clear_shadow_entry(
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
 			       void *entry)
 {
+	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	__clear_shadow_entry(mapping, index, entry);
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 }
 
 /*
@@ -73,8 +77,10 @@ static void truncate_exceptional_pvec_en
 		return;
 
 	dax = dax_mapping(mapping);
-	if (!dax)
+	if (!dax) {
+		spin_lock(&mapping->host->i_lock);
 		xa_lock_irq(&mapping->i_pages);
+	}
 
 	for (i = j; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
@@ -93,8 +99,12 @@ static void truncate_exceptional_pvec_en
 		__clear_shadow_entry(mapping, index, page);
 	}
 
-	if (!dax)
+	if (!dax) {
 		xa_unlock_irq(&mapping->i_pages);
+		if (mapping_shrinkable(mapping))
+			inode_add_lru(mapping->host);
+		spin_unlock(&mapping->host->i_lock);
+	}
 	pvec->nr = j;
 }
 
@@ -566,6 +576,7 @@ invalidate_complete_page2(struct address
 	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
+	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	if (PageDirty(page))
 		goto failed;
@@ -573,6 +584,9 @@ invalidate_complete_page2(struct address
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -581,6 +595,7 @@ invalidate_complete_page2(struct address
 	return 1;
 failed:
 	xa_unlock_irq(&mapping->i_pages);
+	spin_unlock(&mapping->host->i_lock);
 	return 0;
 }
 
--- a/mm/vmscan.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/vmscan.c
@@ -1055,6 +1055,8 @@ static int __remove_mapping(struct addre
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
+	if (!PageSwapCache(page))
+		spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	/*
 	 * The non racy check for a busy page.
@@ -1123,6 +1125,9 @@ static int __remove_mapping(struct addre
 			shadow = workingset_eviction(page, target_memcg);
 		__delete_from_page_cache(page, shadow);
 		xa_unlock_irq(&mapping->i_pages);
+		if (mapping_shrinkable(mapping))
+			inode_add_lru(mapping->host);
+		spin_unlock(&mapping->host->i_lock);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1132,6 +1137,8 @@ static int __remove_mapping(struct addre
 
 cannot_free:
 	xa_unlock_irq(&mapping->i_pages);
+	if (!PageSwapCache(page))
+		spin_unlock(&mapping->host->i_lock);
 	return 0;
 }
 
--- a/mm/workingset.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/workingset.c
@@ -540,6 +540,13 @@ static enum lru_status shadow_lru_isolat
 		goto out;
 	}
 
+	if (!spin_trylock(&mapping->host->i_lock)) {
+		xa_unlock(&mapping->i_pages);
+		spin_unlock_irq(lru_lock);
+		ret = LRU_RETRY;
+		goto out;
+	}
+
 	list_lru_isolate(lru, item);
 	__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
 
@@ -559,6 +566,9 @@ static enum lru_status shadow_lru_isolat
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();
_

Patches currently in -mm which might be from hannes@cmpxchg.org are

mm-remove-irqsave-restore-locking-from-contexts-with-irqs-enabled.patch
fs-drop_caches-fix-skipping-over-shadow-cache-inodes.patch
fs-inode-count-invalidated-shadow-pages-in-pginodesteal.patch
vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch


^ permalink raw reply	[flat|nested] 2+ messages in thread

* + vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch added to -mm tree
  2020-05-08  1:35 incoming Andrew Morton
@ 2020-05-13 21:26 ` Andrew Morton
  0 siblings, 0 replies; 2+ messages in thread
From: Andrew Morton @ 2020-05-13 21:26 UTC (permalink / raw)
  To: david, guro, hannes, laoar.shao, mhocko, mm-commits, viro


The patch titled
     Subject: Re: vfs: keep inodes with page cache off the inode shrinker LRU
has been added to the -mm tree.  Its filename is
     vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: vfs: keep inodes with page cache off the inode shrinker LRU

The VFS inode shrinker is currently allowed to reclaim cold inodes with
populated page cache.  This behavior goes back to CONFIG_HIGHMEM setups,
which required the ability to drop page cache in large highem zones to
free up struct inodes in comparatively tiny lowmem zones.

However, it has significant side effects that are hard to justify on
systems without highmem:

- It can drop gigabytes of hot and active page cache on the floor
  without consulting the VM (recorded as "inodesteal" events in
  /proc/vmstat).  Such an "aging inversion" between unreferenced inodes
  holding hot cache easily happens in practice: for example, a git tree
  whose objects are accessed frequently but no open file descriptors are
  maintained throughout.

- It can als prematurely drop non-resident info stored inside the
  inodes, which can let massive cache thrashing go unnoticed.  This in
  turn means we're making the wrong decisions in page reclaim and can get
  stuck in pathological thrashing.  The thrashing also doesn't show up as
  memory pressure in psi, causing failure in the userspace OOM detection
  chain, which can leave a system (or Android device e.g.) stranded in a
  livelock.

History

As this keeps causing problems for people, there have been several
attempts to address this.

One recent attempt was to make the inode shrinker simply skip over inodes
that still contain pages: a76cf1a474d7 ("mm: don't reclaim inodes with
many attached pages").

However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
don't reclaim inodes with many attached pages"") because it caused severe
reclaim performance problems: Inodes that sit on the shrinker LRU are
attracting reclaim pressure away from the page cache and toward the VFS. 
If we then permanently exempt sizable portions of this pool from actually
getting reclaimed when looked at, this pressure accumulates as deferred
shrinker work (a mechanism for *temporarily* unreclaimable objects) until
it causes mayhem in the VFS cache pools.

In the bug quoted in 69056ee6a8a3 in particular, the excessive pressure
drove the XFS shrinker into dirty objects, where it caused synchronous,
IO-bound stalls, even as there was plenty of clean page cache that should
have been reclaimed instead.

Another variant of this problem was recently observed, where the kernel
violates cgroups' memory.low protection settings and reclaims page cache
way beyond the configured thresholds.  It was followed by a proposal of a
modified form of the reverted commit above, that implements
memory.low-sensitive shrinker skipping over populated inodes on the LRU
[1].  However, this proposal continues to run the risk of attracting
disproportionate reclaim pressure to a pool of still-used inodes, while
not addressing the more generic reclaim inversion problem outside of a
very specific cgroup application.

[1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/

Solution

This patch fixes the aging inversion described above on !CONFIG_HIGHMEM
systems, without reintroducing the problems associated with excessive
shrinker LRU rotations, by keeping populated inodes off the shrinker LRUs
entirely.

Currently, inodes are kept off the shrinker LRU as long as they have an
elevated i_count, indicating an active user.  Unfortunately, the page
cache cannot simply hold an i_count reference, because unlink() *should*
result in the inode being dropped and its cache invalidated.

Instead, this patch makes iput_final() consult the state of the page cache
and punt the LRU linking to the VM if the inode is still populated; the VM
in turn checks the inode state when it depopulates the page cache, and
adds the inode to the LRU if necessary.

This is not unlike what we do for dirty inodes, which are moved off the
LRU permanently until writeback completion puts them back on (iff still
unused).  We can reuse the same code -- inode_add_lru() - here.

This is also not unlike page reclaim, where the lower VM layer has to
negotiate state with the higher VFS layer.  Follow existing precedence and
handle the inversion as much as possible on the VM side:

- introduce an I_PAGES flag that the VM maintains under the i_lock, so
  that any inode code holding that lock can check the page cache state
  without having to lock and inspect the struct address_space

- introduce inode_pages_set() and inode_pages_clear() to maintain the
  inode LRU state from the VM side, then update all cache mutators to use
  them when populating the first cache entry or clearing the last

With this, the concept of "inodesteal" - where the inode shrinker drops
page cache - is relegated to CONFIG_HIGHMEM systems only.  The VM is in
charge of the cache, the shrinker in charge of struct inode.

Footnotes

- For debuggability, add vmstat counters that track the number of times
  a new cache entry pulls a previously unused inode off the LRU
  (pginoderescue), as well as how many times existing cache deferred an
  LRU addition.  Keep the pginodesteal/kswapd_inodesteal counters for
  backwards compatibility, but they'll just show 0 now.

- Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
  cache.  Not doing so has always been a bit strange, but since most
  people drop cache and metadata cache together, the inode shrinker would
  have taken care of them before - no more, so do it VM-side.

Link: http://lkml.kernel.org/r/20200512212936.GA450429@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/block_dev.c                |    2 
 fs/dax.c                      |   14 ++++
 fs/drop_caches.c              |    2 
 fs/inode.c                    |  112 +++++++++++++++++++++++++++++---
 fs/internal.h                 |    2 
 include/linux/fs.h            |   17 ++++
 include/linux/pagemap.h       |    2 
 include/linux/vm_event_item.h |    3 
 mm/filemap.c                  |   37 ++++++++--
 mm/huge_memory.c              |    3 
 mm/truncate.c                 |   34 +++++++--
 mm/vmscan.c                   |    6 +
 mm/vmstat.c                   |    4 -
 mm/workingset.c               |    4 +
 14 files changed, 208 insertions(+), 34 deletions(-)

--- a/fs/block_dev.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/fs/block_dev.c
@@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		return;
 
 	invalidate_bh_lrus();
--- a/fs/dax.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/fs/dax.c
@@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct x
 {
 	unsigned long index = xas->xa_index;
 	bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
+	int populated;
 	void *entry;
 
 retry:
+	populated = 0;
 	xas_lock_irq(xas);
 	entry = get_unlocked_entry(xas, order);
 
@@ -526,6 +528,8 @@ retry:
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, true);
 		mapping->nrexceptional--;
+		if (mapping_empty(mapping))
+			populated = -1;
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -541,11 +545,17 @@ retry:
 		dax_lock_entry(xas, entry);
 		if (xas_error(xas))
 			goto out_unlock;
+		if (mapping_empty(mapping))
+			populated++;
 		mapping->nrexceptional++;
 	}
 
 out_unlock:
 	xas_unlock_irq(xas);
+	if (populated == -1)
+		inode_pages_clear(mapping->inode);
+	else if (populated == 1)
+		inode_pages_set(mapping->inode);
 	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
 		goto retry;
 	if (xas->xa_node == XA_ERROR(-ENOMEM))
@@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct
 					  pgoff_t index, bool trunc)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
+	bool empty = false;
 	int ret = 0;
 	void *entry;
 
@@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct
 	dax_disassociate_entry(entry, mapping, trunc);
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+	empty = mapping_empty(mapping);
 	ret = 1;
 out:
 	put_unlocked_entry(&xas, entry);
 	xas_unlock_irq(&xas);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	return ret;
 }
 
--- a/fs/drop_caches.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/fs/drop_caches.c
@@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct sup
 		 * we need to reschedule to avoid softlockups.
 		 */
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0 && !need_resched())) {
+		    (mapping_empty(inode->i_mapping) && !need_resched())) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
--- a/fs/inode.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/fs/inode.c
@@ -431,26 +431,101 @@ static void inode_lru_list_add(struct in
 		inode->i_state |= I_REFERENCED;
 }
 
+static void inode_lru_list_del(struct inode *inode)
+{
+	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
+		this_cpu_dec(nr_unused);
+}
+
 /*
  * Add inode to LRU if needed (inode is unused and clean).
  *
  * Needs inode->i_lock held.
  */
-void inode_add_lru(struct inode *inode)
+bool inode_add_lru(struct inode *inode)
 {
-	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
-				I_FREEING | I_WILL_FREE)) &&
-	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
-		inode_lru_list_add(inode);
+	if (inode->i_state &
+	    (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES))
+		return false;
+	if (atomic_read(&inode->i_count))
+		return false;
+	if (!(inode->i_sb->s_flags & SB_ACTIVE))
+		return false;
+	inode_lru_list_add(inode);
+	return true;
 }
 
+/*
+ * Usually, inodes become reclaimable when they are no longer
+ * referenced and their page cache has been reclaimed. The following
+ * API allows the VM to communicate cache population state to the VFS.
+ *
+ * However, on CONFIG_HIGHMEM we can't wait for the page cache to go
+ * away: cache pages allocated in a large highmem zone could pin
+ * struct inode memory allocated in relatively small lowmem zones. So
+ * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime.
+ */
 
-static void inode_lru_list_del(struct inode *inode)
+#ifndef CONFIG_HIGHMEM
+/**
+ * inode_pages_set - mark the inode as holding page cache
+ * @inode: the inode whose first cache page was just added
+ *
+ * Tell the VFS that this inode has populated page cache and must not
+ * be reclaimed by the inode shrinker.
+ *
+ * The caller must hold the page lock of the just-added page: by
+ * pinning the page, the page cache cannot become depopulated, and we
+ * can safely set I_PAGES without a race check under the i_pages lock.
+ *
+ * This function acquires the i_lock.
+ */
+void inode_pages_set(struct inode *inode)
 {
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_PAGES)) {
+		inode->i_state |= I_PAGES;
+		if (!list_empty(&inode->i_lru)) {
+			count_vm_event(PGINODERESCUE);
+			inode_lru_list_del(inode);
+		}
+	}
+	spin_unlock(&inode->i_lock);
+}
 
-	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
-		this_cpu_dec(nr_unused);
+/**
+ * inode_pages_clear - mark the inode as not holding page cache
+ * @inode: the inode whose last cache page was just removed
+ *
+ * Tell the VFS that the inode no longer holds page cache and that its
+ * lifetime is to be handed over to the inode shrinker LRU.
+ *
+ * This function acquires the i_lock and the i_pages lock.
+ */
+void inode_pages_clear(struct inode *inode)
+{
+	struct address_space *mapping = &inode->i_data;
+	bool add_to_lru = false;
+	unsigned long flags;
+
+	spin_lock(&inode->i_lock);
+
+	xa_lock_irqsave(&mapping->i_pages, flags);
+	if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) {
+		inode->i_state &= ~I_PAGES;
+		add_to_lru = true;
+	}
+	xa_unlock_irqrestore(&mapping->i_pages, flags);
+
+	if (add_to_lru) {
+		WARN_ON_ONCE(!list_empty(&inode->i_lru));
+		if (inode_add_lru(inode))
+			__count_vm_event(PGINODEDELAYED);
+	}
+
+	spin_unlock(&inode->i_lock);
 }
+#endif /* !CONFIG_HIGHMEM */
 
 /**
  * inode_sb_list_add - add inode to the superblock list of inodes
@@ -743,6 +818,8 @@ static enum lru_status inode_lru_isolate
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+	WARN_ON_ONCE(inode->i_state & I_PAGES);
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
@@ -762,7 +839,18 @@ static enum lru_status inode_lru_isolate
 		return LRU_ROTATE;
 	}
 
-	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+	/*
+	 * Usually, populated inodes shouldn't be on the shrinker LRU,
+	 * but they can be briefly visible when a new page is added to
+	 * an inode that was already linked but inode_pages_set()
+	 * hasn't run yet to move them off.
+	 *
+	 * The other exception is on HIGHMEM systems: highmem cache
+	 * can pin lowmem struct inodes, and we might be in dire
+	 * straits in the lower zones. Purge cache to free the inode.
+	 */
+	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
+#ifdef CONFIG_HIGHMEM
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(lru_lock);
@@ -779,6 +867,12 @@ static enum lru_status inode_lru_isolate
 		iput(inode);
 		spin_lock(lru_lock);
 		return LRU_RETRY;
+#else
+		list_lru_isolate(lru, &inode->i_lru);
+		spin_unlock(&inode->i_lock);
+		this_cpu_dec(nr_unused);
+		return LRU_REMOVED;
+#endif
 	}
 
 	WARN_ON(inode->i_state & I_NEW);
--- a/fs/internal.h~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/fs/internal.h
@@ -137,7 +137,7 @@ extern int vfs_open(const struct path *,
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
-extern void inode_add_lru(struct inode *inode);
+extern bool inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
--- a/include/linux/fs.h~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/include/linux/fs.h
@@ -592,6 +592,11 @@ static inline void mapping_allow_writabl
 	atomic_inc(&mapping->i_mmap_writable);
 }
 
+static inline bool mapping_empty(struct address_space *mapping)
+{
+	return mapping->nrpages + mapping->nrexceptional == 0;
+}
+
 /*
  * Use sequence counter to get consistent i_size on 32-bit processors.
  */
@@ -2162,6 +2167,9 @@ static inline void kiocb_clone(struct ki
  *
  * I_CREATING		New object's inode in the middle of setting up.
  *
+ * I_PAGES		Inode is holding page cache that needs to get reclaimed
+ *			first before the inode can go onto the shrinker LRU.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2184,6 +2192,7 @@ static inline void kiocb_clone(struct ki
 #define I_WB_SWITCH		(1 << 13)
 #define I_OVL_INUSE		(1 << 14)
 #define I_CREATING		(1 << 15)
+#define I_PAGES			(1 << 16)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
@@ -3123,6 +3132,14 @@ static inline void remove_inode_hash(str
 		__remove_inode_hash(inode);
 }
 
+#ifndef CONFIG_HIGHMEM
+extern void inode_pages_set(struct inode *inode);
+extern void inode_pages_clear(struct inode *inode);
+#else
+static inline void inode_pages_set(struct inode *inode) {}
+static inline void inode_pages_clear(struct inode *inode) {}
+#endif
+
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
--- a/include/linux/pagemap.h~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/include/linux/pagemap.h
@@ -613,7 +613,7 @@ int add_to_page_cache_locked(struct page
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page, void *shadow);
+extern bool __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
--- a/include/linux/vm_event_item.h~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/include/linux/vm_event_item.h
@@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
-		PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
+		SLABS_SCANNED,
+		PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
--- a/mm/filemap.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/filemap.c
@@ -116,8 +116,8 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
-static void page_cache_delete(struct address_space *mapping,
-				   struct page *page, void *shadow)
+static bool __must_check page_cache_delete(struct address_space *mapping,
+					   struct page *page, void *shadow)
 {
 	XA_STATE(xas, &mapping->i_pages, page->index);
 	unsigned int nr = 1;
@@ -151,6 +151,8 @@ static void page_cache_delete(struct add
 		smp_wmb();
 	}
 	mapping->nrpages -= nr;
+
+	return mapping_empty(mapping);
 }
 
 static void unaccount_page_cache_page(struct address_space *mapping,
@@ -227,15 +229,18 @@ static void unaccount_page_cache_page(st
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the i_pages lock.
+ *
+ * If this returns true, the caller must call inode_pages_clear()
+ * after dropping the i_pages lock.
  */
-void __delete_from_page_cache(struct page *page, void *shadow)
+bool __must_check __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 
 	unaccount_page_cache_page(mapping, page);
-	page_cache_delete(mapping, page, shadow);
+	return page_cache_delete(mapping, page, shadow);
 }
 
 static void page_cache_free_page(struct address_space *mapping,
@@ -267,12 +272,16 @@ void delete_from_page_cache(struct page
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
+	bool empty;
 
 	BUG_ON(!PageLocked(page));
 	xa_lock_irqsave(&mapping->i_pages, flags);
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	page_cache_free_page(mapping, page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
@@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache);
  *
  * The function expects the i_pages lock to be held.
  */
-static void page_cache_delete_batch(struct address_space *mapping,
-			     struct pagevec *pvec)
+static bool __must_check page_cache_delete_batch(struct address_space *mapping,
+						 struct pagevec *pvec)
 {
 	XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
 	int total_pages = 0;
@@ -337,12 +346,15 @@ static void page_cache_delete_batch(stru
 		total_pages++;
 	}
 	mapping->nrpages -= total_pages;
+
+	return mapping_empty(mapping);
 }
 
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec)
 {
 	int i;
+	bool empty;
 	unsigned long flags;
 
 	if (!pagevec_count(pvec))
@@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct
 
 		unaccount_page_cache_page(mapping, pvec->pages[i]);
 	}
-	page_cache_delete_batch(mapping, pvec);
+	empty = page_cache_delete_batch(mapping, pvec);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	for (i = 0; i < pagevec_count(pvec); i++)
 		page_cache_free_page(mapping, pvec->pages[i]);
 }
@@ -833,6 +848,7 @@ static int __add_to_page_cache_locked(st
 	XA_STATE(xas, &mapping->i_pages, offset);
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
+	bool populated = false;
 	int error;
 	void *old;
 
@@ -860,6 +876,7 @@ static int __add_to_page_cache_locked(st
 		if (xas_error(&xas))
 			goto unlock;
 
+		populated = mapping_empty(mapping);
 		if (xa_is_value(old)) {
 			mapping->nrexceptional--;
 			if (shadowp)
@@ -880,6 +897,10 @@ unlock:
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
+
+	if (populated)
+		inode_pages_set(mapping->host);
+
 	return 0;
 error:
 	page->mapping = NULL;
--- a/mm/huge_memory.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/huge_memory.c
@@ -2608,7 +2608,8 @@ static void __split_huge_page(struct pag
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
-			__delete_from_page_cache(head + i, NULL);
+			/* We know we're not removing the last page */
+			(void)__delete_from_page_cache(head + i, NULL);
 			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
 				shmem_uncharge(head->mapping->host, 1);
 			put_page(head + i);
--- a/mm/truncate.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/truncate.c
@@ -31,24 +31,31 @@
  * itself locked.  These unlocked entries need verification under the tree
  * lock.
  */
-static inline void __clear_shadow_entry(struct address_space *mapping,
-				pgoff_t index, void *entry)
+static bool __must_check __clear_shadow_entry(struct address_space *mapping,
+					      pgoff_t index, void *entry)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 
 	xas_set_update(&xas, workingset_update_node);
 	if (xas_load(&xas) != entry)
-		return;
+		return 0;
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+
+	return mapping_empty(mapping);
 }
 
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
 			       void *entry)
 {
+	bool empty;
+
 	xa_lock_irq(&mapping->i_pages);
-	__clear_shadow_entry(mapping, index, entry);
+	empty = __clear_shadow_entry(mapping, index, entry);
 	xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
 }
 
 /*
@@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_en
 				pgoff_t end)
 {
 	int i, j;
-	bool dax, lock;
+	bool dax, lock, empty = false;
 
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
@@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_en
 			continue;
 		}
 
-		__clear_shadow_entry(mapping, index, page);
+		if (__clear_shadow_entry(mapping, index, page))
+			empty = true;
 	}
 
 	if (lock)
 		xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	pvec->nr = j;
 }
 
@@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct a
 	pgoff_t		index;
 	int		i;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	/* Offsets within partial pages */
@@ -636,6 +648,7 @@ static int
 invalidate_complete_page2(struct address_space *mapping, struct page *page)
 {
 	unsigned long flags;
+	bool empty;
 
 	if (page->mapping != mapping)
 		return 0;
@@ -648,9 +661,12 @@ invalidate_complete_page2(struct address
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
 
@@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct
 	int ret2 = 0;
 	int did_range_unmap = 0;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	pagevec_init(&pvec);
--- a/mm/vmscan.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/vmscan.c
@@ -901,6 +901,7 @@ static int __remove_mapping(struct addre
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
+		int empty;
 
 		freepage = mapping->a_ops->freepage;
 		/*
@@ -922,9 +923,12 @@ static int __remove_mapping(struct addre
 		if (reclaimed && page_is_file_lru(page) &&
 		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(page, target_memcg);
-		__delete_from_page_cache(page, shadow);
+		empty = __delete_from_page_cache(page, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+		if (empty)
+			inode_pages_clear(mapping->host);
+
 		if (freepage != NULL)
 			freepage(page);
 	}
--- a/mm/vmstat.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/vmstat.c
@@ -1205,9 +1205,11 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
 #endif
-	"pginodesteal",
 	"slabs_scanned",
+	"pginodesteal",
 	"kswapd_inodesteal",
+	"pginoderescue",
+	"pginodedelayed",
 	"kswapd_low_wmark_hit_quickly",
 	"kswapd_high_wmark_hit_quickly",
 	"pageoutrun",
--- a/mm/workingset.c~vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru
+++ a/mm/workingset.c
@@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolat
 	struct xa_node *node = container_of(item, struct xa_node, private_list);
 	XA_STATE(xas, node->array, 0);
 	struct address_space *mapping;
+	bool empty = false;
 	int ret;
 
 	/*
@@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolat
 	if (WARN_ON_ONCE(node->count != node->nr_values))
 		goto out_invalid;
 	mapping->nrexceptional -= node->nr_values;
+	empty = mapping_empty(mapping);
 	xas.xa_node = xa_parent_locked(&mapping->i_pages, node);
 	xas.xa_offset = node->offset;
 	xas.xa_shift = node->shift + XA_CHUNK_SHIFT;
@@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolat
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();
_

Patches currently in -mm which might be from hannes@cmpxchg.org are

vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch
mm-fix-numa-node-file-count-error-in-replace_page_cache.patch
mm-memcontrol-fix-stat-corrupting-race-in-charge-moving.patch
mm-memcontrol-drop-compound-parameter-from-memcg-charging-api.patch
mm-shmem-remove-rare-optimization-when-swapin-races-with-hole-punching.patch
mm-memcontrol-move-out-cgroup-swaprate-throttling.patch
mm-memcontrol-convert-page-cache-to-a-new-mem_cgroup_charge-api.patch
mm-memcontrol-prepare-uncharging-for-removal-of-private-page-type-counters.patch
mm-memcontrol-prepare-move_account-for-removal-of-private-page-type-counters.patch
mm-memcontrol-prepare-cgroup-vmstat-infrastructure-for-native-anon-counters.patch
mm-memcontrol-switch-to-native-nr_file_pages-and-nr_shmem-counters.patch
mm-memcontrol-switch-to-native-nr_anon_mapped-counter.patch
mm-memcontrol-switch-to-native-nr_anon_thps-counter.patch
mm-memcontrol-convert-anon-and-file-thp-to-new-mem_cgroup_charge-api.patch
mm-memcontrol-drop-unused-try-commit-cancel-charge-api.patch
mm-memcontrol-prepare-swap-controller-setup-for-integration.patch
mm-memcontrol-make-swap-tracking-an-integral-part-of-memory-control.patch
mm-memcontrol-charge-swapin-pages-on-instantiation.patch
mm-memcontrol-delete-unused-lrucare-handling.patch
mm-memcontrol-update-page-mem_cgroup-stability-rules.patch

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2021-06-14 22:01 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-14 22:00 + vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch added to -mm tree akpm
  -- strict thread matches above, loose matches on Subject: below --
2020-05-08  1:35 incoming Andrew Morton
2020-05-13 21:26 ` + vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-lru.patch added to -mm tree Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.