Linux-Fsdevel Archive on lore.kernel.org
* [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
@ 2020-02-11 17:55 Johannes Weiner
  2020-02-11 18:20 ` Johannes Weiner
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-02-11 17:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel
  Cc: Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin,
	Andrew Morton, Linus Torvalds, Al Viro, kernel-team

The VFS inode shrinker is currently allowed to reclaim inodes with
populated page cache. As a result it can drop gigabytes of hot and
active page cache on the floor without consulting the VM (recorded as
"inodesteal" events in /proc/vmstat).

This causes real problems in practice. Consider for example how the VM
would cache a source tree, such as the Linux git tree. As large parts
of the checked out files and the object database are accessed
repeatedly, the page cache holding this data gets moved to the active
list, where it's fully (and indefinitely) insulated from one-off cache
moving through the inactive list.

However, due to the way users interact with the tree, no ongoing open
file descriptors into the source tree are maintained, and the inodes
end up on the "unused inode" shrinker LRU. A larger burst of one-off
cache (find, updatedb, etc.) can now drive the VFS shrinkers to drop
first the dentries and then the inodes - inodes that contain the most
valuable data currently held by the page cache - while there is plenty
of one-off cache that could be reclaimed instead.

This doesn't make sense. The inodes aren't really "unused" as long as
the VM deems it worthwhile to hold on to their page cache. And the
shrinker can't possibly guess what is and isn't valuable to the VM
based on recent inode reference information alone (we could delete
several thousand lines of reclaim code if it could).

History

This behavior of invalidating page cache from the inode shrinker goes
back to even before the git import of the kernel tree. It may have
been less noticeable when the VM itself didn't have real workingset
protection, and floods of one-off cache would push out any active
cache over time anyway. But the VM has come a long way since then and
the inode shrinker is now actively subverting its caching strategy.

As this keeps causing problems for people, there have been several
attempts to address it.

One recent attempt was to make the inode shrinker simply skip over
inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
inodes with many attached pages").

However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
don't reclaim inodes with many attached pages"") because it caused
severe reclaim performance problems: Inodes that sit on the shrinker
LRU are attracting reclaim pressure away from the page cache and
toward the VFS. If we then permanently exempt sizable portions of this
pool from actually getting reclaimed when looked at, this pressure
accumulates as deferred shrinker work (a mechanism for *temporarily*
unreclaimable objects) until it causes mayhem in the VFS cache pools.

In the bug quoted in 69056ee6a8a3 in particular, the excessive
pressure drove the XFS shrinker into dirty objects, where it caused
synchronous, IO-bound stalls, even as there was plenty of clean page
cache that should have been reclaimed instead.

Another variant of this problem was recently observed, where the
kernel violates cgroups' memory.low protection settings and reclaims
page cache way beyond the configured thresholds. It was followed by a
proposal for a modified form of the reverted commit above, which
implements memory.low-sensitive shrinker skipping of populated
inodes on the LRU [1]. However, this proposal continues to run the
risk of attracting disproportionate reclaim pressure to a pool of
still-used inodes, while not addressing the more generic reclaim
inversion problem outside of a very specific cgroup application.

[1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/

Solution

To fix the reclaim inversion described in the beginning, without
reintroducing the problems associated with shrinker LRU rotations,
this patch keeps populated inodes off the LRUs entirely.

Currently, inodes are kept off the shrinker LRU as long as they have
an elevated i_count, indicating an active user. Unfortunately, the
page cache cannot simply hold an i_count reference, because unlink()
*should* result in the inode being dropped and its cache invalidated.

Instead, this patch makes iput_final() consult the state of the page
cache and punt the LRU linking to the VM if the inode is still
populated; the VM in turn checks the inode state when it depopulates
the page cache, and adds the inode to the LRU if necessary.

This is not unlike what we do for dirty inodes, which are moved off
the LRU permanently until writeback completion puts them back on (iff
still unused). We can reuse the same code -- inode_add_lru() -- here.

This is also not unlike page reclaim, where the lower VM layer has to
negotiate state with the higher VFS layer. Follow existing precedent
and handle the inversion as much as possible on the VM side:

- introduce an I_PAGES flag that the VM maintains under the i_lock, so
  that any inode code holding that lock can check the page cache state
  without having to lock and inspect the struct address_space

- introduce inode_pages_set() and inode_pages_clear() to maintain the
  inode LRU state from the VM side, then update all cache mutators to
  use them when populating the first cache entry or clearing the last
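
A minimal userspace model of these mutator hooks (illustrative names
only; the kernel versions also have to order the emptiness check
against the i_pages lock and take the i_lock) might look like:

```c
#include <stdbool.h>

/*
 * Illustrative model of how the cache mutators detect the
 * empty <-> populated transitions of an address_space; not the
 * kernel code, just the bookkeeping idea.
 */

struct toy_mapping {
	unsigned long nrpages;		/* regular page cache entries */
	unsigned long nrexceptional;	/* shadow/DAX entries */
	bool i_pages;			/* models the inode's I_PAGES bit */
};

static bool toy_mapping_empty(const struct toy_mapping *m)
{
	return m->nrpages + m->nrexceptional == 0;
}

/* First entry added: raise I_PAGES (inode_pages_set() in the patch). */
static void toy_add_page(struct toy_mapping *m)
{
	bool was_empty = toy_mapping_empty(m);

	m->nrpages++;
	if (was_empty)
		m->i_pages = true;
}

/* Last entry removed: clear I_PAGES (inode_pages_clear()). */
static void toy_delete_page(struct toy_mapping *m)
{
	m->nrpages--;
	if (toy_mapping_empty(m))
		m->i_pages = false;
}
```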

With this, the concept of "inodesteal" - where the inode shrinker
drops page cache - is a thing of the past. The VM is in charge of the
page cache, the inode shrinker is in charge of freeing struct inode.

Footnotes

- For debuggability, add vmstat counters that track the number of
  times a new cache entry pulls a previously unused inode off the LRU
  (pginoderescue), as well as how many times existing cache deferred
  an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters
  for backwards compatibility, but they'll just show 0 now.

- Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
  cache. Not doing so has always been a bit strange, but since most
  people drop page cache and metadata caches together, the inode
  shrinker used to take care of the shadow entries anyway. It no
  longer does, so drop them from the VM side.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c                |   2 +-
 fs/dax.c                      |  14 +++++
 fs/drop_caches.c              |   2 +-
 fs/inode.c                    | 106 ++++++++++++++++++++++++++--------
 fs/internal.h                 |   2 +-
 include/linux/fs.h            |  12 ++++
 include/linux/pagemap.h       |   2 +-
 include/linux/vm_event_item.h |   3 +-
 mm/filemap.c                  |  39 ++++++++++---
 mm/huge_memory.c              |   3 +-
 mm/truncate.c                 |  34 ++++++++---
 mm/vmscan.c                   |   6 +-
 mm/vmstat.c                   |   4 +-
 mm/workingset.c               |   4 ++
 14 files changed, 183 insertions(+), 50 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..46f67147ad20 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/dax.c b/fs/dax.c
index 35da144375a0..7f3ce4612272 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas,
 {
 	unsigned long index = xas->xa_index;
 	bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
+	int populated;
 	void *entry;
 
 retry:
+	populated = 0;
 	xas_lock_irq(xas);
 	entry = get_unlocked_entry(xas, order);
 
@@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas,
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, true);
 		mapping->nrexceptional--;
+		if (mapping_empty(mapping))
+			populated = -1;
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas,
 		dax_lock_entry(xas, entry);
 		if (xas_error(xas))
 			goto out_unlock;
+		if (mapping_empty(mapping))
+			populated++;
 		mapping->nrexceptional++;
 	}
 
 out_unlock:
 	xas_unlock_irq(xas);
+	if (populated == -1)
+		inode_pages_clear(mapping->inode);
+	else if (populated == 1)
+		inode_pages_set(mapping->inode);
 	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
 		goto retry;
 	if (xas->xa_node == XA_ERROR(-ENOMEM))
@@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
+	bool empty = false;
 	int ret = 0;
 	void *entry;
 
@@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	dax_disassociate_entry(entry, mapping, trunc);
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+	empty = mapping_empty(mapping);
 	ret = 1;
 out:
 	put_unlocked_entry(&xas, entry);
 	xas_unlock_irq(&xas);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	return ret;
 }
 
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index dc1a1d5d825b..a5e9e9053474 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		 * we need to reschedule to avoid softlockups.
 		 */
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0 && !need_resched())) {
+		    (mapping_empty(inode->i_mapping) && !need_resched())) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..575b780fa9bb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -430,25 +430,87 @@ static void inode_lru_list_add(struct inode *inode)
 		inode->i_state |= I_REFERENCED;
 }
 
+static void inode_lru_list_del(struct inode *inode)
+{
+	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
+		this_cpu_dec(nr_unused);
+}
+
 /*
  * Add inode to LRU if needed (inode is unused and clean).
  *
  * Needs inode->i_lock held.
  */
-void inode_add_lru(struct inode *inode)
+bool inode_add_lru(struct inode *inode)
 {
-	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
-				I_FREEING | I_WILL_FREE)) &&
-	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
-		inode_lru_list_add(inode);
+	if (inode->i_state &
+	    (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES))
+		return false;
+	if (atomic_read(&inode->i_count))
+		return false;
+	if (!(inode->i_sb->s_flags & SB_ACTIVE))
+		return false;
+	inode_lru_list_add(inode);
+	return true;
 }
 
+/**
+ * inode_pages_set - mark the inode as holding page cache
+ * @inode: the inode whose first cache page was just added
+ *
+ * Tell the VFS that this inode has populated page cache and must not
+ * be reclaimed by the inode shrinker.
+ *
+ * The caller must hold the page lock of the just-added page: by
+ * pinning the page, the page cache cannot become depopulated, and we
+ * can safely set I_PAGES without a race check under the i_pages lock.
+ *
+ * This function acquires the i_lock.
+ */
+void inode_pages_set(struct inode *inode)
+{
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_PAGES)) {
+		inode->i_state |= I_PAGES;
+		if (!list_empty(&inode->i_lru)) {
+			count_vm_event(PGINODERESCUE);
+			inode_lru_list_del(inode);
+		}
+	}
+	spin_unlock(&inode->i_lock);
+}
 
-static void inode_lru_list_del(struct inode *inode)
+/**
+ * inode_pages_clear - mark the inode as not holding page cache
+ * @inode: the inode whose last cache page was just removed
+ *
+ * Tell the VFS that the inode no longer holds page cache and that its
+ * lifetime is to be handed over to the inode shrinker LRU.
+ *
+ * This function acquires the i_lock and the i_pages lock.
+ */
+void inode_pages_clear(struct inode *inode)
 {
+	struct address_space *mapping = &inode->i_data;
+	bool add_to_lru = false;
+	unsigned long flags;
 
-	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
-		this_cpu_dec(nr_unused);
+	spin_lock(&inode->i_lock);
+
+	xa_lock_irqsave(&mapping->i_pages, flags);
+	if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) {
+		inode->i_state &= ~I_PAGES;
+		add_to_lru = true;
+	}
+	xa_unlock_irqrestore(&mapping->i_pages, flags);
+
+	if (add_to_lru) {
+		WARN_ON_ONCE(!list_empty(&inode->i_lru));
+		if (inode_add_lru(inode))
+			__count_vm_event(PGINODEDELAYED);
+	}
+
+	spin_unlock(&inode->i_lock);
 }
 
 /**
@@ -742,6 +804,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+	WARN_ON_ONCE(inode->i_state & I_PAGES);
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
@@ -761,23 +825,17 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_ROTATE;
 	}
 
-	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-		__iget(inode);
+	/*
+	 * Populated inodes shouldn't be on the shrinker LRU, but they
+	 * can be briefly visible when a new page is added to an inode
+	 * that was already linked but inode_pages_set() hasn't run
+	 * yet to move them off.
+	 */
+	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
+		list_lru_isolate(lru, &inode->i_lru);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(lru_lock);
-		if (remove_inode_buffers(inode)) {
-			unsigned long reap;
-			reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
-			if (current_is_kswapd())
-				__count_vm_events(KSWAPD_INODESTEAL, reap);
-			else
-				__count_vm_events(PGINODESTEAL, reap);
-			if (current->reclaim_state)
-				current->reclaim_state->reclaimed_slab += reap;
-		}
-		iput(inode);
-		spin_lock(lru_lock);
-		return LRU_RETRY;
+		this_cpu_dec(nr_unused);
+		return LRU_REMOVED;
 	}
 
 	WARN_ON(inode->i_state & I_NEW);
diff --git a/fs/internal.h b/fs/internal.h
index f3f280b952a3..4a9dc77e817b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -139,7 +139,7 @@ extern int vfs_open(const struct path *, struct file *);
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
-extern void inode_add_lru(struct inode *inode);
+extern bool inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..a98d9dee39f4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -585,6 +585,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
 	atomic_inc(&mapping->i_mmap_writable);
 }
 
+static inline bool mapping_empty(struct address_space *mapping)
+{
+	return mapping->nrpages + mapping->nrexceptional == 0;
+}
+
 /*
  * Use sequence counter to get consistent i_size on 32-bit processors.
  */
@@ -2150,6 +2155,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_CREATING		New object's inode in the middle of setting up.
  *
+ * I_PAGES		Inode is holding page cache that needs to get reclaimed
+ *			first before the inode can go onto the shrinker LRU.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2172,6 +2180,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_WB_SWITCH		(1 << 13)
 #define I_OVL_INUSE		(1 << 14)
 #define I_CREATING		(1 << 15)
+#define I_PAGES			(1 << 16)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
@@ -3097,6 +3106,9 @@ static inline void remove_inode_hash(struct inode *inode)
 		__remove_inode_hash(inode);
 }
 
+extern void inode_pages_set(struct inode *inode);
+extern void inode_pages_clear(struct inode *inode);
+
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ccb14b6a16b5..ae4d90bd0873 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -609,7 +609,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page, void *shadow);
+extern bool __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..f31026ccf590 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
-		PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
+		SLABS_SCANNED,
+		PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..fcc24b3b3540 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,8 +116,8 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
-static void page_cache_delete(struct address_space *mapping,
-				   struct page *page, void *shadow)
+static bool __must_check page_cache_delete(struct address_space *mapping,
+					   struct page *page, void *shadow)
 {
 	XA_STATE(xas, &mapping->i_pages, page->index);
 	unsigned int nr = 1;
@@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping,
 		smp_wmb();
 	}
 	mapping->nrpages -= nr;
+
+	return mapping_empty(mapping);
 }
 
 static void unaccount_page_cache_page(struct address_space *mapping,
@@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping,
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the i_pages lock.
+ *
+ * If this returns true, the caller must call inode_pages_clear()
+ * after dropping the i_pages lock.
  */
-void __delete_from_page_cache(struct page *page, void *shadow)
+bool __must_check __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 
 	unaccount_page_cache_page(mapping, page);
-	page_cache_delete(mapping, page, shadow);
+	return page_cache_delete(mapping, page, shadow);
 }
 
 static void page_cache_free_page(struct address_space *mapping,
@@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
+	bool empty;
 
 	BUG_ON(!PageLocked(page));
 	xa_lock_irqsave(&mapping->i_pages, flags);
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	page_cache_free_page(mapping, page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
@@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache);
  *
  * The function expects the i_pages lock to be held.
  */
-static void page_cache_delete_batch(struct address_space *mapping,
-			     struct pagevec *pvec)
+static bool __must_check page_cache_delete_batch(struct address_space *mapping,
+						 struct pagevec *pvec)
 {
 	XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
 	int total_pages = 0;
@@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping,
 		total_pages++;
 	}
 	mapping->nrpages -= total_pages;
+
+	return mapping_empty(mapping);
 }
 
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec)
 {
 	int i;
+	bool empty;
 	unsigned long flags;
 
 	if (!pagevec_count(pvec))
@@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping,
 
 		unaccount_page_cache_page(mapping, pvec->pages[i]);
 	}
-	page_cache_delete_batch(mapping, pvec);
+	empty = page_cache_delete_batch(mapping, pvec);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	for (i = 0; i < pagevec_count(pvec); i++)
 		page_cache_free_page(mapping, pvec->pages[i]);
 }
@@ -831,9 +846,10 @@ static int __add_to_page_cache_locked(struct page *page,
 				      void **shadowp)
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
+	int error;
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
-	int error;
+	bool populated = false;
 	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -860,6 +876,7 @@ static int __add_to_page_cache_locked(struct page *page,
 		if (xas_error(&xas))
 			goto unlock;
 
+		populated = mapping_empty(mapping);
 		if (xa_is_value(old)) {
 			mapping->nrexceptional--;
 			if (shadowp)
@@ -880,6 +897,10 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
+
+	if (populated)
+		inode_pages_set(mapping->host);
+
 	return 0;
 error:
 	page->mapping = NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..8b3e33a52d93 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2535,7 +2535,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
-			__delete_from_page_cache(head + i, NULL);
+			/* We know we're not removing the last page */
+			(void)__delete_from_page_cache(head + i, NULL);
 			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
 				shmem_uncharge(head->mapping->host, 1);
 			put_page(head + i);
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..8fb6c2f762bc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -31,24 +31,31 @@
  * itself locked.  These unlocked entries need verification under the tree
  * lock.
  */
-static inline void __clear_shadow_entry(struct address_space *mapping,
-				pgoff_t index, void *entry)
+static bool __must_check __clear_shadow_entry(struct address_space *mapping,
+					      pgoff_t index, void *entry)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 
 	xas_set_update(&xas, workingset_update_node);
 	if (xas_load(&xas) != entry)
-		return;
+		return 0;
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+
+	return mapping_empty(mapping);
 }
 
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
 			       void *entry)
 {
+	bool empty;
+
 	xa_lock_irq(&mapping->i_pages);
-	__clear_shadow_entry(mapping, index, entry);
+	empty = __clear_shadow_entry(mapping, index, entry);
 	xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
 }
 
 /*
@@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 				pgoff_t end)
 {
 	int i, j;
-	bool dax, lock;
+	bool dax, lock, empty = false;
 
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
@@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 			continue;
 		}
 
-		__clear_shadow_entry(mapping, index, page);
+		if (__clear_shadow_entry(mapping, index, page))
+			empty = true;
 	}
 
 	if (lock)
 		xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	pvec->nr = j;
 }
 
@@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	pgoff_t		index;
 	int		i;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	/* Offsets within partial pages */
@@ -636,6 +648,7 @@ static int
 invalidate_complete_page2(struct address_space *mapping, struct page *page)
 {
 	unsigned long flags;
+	bool empty;
 
 	if (page->mapping != mapping)
 		return 0;
@@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
 
@@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int ret2 = 0;
 	int did_range_unmap = 0;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	pagevec_init(&pvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05eb9efec07..c82e9831003f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
+		int empty;
 
 		freepage = mapping->a_ops->freepage;
 		/*
@@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		if (reclaimed && page_is_file_cache(page) &&
 		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(page, target_memcg);
-		__delete_from_page_cache(page, shadow);
+		empty = __delete_from_page_cache(page, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+		if (empty)
+			inode_pages_clear(mapping->host);
+
 		if (freepage != NULL)
 			freepage(page);
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..ae802253f71c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1203,9 +1203,11 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
 #endif
-	"pginodesteal",
 	"slabs_scanned",
+	"pginodesteal",
 	"kswapd_inodesteal",
+	"pginoderescue",
+	"pginodedelayed",
 	"kswapd_low_wmark_hit_quickly",
 	"kswapd_high_wmark_hit_quickly",
 	"pageoutrun",
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b76ced..7ce9c74ebfdb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	struct xa_node *node = container_of(item, struct xa_node, private_list);
 	XA_STATE(xas, node->array, 0);
 	struct address_space *mapping;
+	bool empty = false;
 	int ret;
 
 	/*
@@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	if (WARN_ON_ONCE(node->count != node->nr_values))
 		goto out_invalid;
 	mapping->nrexceptional -= node->nr_values;
+	empty = mapping_empty(mapping);
 	xas.xa_node = xa_parent_locked(&mapping->i_pages, node);
 	xas.xa_offset = node->offset;
 	xas.xa_shift = node->shift + XA_CHUNK_SHIFT;
@@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();
-- 
2.24.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 17:55 [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
@ 2020-02-11 18:20 ` Johannes Weiner
  2020-02-11 19:05 ` Rik van Riel
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-02-11 18:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel
  Cc: Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin,
	Andrew Morton, Linus Torvalds, Al Viro, kernel-team

On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote:
> However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
> don't reclaim inodes with many attached pages"") because it caused
> severe reclaim performance problems: Inodes that sit on the shrinker
> LRU are attracting reclaim pressure away from the page cache and
> toward the VFS. If we then permanently exempt sizable portions of this
> pool from actually getting reclaimed when looked at, this pressure
> accumulates as deferred shrinker work (a mechanism for *temporarily*
> unreclaimable objects) until it causes mayhem in the VFS cache pools.
> 
> In the bug quoted in 69056ee6a8a3 in particular, the excessive
> pressure drove the XFS shrinker into dirty objects, where it caused
> synchronous, IO-bound stalls, even as there was plenty of clean page
> cache that should have been reclaimed instead.

A note on testing: the patch behaves much better on my machine and the
inode shrinker doesn't drop hot page cache anymore, without noticeable
downsides so far.

However, I tried to reproduce the xfs case that caused the
69056ee6a8a3 revert and haven't managed to do so yet on 5.5 plus the
reverted patch. I cannot provoke higher inode sync stalls in the xfs
shrinker regardless of shrinker strategy. Maybe something else changed
since 4.19 and it's less of a concern now.

Nonetheless, I'm interested in opinions on the premise of this
patch. And Yafang is working on his memcg-specific fix for this issue,
so I wanted to put this proposal on the table sooner than later.

Thanks


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 17:55 [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
  2020-02-11 18:20 ` Johannes Weiner
@ 2020-02-11 19:05 ` Rik van Riel
  2020-02-11 19:31   ` Johannes Weiner
  2020-02-12 12:25 ` Yafang Shao
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2020-02-11 19:05 UTC (permalink / raw)
  To: Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel
  Cc: Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin,
	Andrew Morton, Linus Torvalds, Al Viro, kernel-team

On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote:
> The VFS inode shrinker is currently allowed to reclaim inodes with
> populated page cache. As a result it can drop gigabytes of hot and
> active page cache on the floor without consulting the VM (recorded as
> "inodesteal" events in /proc/vmstat).
> 
> This causes real problems in practice. Consider for example how the
> VM
> would cache a source tree, such as the Linux git tree. As large parts
> of the checked out files and the object database are accessed
> repeatedly, the page cache holding this data gets moved to the active
> list, where it's fully (and indefinitely) insulated from one-off
> cache
> moving through the inactive list.

> This behavior of invalidating page cache from the inode shrinker goes
> back to even before the git import of the kernel tree. It may have
> been less noticeable when the VM itself didn't have real workingset
> protection, and floods of one-off cache would push out any active
> cache over time anyway. But the VM has come a long way since then and
> the inode shrinker is now actively subverting its caching strategy.

Two things come to mind when looking at this:
- highmem
- NUMA

IIRC one of the reasons reclaim is done in this way is
because a page cache page in one area of memory (highmem,
or a NUMA node) can end up pinning inode slab memory in
another memory area (normal zone, other NUMA node).

I do not know how much of a concern that still is nowadays,
but it seemed something worth bringing up.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 19:05 ` Rik van Riel
@ 2020-02-11 19:31   ` Johannes Weiner
  2020-02-11 23:44     ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-02-11 19:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-fsdevel, linux-mm, linux-kernel, Dave Chinner, Yafang Shao,
	Michal Hocko, Roman Gushchin, Andrew Morton, Linus Torvalds,
	Al Viro, kernel-team

On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote:
> On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote:
> > The VFS inode shrinker is currently allowed to reclaim inodes with
> > populated page cache. As a result it can drop gigabytes of hot and
> > active page cache on the floor without consulting the VM (recorded as
> > "inodesteal" events in /proc/vmstat).
> > 
> > This causes real problems in practice. Consider for example how the
> > VM
> > would cache a source tree, such as the Linux git tree. As large parts
> > of the checked out files and the object database are accessed
> > repeatedly, the page cache holding this data gets moved to the active
> > list, where it's fully (and indefinitely) insulated from one-off
> > cache
> > moving through the inactive list.
> 
> > This behavior of invalidating page cache from the inode shrinker goes
> > back to even before the git import of the kernel tree. It may have
> > been less noticeable when the VM itself didn't have real workingset
> > protection, and floods of one-off cache would push out any active
> > cache over time anyway. But the VM has come a long way since then and
> > the inode shrinker is now actively subverting its caching strategy.
> 
> Two things come to mind when looking at this:
> - highmem
> - NUMA
> 
> IIRC one of the reasons reclaim is done in this way is
> because a page cache page in one area of memory (highmem,
> or a NUMA node) can end up pinning inode slab memory in
> another memory area (normal zone, other NUMA node).

That's a good point, highmem does ring a bell now that you mention it.

If we still care, I think this could be solved by doing something
similar to what we do with buffer_heads_over_limit: allow a lowmem
allocation to reclaim page cache inside the highmem zone if the bhs
(or inodes in this case) have accumulated excessively.

AFAICS, we haven't done anything similar for NUMA, so it might not be
much of a problem there. I could imagine this is in part because NUMA
nodes tend to be more balanced in size, and the ratio between cache
memory and inode/bh memory means that these objects won't turn into a
significant externality. Whereas with extreme highmem:lowmem ratios,
they can.
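A minimal model of that idea, with a hypothetical counter and threshold standing in for the real accounting (the actual buffer_heads_over_limit handling lives in mm/vmscan.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sketch (all names hypothetical) of the buffer_heads_over_limit
 * pattern applied to inodes: lowmem slab objects pinned by highmem
 * page cache accumulate, and once they exceed a threshold, even a
 * lowmem-constrained reclaim pass is allowed to scan highmem page
 * cache so the pinned objects can finally be freed. */

static unsigned long nr_pinned_inodes;
static const unsigned long inode_limit = 1000;

static bool inodes_over_limit(void)
{
	return nr_pinned_inodes > inode_limit;
}

/* Decide whether this reclaim pass may scan the highmem zone.  A
 * lowmem allocation normally skips highmem entirely; excessive
 * pinned slab overrides that, as vmscan does for buffer heads. */
static bool may_scan_highmem(bool gfp_allows_highmem)
{
	return gfp_allows_highmem || inodes_over_limit();
}
```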


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 19:31   ` Johannes Weiner
@ 2020-02-11 23:44     ` Andrew Morton
  2020-02-12  0:28       ` Linus Torvalds
  2020-02-12 16:35       ` Johannes Weiner
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2020-02-11 23:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, linux-fsdevel, linux-mm, linux-kernel,
	Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin,
	Linus Torvalds, Al Viro, kernel-team

On Tue, 11 Feb 2020 14:31:01 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote:
> > On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote:
> > > The VFS inode shrinker is currently allowed to reclaim inodes with
> > > populated page cache. As a result it can drop gigabytes of hot and
> > > active page cache on the floor without consulting the VM (recorded as
> > > "inodesteal" events in /proc/vmstat).
> > > 
> > > This causes real problems in practice. Consider for example how the
> > > VM
> > > would cache a source tree, such as the Linux git tree. As large parts
> > > of the checked out files and the object database are accessed
> > > repeatedly, the page cache holding this data gets moved to the active
> > > list, where it's fully (and indefinitely) insulated from one-off
> > > cache
> > > moving through the inactive list.
> > 
> > > This behavior of invalidating page cache from the inode shrinker goes
> > > back to even before the git import of the kernel tree. It may have
> > > been less noticeable when the VM itself didn't have real workingset
> > > protection, and floods of one-off cache would push out any active
> > > cache over time anyway. But the VM has come a long way since then and
> > > the inode shrinker is now actively subverting its caching strategy.
> > 
> > Two things come to mind when looking at this:
> > - highmem
> > - NUMA
> > 
> > IIRC one of the reasons reclaim is done in this way is
> > because a page cache page in one area of memory (highmem,
> > or a NUMA node) can end up pinning inode slab memory in
> > another memory area (normal zone, other NUMA node).
> 
> That's a good point, highmem does ring a bell now that you mention it.

Yup, that's why this mechanism exists.  Here:

https://marc.info/?l=git-commits-head&m=103646757213266&w=2

> If we still care, I think this could be solved by doing something
> similar to what we do with buffer_heads_over_limit: allow a lowmem
> allocation to reclaim page cache inside the highmem zone if the bhs
> (or inodes in this case) have accumulated excessively.

Well, reclaiming highmem pagecache at random would be a painful way to
reclaim lowmem inodes.  Better to pick an inode then shoot down all its
pagecache.  Perhaps we could take its pagecache's aging into account.
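That alternative could be sketched as follows; everything here is a hypothetical userspace model, with shoot_down_cache() standing in for an invalidate_mapping_pages() call on the whole mapping:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: rather than reclaiming highmem page cache at
 * random to free lowmem inodes, pick the coldest populated unused
 * inode and invalidate all of its cache at once, so the aging of the
 * page cache is at least roughly respected. */
struct toy_inode {
	unsigned long nrpages;
	unsigned long last_used;	/* smaller == colder */
};

static struct toy_inode *pick_coldest(struct toy_inode *inodes, size_t n)
{
	struct toy_inode *victim = NULL;

	for (size_t i = 0; i < n; i++)
		if (inodes[i].nrpages &&
		    (!victim || inodes[i].last_used < victim->last_used))
			victim = &inodes[i];
	return victim;
}

/* Stands in for invalidating the victim's entire mapping. */
static void shoot_down_cache(struct toy_inode *inode)
{
	inode->nrpages = 0;
}
```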

Testing this will be a challenge, but the issue was real - a 7GB
highmem machine isn't crazy and I expect the inode has become larger
since those days.

> AFAICS, we haven't done anything similar for NUMA, so it might not be
> much of a problem there. I could imagine this is in part because NUMA
> nodes tend to be more balanced in size, and the ratio between cache
> memory and inode/bh memory means that these objects won't turn into a
> significant externality. Whereas with extreme highmem:lowmem ratios,
> they can.


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 23:44     ` Andrew Morton
@ 2020-02-12  0:28       ` Linus Torvalds
  2020-02-12  0:47         ` Andrew Morton
                           ` (3 more replies)
  2020-02-12 16:35       ` Johannes Weiner
  1 sibling, 4 replies; 35+ messages in thread
From: Linus Torvalds @ 2020-02-12  0:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List, Dave Chinner, Yafang Shao,
	Michal Hocko, Roman Gushchin, Al Viro, kernel-team

On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Testing this will be a challenge, but the issue was real - a 7GB
> highmem machine isn't crazy and I expect the inode has become larger
> since those days.

Hmm. I would say that in the intervening years a 7GB highmem machine has
indeed become crazy.

It used to be something we kind of supported.

But we really should consider HIGHMEM to be something that is on the
deprecation list. In this day and age, there is no excuse for running
a 32-bit kernel with lots of physical memory.

And if you really want to do that, and have some legacy hardware with
a legacy use case, maybe you should be using a legacy kernel.

I'd personally be perfectly happy to start removing HIGHMEM support again.

                   Linus


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12  0:28       ` Linus Torvalds
@ 2020-02-12  0:47         ` Andrew Morton
  2020-02-12  1:03           ` Linus Torvalds
  2020-02-12  3:58         ` Matthew Wilcox
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2020-02-12  0:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Rik van Riel, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List, Dave Chinner, Yafang Shao,
	Michal Hocko, Roman Gushchin, Al Viro, kernel-team

On Tue, 11 Feb 2020 16:28:39 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Testing this will be a challenge, but the issue was real - a 7GB
> > highmem machine isn't crazy and I expect the inode has become larger
> > since those days.
> 
> Hmm. I would say that in the intervening years a 7GB highmem machine has
> indeed become crazy.
> 
> It used to be something we kind of supported.
> 
> But we really should consider HIGHMEM to be something that is on the
> deprecation list. In this day and age, there is no excuse for running
> a 32-bit kernel with lots of physical memory.
> 
> And if you really want to do that, and have some legacy hardware with
> a legacy use case, maybe you should be using a legacy kernel.
> 
> I'd personally be perfectly happy to start removing HIGHMEM support again.
> 

That would be nice.

What's the situation with highmem on ARM?


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12  0:47         ` Andrew Morton
@ 2020-02-12  1:03           ` Linus Torvalds
  2020-02-12  8:50             ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2020-02-12  1:03 UTC (permalink / raw)
  To: Andrew Morton, Russell King, Linux ARM, Catalin Marinas
  Cc: Johannes Weiner, Rik van Riel, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List, Dave Chinner, Yafang Shao,
	Michal Hocko, Roman Gushchin, Al Viro, kernel-team

On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> What's the situation with highmem on ARM?

Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
needed it, and I was ranting at some people for repeating all the
mistakes Intel did.

But arm64 doesn't need it, and while 32-bit arm is obviously still
selling, I think that in many ways the switch-over to 64-bit has been
quicker on ARM than it was on x86. Partly because it happened later
(so all the 64-bit teething pains were dealt with), but largely
because everybody ended up actively discouraging 32-bit on the Android
side.

There were a couple of unfortunate early 32-bit arm server attempts,
but they were - predictably - complete garbage and nobody bought them.
They don't exist any more.

So at least my gut feel is that the arm people don't have any big
reason to push for maintaining HIGHMEM support either.

But I'm adding a couple of arm people and the arm list just in case
they have some input.

[ Obvious background for newly added people: we're talking about
making CONFIG_HIGHMEM a deprecated feature and saying that if you want
to run with lots of memory on a 32-bit kernel, you're doing legacy
stuff and can use a legacy kernel ]

              Linus


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12  0:28       ` Linus Torvalds
  2020-02-12  0:47         ` Andrew Morton
@ 2020-02-12  3:58         ` Matthew Wilcox
  2020-02-12  8:09         ` Michal Hocko
  2020-02-17 13:31         ` Pavel Machek
  3 siblings, 0 replies; 35+ messages in thread
From: Matthew Wilcox @ 2020-02-12  3:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, linux-fsdevel,
	Linux-MM, Linux Kernel Mailing List, Dave Chinner, Yafang Shao,
	Michal Hocko, Roman Gushchin, Al Viro, kernel-team

On Tue, Feb 11, 2020 at 04:28:39PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > Testing this will be a challenge, but the issue was real - a 7GB
> > highmem machine isn't crazy and I expect the inode has become larger
> > since those days.
> 
> Hmm. I would say that in the intervening years a 7GB highmem machine has
> indeed become crazy.
> 
> It used to be something we kind of supported.
> 
> But we really should consider HIGHMEM to be something that is on the
> deprecation list. In this day and age, there is no excuse for running
> a 32-bit kernel with lots of physical memory.
> 
> And if you really want to do that, and have some legacy hardware with
> a legacy use case, maybe you should be using a legacy kernel.
> 
> I'd personally be perfectly happy to start removing HIGHMEM support again.

Do we have a use case where people want to run modern 32-bit guest kernels
with more than 896MB of memory to support some horrendous legacy app?
Or is our 32-bit compat story good enough that nobody actually does this?
(Contrariwise, maybe those people are also fine with running a ten year
old kernel because a `uname -r` starting with '3' breaks said app)


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12  0:28       ` Linus Torvalds
  2020-02-12  0:47         ` Andrew Morton
  2020-02-12  3:58         ` Matthew Wilcox
@ 2020-02-12  8:09         ` Michal Hocko
  2020-02-17 13:31         ` Pavel Machek
  3 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2020-02-12  8:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, linux-fsdevel,
	Linux-MM, Linux Kernel Mailing List, Dave Chinner, Yafang Shao,
	Roman Gushchin, Al Viro, kernel-team

On Tue 11-02-20 16:28:39, Linus Torvalds wrote:
> On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Testing this will be a challenge, but the issue was real - a 7GB
> > highmem machine isn't crazy and I expect the inode has become larger
> > since those days.
> 
> Hmm. I would say that in the intervening years a 7GB highmem machine has
> indeed become crazy.

Absolutely agreed.

> It used to be something we kind of supported.

And it's been a few years since we started actively discouraging people
from using 32b kernels with a lot of memory. There are bug reports
popping up from time to time, but I do not remember any case where using
a 64b kernel would be a no-go. So my strong suspicion is that people
simply keep their kernels on 32b without a good reason because it tends
to work most of the time, until they hit one of the lowmem problems and
move over to 64b.

> But we really should consider HIGHMEM to be something that is on the
> deprecation list. In this day and age, there is no excuse for running
> a 32-bit kernel with lots of physical memory.
> 
> And if you really want to do that, and have some legacy hardware with
> a legacy use case, maybe you should be using a legacy kernel.
> 
> I'd personally be perfectly happy to start removing HIGHMEM support again.

I wouldn't be opposed at all.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12  1:03           ` Linus Torvalds
@ 2020-02-12  8:50             ` Russell King - ARM Linux admin
  2020-02-13  9:50               ` Lucas Stach
  2020-02-13 16:52               ` Arnd Bergmann
  0 siblings, 2 replies; 35+ messages in thread
From: Russell King - ARM Linux admin @ 2020-02-12  8:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux ARM, Catalin Marinas, Johannes Weiner,
	Rik van Riel, linux-fsdevel, Linux-MM, Linux Kernel Mailing List,
	Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin, Al Viro,
	kernel-team

On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > What's the situation with highmem on ARM?
> 
> Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
> needed it, and I was ranting at some people for repeating all the
> mistakes Intel did.
> 
> But arm64 doesn't need it, and while 32-bit arm is obviously still
> selling, I think that in many ways the switch-over to 64-bit has been
> quicker on ARM than it was on x86. Partly because it happened later
> (so all the 64-bit teething pains were dealt with), but largely
> because everybody ended up actively discouraging 32-bit on the Android
> side.
> 
> There were a couple of unfortunate early 32-bit arm server attempts,
> but they were - predictably - complete garbage and nobody bought them.
> They don't exist any more.
> 
> So at least my gut feel is that the arm people don't have any big
> reason to push for maintaining HIGHMEM support either.
> 
> But I'm adding a couple of arm people and the arm list just in case
> they have some input.
> 
> [ Obvious background for newly added people: we're talking about
> making CONFIG_HIGHMEM a deprecated feature and saying that if you want
> to run with lots of memory on a 32-bit kernel, you're doing legacy
> stuff and can use a legacy kernel ]

Well, the recent 32-bit ARM systems generally have more than 1G
of memory, so make use of highmem as a rule.  You're probably
talking about crippling support for any 32-bit ARM system produced
in the last 8 to 10 years.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 17:55 [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
  2020-02-11 18:20 ` Johannes Weiner
  2020-02-11 19:05 ` Rik van Riel
@ 2020-02-12 12:25 ` Yafang Shao
  2020-02-12 16:42   ` Johannes Weiner
  2020-02-13 18:34 ` [PATCH v2] " Johannes Weiner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Yafang Shao @ 2020-02-12 12:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-fsdevel, Linux MM, LKML, Dave Chinner, Michal Hocko,
	Roman Gushchin, Andrew Morton, Linus Torvalds, Al Viro,
	Kernel Team

On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The VFS inode shrinker is currently allowed to reclaim inodes with
> populated page cache. As a result it can drop gigabytes of hot and
> active page cache on the floor without consulting the VM (recorded as
> "inodesteal" events in /proc/vmstat).
>
> This causes real problems in practice. Consider for example how the VM
> would cache a source tree, such as the Linux git tree. As large parts
> of the checked out files and the object database are accessed
> repeatedly, the page cache holding this data gets moved to the active
> list, where it's fully (and indefinitely) insulated from one-off cache
> moving through the inactive list.
>
> However, due to the way users interact with the tree, no ongoing open
> file descriptors into the source tree are maintained, and the inodes
> end up on the "unused inode" shrinker LRU. A larger burst of one-off
> cache (find, updatedb, etc.) can now drive the VFS shrinkers to drop
> first the dentries and then the inodes - inodes that contain the most
> valuable data currently held by the page cache - while there is plenty
> of one-off cache that could be reclaimed instead.
>
> This doesn't make sense. The inodes aren't really "unused" as long as
> the VM deems it worthwhile to hold on to their page cache. And the
> shrinker can't possibly guess what is and isn't valuable to the VM
> based on recent inode reference information alone (we could delete
> several thousand lines of reclaim code if it could).
>
> History
>
> This behavior of invalidating page cache from the inode shrinker goes
> back to even before the git import of the kernel tree. It may have
> been less noticeable when the VM itself didn't have real workingset
> protection, and floods of one-off cache would push out any active
> cache over time anyway. But the VM has come a long way since then and
> the inode shrinker is now actively subverting its caching strategy.
>
> As this keeps causing problems for people, there have been several
> attempts to address this.
>
> One recent attempt was to make the inode shrinker simply skip over
> inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
> inodes with many attached pages").
>
> However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
> don't reclaim inodes with many attached pages"") because it caused
> severe reclaim performance problems: Inodes that sit on the shrinker
> LRU are attracting reclaim pressure away from the page cache and
> toward the VFS. If we then permanently exempt sizable portions of this
> pool from actually getting reclaimed when looked at, this pressure
> accumulates as deferred shrinker work (a mechanism for *temporarily*
> unreclaimable objects) until it causes mayhem in the VFS cache pools.
>
> In the bug quoted in 69056ee6a8a3 in particular, the excessive
> pressure drove the XFS shrinker into dirty objects, where it caused
> synchronous, IO-bound stalls, even as there was plenty of clean page
> cache that should have been reclaimed instead.
>
> Another variant of this problem was recently observed, where the
> kernel violates cgroups' memory.low protection settings and reclaims
> page cache way beyond the configured thresholds. It was followed by a
> proposal of a modified form of the reverted commit above, that
> implements memory.low-sensitive shrinker skipping over populated
> inodes on the LRU [1]. However, this proposal continues to run the
> risk of attracting disproportionate reclaim pressure to a pool of
> still-used inodes,

Hi Johannes,

If you really think that is a risk, what about the below additional
patch to fix it?

diff --git a/fs/inode.c b/fs/inode.c
index 80dddbc..61862d9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
                goto out;

        cgroup_size = mem_cgroup_size(memcg);
-       if (inode->i_data.nrpages + protection >= cgroup_size)
+       if (inode->i_data.nrpages)
                reclaimable = false;

 out:

With this additional patch, we skip all inodes in this memcg until all
its page cache pages are reclaimed.

> while not addressing the more generic reclaim
> inversion problem outside of a very specific cgroup application.
>

But I have a different understanding.
This method works like a knob. If you really care about your
workingset (data), you turn it on (i.e. by using memcg protection
to protect it), while if you don't care about your workingset you
turn it off. That would be more flexible.
Regarding your case in the commit log, why not protect your linux git
tree with memcg protection?

> [1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/
>
> Solution
>
> To fix the reclaim inversion described in the beginning, without
> reintroducing the problems associated with shrinker LRU rotations,
> this patch keeps populated inodes off the LRUs entirely.
>
> Currently, inodes are kept off the shrinker LRU as long as they have
> an elevated i_count, indicating an active user. Unfortunately, the
> page cache cannot simply hold an i_count reference, because unlink()
> *should* result in the inode being dropped and its cache invalidated.
>
> Instead, this patch makes iput_final() consult the state of the page
> cache and punt the LRU linking to the VM if the inode is still
> populated; the VM in turn checks the inode state when it depopulates
> the page cache, and adds the inode to the LRU if necessary.
>
> This is not unlike what we do for dirty inodes, which are moved off
> the LRU permanently until writeback completion puts them back on (iff
> still unused). We can reuse the same code -- inode_add_lru() - here.
>
> This is also not unlike page reclaim, where the lower VM layer has to
> negotiate state with the higher VFS layer. Follow existing precedence
> and handle the inversion as much as possible on the VM side:
>
> - introduce an I_PAGES flag that the VM maintains under the i_lock, so
>   that any inode code holding that lock can check the page cache state
>   without having to lock and inspect the struct address_space
>
> - introduce inode_pages_set() and inode_pages_clear() to maintain the
>   inode LRU state from the VM side, then update all cache mutators to
>   use them when populating the first cache entry or clearing the last
>
> With this, the concept of "inodesteal" - where the inode shrinker
> drops page cache - is a thing of the past. The VM is in charge of the
> page cache, the inode shrinker is in charge of freeing struct inode.
>
> Footnotes
>
> - For debuggability, add vmstat counters that track the number of
>   times a new cache entry pulls a previously unused inode off the LRU
>   (pginoderescue), as well as how many times existing cache deferred
>   an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters
>   for backwards compatibility, but they'll just show 0 now.
>
> - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
>   cache. Not doing so has always been a bit strange, but since most
>   people drop cache and metadata cache together, the inode shrinker
>   would have taken care of them before - no more, so do it VM-side.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/block_dev.c                |   2 +-
>  fs/dax.c                      |  14 +++++
>  fs/drop_caches.c              |   2 +-
>  fs/inode.c                    | 106 ++++++++++++++++++++++++++--------
>  fs/internal.h                 |   2 +-
>  include/linux/fs.h            |  12 ++++
>  include/linux/pagemap.h       |   2 +-
>  include/linux/vm_event_item.h |   3 +-
>  mm/filemap.c                  |  39 ++++++++++---
>  mm/huge_memory.c              |   3 +-
>  mm/truncate.c                 |  34 ++++++++---
>  mm/vmscan.c                   |   6 +-
>  mm/vmstat.c                   |   4 +-
>  mm/workingset.c               |   4 ++
>  14 files changed, 183 insertions(+), 50 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 69bf2fb6f7cd..46f67147ad20 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev)
>  {
>         struct address_space *mapping = bdev->bd_inode->i_mapping;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 return;
>
>         invalidate_bh_lrus();
> diff --git a/fs/dax.c b/fs/dax.c
> index 35da144375a0..7f3ce4612272 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  {
>         unsigned long index = xas->xa_index;
>         bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
> +       int populated;
>         void *entry;
>
>  retry:
> +       populated = 0;
>         xas_lock_irq(xas);
>         entry = get_unlocked_entry(xas, order);
>
> @@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas,
>                 xas_store(xas, NULL);   /* undo the PMD join */
>                 dax_wake_entry(xas, entry, true);
>                 mapping->nrexceptional--;
> +               if (mapping_empty(mapping))
> +                       populated = -1;
>                 entry = NULL;
>                 xas_set(xas, index);
>         }
> @@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas,
>                 dax_lock_entry(xas, entry);
>                 if (xas_error(xas))
>                         goto out_unlock;
> +               if (mapping_empty(mapping))
> +                       populated++;
>                 mapping->nrexceptional++;
>         }
>
>  out_unlock:
>         xas_unlock_irq(xas);
> +       if (populated == -1)
> +               inode_pages_clear(mapping->host);
> +       else if (populated == 1)
> +               inode_pages_set(mapping->host);
>         if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
>                 goto retry;
>         if (xas->xa_node == XA_ERROR(-ENOMEM))
> @@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>                                           pgoff_t index, bool trunc)
>  {
>         XA_STATE(xas, &mapping->i_pages, index);
> +       bool empty = false;
>         int ret = 0;
>         void *entry;
>
> @@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>         dax_disassociate_entry(entry, mapping, trunc);
>         xas_store(&xas, NULL);
>         mapping->nrexceptional--;
> +       empty = mapping_empty(mapping);
>         ret = 1;
>  out:
>         put_unlocked_entry(&xas, entry);
>         xas_unlock_irq(&xas);
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>         return ret;
>  }
>
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index dc1a1d5d825b..a5e9e9053474 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
>                  * we need to reschedule to avoid softlockups.
>                  */
>                 if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -                   (inode->i_mapping->nrpages == 0 && !need_resched())) {
> +                   (mapping_empty(inode->i_mapping) && !need_resched())) {
>                         spin_unlock(&inode->i_lock);
>                         continue;
>                 }
> diff --git a/fs/inode.c b/fs/inode.c
> index 7d57068b6b7a..575b780fa9bb 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -430,25 +430,87 @@ static void inode_lru_list_add(struct inode *inode)
>                 inode->i_state |= I_REFERENCED;
>  }
>
> +static void inode_lru_list_del(struct inode *inode)
> +{
> +       if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> +               this_cpu_dec(nr_unused);
> +}
> +
>  /*
>   * Add inode to LRU if needed (inode is unused and clean).
>   *
>   * Needs inode->i_lock held.
>   */
> -void inode_add_lru(struct inode *inode)
> +bool inode_add_lru(struct inode *inode)
>  {
> -       if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
> -                               I_FREEING | I_WILL_FREE)) &&
> -           !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
> -               inode_lru_list_add(inode);
> +       if (inode->i_state &
> +           (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES))
> +               return false;
> +       if (atomic_read(&inode->i_count))
> +               return false;
> +       if (!(inode->i_sb->s_flags & SB_ACTIVE))
> +               return false;
> +       inode_lru_list_add(inode);
> +       return true;
>  }
>
> +/**
> + * inode_pages_set - mark the inode as holding page cache
> + * @inode: the inode whose first cache page was just added
> + *
> + * Tell the VFS that this inode has populated page cache and must not
> + * be reclaimed by the inode shrinker.
> + *
> + * The caller must hold the page lock of the just-added page: by
> + * pinning the page, the page cache cannot become depopulated, and we
> + * can safely set I_PAGES without a race check under the i_pages lock.
> + *
> + * This function acquires the i_lock.
> + */
> +void inode_pages_set(struct inode *inode)
> +{
> +       spin_lock(&inode->i_lock);
> +       if (!(inode->i_state & I_PAGES)) {
> +               inode->i_state |= I_PAGES;
> +               if (!list_empty(&inode->i_lru)) {
> +                       count_vm_event(PGINODERESCUE);
> +                       inode_lru_list_del(inode);
> +               }
> +       }
> +       spin_unlock(&inode->i_lock);
> +}
>
> -static void inode_lru_list_del(struct inode *inode)
> +/**
> + * inode_pages_clear - mark the inode as not holding page cache
> + * @inode: the inode whose last cache page was just removed
> + *
> + * Tell the VFS that the inode no longer holds page cache and that its
> + * lifetime is to be handed over to the inode shrinker LRU.
> + *
> + * This function acquires the i_lock and the i_pages lock.
> + */
> +void inode_pages_clear(struct inode *inode)
>  {
> +       struct address_space *mapping = &inode->i_data;
> +       bool add_to_lru = false;
> +       unsigned long flags;
>
> -       if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> -               this_cpu_dec(nr_unused);
> +       spin_lock(&inode->i_lock);
> +
> +       xa_lock_irqsave(&mapping->i_pages, flags);
> +       if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) {
> +               inode->i_state &= ~I_PAGES;
> +               add_to_lru = true;
> +       }
> +       xa_unlock_irqrestore(&mapping->i_pages, flags);
> +
> +       if (add_to_lru) {
> +               WARN_ON_ONCE(!list_empty(&inode->i_lru));
> +               if (inode_add_lru(inode))
> +                       __count_vm_event(PGINODEDELAYED);
> +       }
> +
> +       spin_unlock(&inode->i_lock);
>  }
>
>  /**
> @@ -742,6 +804,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>         if (!spin_trylock(&inode->i_lock))
>                 return LRU_SKIP;
>
> +       WARN_ON_ONCE(inode->i_state & I_PAGES);
> +
>         /*
>          * Referenced or dirty inodes are still in use. Give them another pass
>          * through the LRU as we cannot reclaim them now.
> @@ -761,23 +825,17 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>                 return LRU_ROTATE;
>         }
>
> -       if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> -               __iget(inode);
> +       /*
> +        * Populated inodes shouldn't be on the shrinker LRU, but they
> +        * can be briefly visible when a new page is added to an inode
> +        * that was already linked but inode_pages_set() hasn't run
> +        * yet to move them off.
> +        */
> +       if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
> +               list_lru_isolate(lru, &inode->i_lru);
>                 spin_unlock(&inode->i_lock);
> -               spin_unlock(lru_lock);
> -               if (remove_inode_buffers(inode)) {
> -                       unsigned long reap;
> -                       reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
> -                       if (current_is_kswapd())
> -                               __count_vm_events(KSWAPD_INODESTEAL, reap);
> -                       else
> -                               __count_vm_events(PGINODESTEAL, reap);
> -                       if (current->reclaim_state)
> -                               current->reclaim_state->reclaimed_slab += reap;
> -               }
> -               iput(inode);
> -               spin_lock(lru_lock);
> -               return LRU_RETRY;
> +               this_cpu_dec(nr_unused);
> +               return LRU_REMOVED;
>         }
>
>         WARN_ON(inode->i_state & I_NEW);
> diff --git a/fs/internal.h b/fs/internal.h
> index f3f280b952a3..4a9dc77e817b 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -139,7 +139,7 @@ extern int vfs_open(const struct path *, struct file *);
>   * inode.c
>   */
>  extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
> -extern void inode_add_lru(struct inode *inode);
> +extern bool inode_add_lru(struct inode *inode);
>  extern int dentry_needs_remove_privs(struct dentry *dentry);
>
>  /*
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..a98d9dee39f4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -585,6 +585,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
>         atomic_inc(&mapping->i_mmap_writable);
>  }
>
> +static inline bool mapping_empty(struct address_space *mapping)
> +{
> +       return mapping->nrpages + mapping->nrexceptional == 0;
> +}
> +
>  /*
>   * Use sequence counter to get consistent i_size on 32-bit processors.
>   */
> @@ -2150,6 +2155,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>   *
>   * I_CREATING          New object's inode in the middle of setting up.
>   *
> + * I_PAGES             Inode is holding page cache that needs to get reclaimed
> + *                     first before the inode can go onto the shrinker LRU.
> + *
>   * Q: What is the difference between I_WILL_FREE and I_FREEING?
>   */
>  #define I_DIRTY_SYNC           (1 << 0)
> @@ -2172,6 +2180,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>  #define I_WB_SWITCH            (1 << 13)
>  #define I_OVL_INUSE            (1 << 14)
>  #define I_CREATING             (1 << 15)
> +#define I_PAGES                        (1 << 16)
>
>  #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>  #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> @@ -3097,6 +3106,9 @@ static inline void remove_inode_hash(struct inode *inode)
>                 __remove_inode_hash(inode);
>  }
>
> +extern void inode_pages_set(struct inode *inode);
> +extern void inode_pages_clear(struct inode *inode);
> +
>  extern void inode_sb_list_add(struct inode *inode);
>
>  #ifdef CONFIG_BLOCK
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..ae4d90bd0873 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -609,7 +609,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>                                 pgoff_t index, gfp_t gfp_mask);
>  extern void delete_from_page_cache(struct page *page);
> -extern void __delete_from_page_cache(struct page *page, void *shadow);
> +extern bool __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
>  void delete_from_page_cache_batch(struct address_space *mapping,
>                                   struct pagevec *pvec);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 47a3441cf4c4..f31026ccf590 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_NUMA
>                 PGSCAN_ZONE_RECLAIM_FAILED,
>  #endif
> -               PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
> +               SLABS_SCANNED,
> +               PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED,
>                 KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>                 PAGEOUTRUN, PGROTATED,
>                 DROP_PAGECACHE, DROP_SLAB,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1784478270e1..fcc24b3b3540 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -116,8 +116,8 @@
>   *   ->tasklist_lock            (memory_failure, collect_procs_ao)
>   */
>
> -static void page_cache_delete(struct address_space *mapping,
> -                                  struct page *page, void *shadow)
> +static bool __must_check page_cache_delete(struct address_space *mapping,
> +                                          struct page *page, void *shadow)
>  {
>         XA_STATE(xas, &mapping->i_pages, page->index);
>         unsigned int nr = 1;
> @@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping,
>                 smp_wmb();
>         }
>         mapping->nrpages -= nr;
> +
> +       return mapping_empty(mapping);
>  }
>
>  static void unaccount_page_cache_page(struct address_space *mapping,
> @@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping,
>   * Delete a page from the page cache and free it. Caller has to make
>   * sure the page is locked and that nobody else uses it - or that usage
>   * is safe.  The caller must hold the i_pages lock.
> + *
> + * If this returns true, the caller must call inode_pages_clear()
> + * after dropping the i_pages lock.
>   */
> -void __delete_from_page_cache(struct page *page, void *shadow)
> +bool __must_check __delete_from_page_cache(struct page *page, void *shadow)
>  {
>         struct address_space *mapping = page->mapping;
>
>         trace_mm_filemap_delete_from_page_cache(page);
>
>         unaccount_page_cache_page(mapping, page);
> -       page_cache_delete(mapping, page, shadow);
> +       return page_cache_delete(mapping, page, shadow);
>  }
>
>  static void page_cache_free_page(struct address_space *mapping,
> @@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page)
>  {
>         struct address_space *mapping = page_mapping(page);
>         unsigned long flags;
> +       bool empty;
>
>         BUG_ON(!PageLocked(page));
>         xa_lock_irqsave(&mapping->i_pages, flags);
> -       __delete_from_page_cache(page, NULL);
> +       empty = __delete_from_page_cache(page, NULL);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         page_cache_free_page(mapping, page);
>  }
>  EXPORT_SYMBOL(delete_from_page_cache);
> @@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache);
>   *
>   * The function expects the i_pages lock to be held.
>   */
> -static void page_cache_delete_batch(struct address_space *mapping,
> -                            struct pagevec *pvec)
> +static bool __must_check page_cache_delete_batch(struct address_space *mapping,
> +                                                struct pagevec *pvec)
>  {
>         XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
>         int total_pages = 0;
> @@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping,
>                 total_pages++;
>         }
>         mapping->nrpages -= total_pages;
> +
> +       return mapping_empty(mapping);
>  }
>
>  void delete_from_page_cache_batch(struct address_space *mapping,
>                                   struct pagevec *pvec)
>  {
>         int i;
> +       bool empty;
>         unsigned long flags;
>
>         if (!pagevec_count(pvec))
> @@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping,
>
>                 unaccount_page_cache_page(mapping, pvec->pages[i]);
>         }
> -       page_cache_delete_batch(mapping, pvec);
> +       empty = page_cache_delete_batch(mapping, pvec);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         for (i = 0; i < pagevec_count(pvec); i++)
>                 page_cache_free_page(mapping, pvec->pages[i]);
>  }
> @@ -831,9 +846,10 @@ static int __add_to_page_cache_locked(struct page *page,
>                                       void **shadowp)
>  {
>         XA_STATE(xas, &mapping->i_pages, offset);
> +       int error;
>         int huge = PageHuge(page);
>         struct mem_cgroup *memcg;
> -       int error;
> +       bool populated = false;
>         void *old;
>
>         VM_BUG_ON_PAGE(!PageLocked(page), page);
> @@ -860,6 +876,7 @@ static int __add_to_page_cache_locked(struct page *page,
>                 if (xas_error(&xas))
>                         goto unlock;
>
> +               populated = mapping_empty(mapping);
>                 if (xa_is_value(old)) {
>                         mapping->nrexceptional--;
>                         if (shadowp)
> @@ -880,6 +897,10 @@ static int __add_to_page_cache_locked(struct page *page,
>         if (!huge)
>                 mem_cgroup_commit_charge(page, memcg, false, false);
>         trace_mm_filemap_add_to_page_cache(page);
> +
> +       if (populated)
> +               inode_pages_set(mapping->host);
> +
>         return 0;
>  error:
>         page->mapping = NULL;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b08b199f9a11..8b3e33a52d93 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2535,7 +2535,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                 /* Some pages can be beyond i_size: drop them from page cache */
>                 if (head[i].index >= end) {
>                         ClearPageDirty(head + i);
> -                       __delete_from_page_cache(head + i, NULL);
> +                       /* We know we're not removing the last page */
> +                       (void)__delete_from_page_cache(head + i, NULL);
>                         if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
>                                 shmem_uncharge(head->mapping->host, 1);
>                         put_page(head + i);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index dd9ebc1da356..8fb6c2f762bc 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -31,24 +31,31 @@
>   * itself locked.  These unlocked entries need verification under the tree
>   * lock.
>   */
> -static inline void __clear_shadow_entry(struct address_space *mapping,
> -                               pgoff_t index, void *entry)
> +static bool __must_check __clear_shadow_entry(struct address_space *mapping,
> +                                             pgoff_t index, void *entry)
>  {
>         XA_STATE(xas, &mapping->i_pages, index);
>
>         xas_set_update(&xas, workingset_update_node);
>         if (xas_load(&xas) != entry)
> -               return;
> +               return false;
>         xas_store(&xas, NULL);
>         mapping->nrexceptional--;
> +
> +       return mapping_empty(mapping);
>  }
>
>  static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
>                                void *entry)
>  {
> +       bool empty;
> +
>         xa_lock_irq(&mapping->i_pages);
> -       __clear_shadow_entry(mapping, index, entry);
> +       empty = __clear_shadow_entry(mapping, index, entry);
>         xa_unlock_irq(&mapping->i_pages);
> +
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>  }
>
>  /*
> @@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
>                                 pgoff_t end)
>  {
>         int i, j;
> -       bool dax, lock;
> +       bool dax, lock, empty = false;
>
>         /* Handled by shmem itself */
>         if (shmem_mapping(mapping))
> @@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
>                         continue;
>                 }
>
> -               __clear_shadow_entry(mapping, index, page);
> +               if (__clear_shadow_entry(mapping, index, page))
> +                       empty = true;
>         }
>
>         if (lock)
>                 xa_unlock_irq(&mapping->i_pages);
> +
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         pvec->nr = j;
>  }
>
> @@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>         pgoff_t         index;
>         int             i;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 goto out;
>
>         /* Offsets within partial pages */
> @@ -636,6 +648,7 @@ static int
>  invalidate_complete_page2(struct address_space *mapping, struct page *page)
>  {
>         unsigned long flags;
> +       bool empty;
>
>         if (page->mapping != mapping)
>                 return 0;
> @@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
>                 goto failed;
>
>         BUG_ON(page_has_private(page));
> -       __delete_from_page_cache(page, NULL);
> +       empty = __delete_from_page_cache(page, NULL);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         if (mapping->a_ops->freepage)
>                 mapping->a_ops->freepage(page);
>
> @@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>         int ret2 = 0;
>         int did_range_unmap = 0;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 goto out;
>
>         pagevec_init(&pvec);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c05eb9efec07..c82e9831003f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>         } else {
>                 void (*freepage)(struct page *);
>                 void *shadow = NULL;
> +               bool empty;
>
>                 freepage = mapping->a_ops->freepage;
>                 /*
> @@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>                 if (reclaimed && page_is_file_cache(page) &&
>                     !mapping_exiting(mapping) && !dax_mapping(mapping))
>                         shadow = workingset_eviction(page, target_memcg);
> -               __delete_from_page_cache(page, shadow);
> +               empty = __delete_from_page_cache(page, shadow);
>                 xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +               if (empty)
> +                       inode_pages_clear(mapping->host);
> +
>                 if (freepage != NULL)
>                         freepage(page);
>         }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 78d53378db99..ae802253f71c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1203,9 +1203,11 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_NUMA
>         "zone_reclaim_failed",
>  #endif
> -       "pginodesteal",
>         "slabs_scanned",
> +       "pginodesteal",
>         "kswapd_inodesteal",
> +       "pginoderescue",
> +       "pginodedelayed",
>         "kswapd_low_wmark_hit_quickly",
>         "kswapd_high_wmark_hit_quickly",
>         "pageoutrun",
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 474186b76ced..7ce9c74ebfdb 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>         struct xa_node *node = container_of(item, struct xa_node, private_list);
>         XA_STATE(xas, node->array, 0);
>         struct address_space *mapping;
> +       bool empty = false;
>         int ret;
>
>         /*
> @@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>         if (WARN_ON_ONCE(node->count != node->nr_values))
>                 goto out_invalid;
>         mapping->nrexceptional -= node->nr_values;
> +       empty = mapping_empty(mapping);
>         xas.xa_node = xa_parent_locked(&mapping->i_pages, node);
>         xas.xa_offset = node->offset;
>         xas.xa_shift = node->shift + XA_CHUNK_SHIFT;
> @@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>
>  out_invalid:
>         xa_unlock_irq(&mapping->i_pages);
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>         ret = LRU_REMOVED_RETRY;
>  out:
>         cond_resched();
> --
> 2.24.1
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 23:44     ` Andrew Morton
  2020-02-12  0:28       ` Linus Torvalds
@ 2020-02-12 16:35       ` Johannes Weiner
  2020-02-12 18:26         ` Andrew Morton
  1 sibling, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-02-12 16:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, linux-fsdevel, linux-mm, linux-kernel,
	Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin,
	Linus Torvalds, Al Viro, kernel-team

On Tue, Feb 11, 2020 at 03:44:38PM -0800, Andrew Morton wrote:
> On Tue, 11 Feb 2020 14:31:01 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote:
> > > On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote:
> > > > The VFS inode shrinker is currently allowed to reclaim inodes with
> > > > populated page cache. As a result it can drop gigabytes of hot and
> > > > active page cache on the floor without consulting the VM (recorded as
> > > > "inodesteal" events in /proc/vmstat).
> > > > 
> > > > This causes real problems in practice. Consider for example how the
> > > > VM
> > > > would cache a source tree, such as the Linux git tree. As large parts
> > > > of the checked out files and the object database are accessed
> > > > repeatedly, the page cache holding this data gets moved to the active
> > > > list, where it's fully (and indefinitely) insulated from one-off
> > > > cache
> > > > moving through the inactive list.
> > > 
> > > > This behavior of invalidating page cache from the inode shrinker goes
> > > > back to even before the git import of the kernel tree. It may have
> > > > been less noticeable when the VM itself didn't have real workingset
> > > > protection, and floods of one-off cache would push out any active
> > > > cache over time anyway. But the VM has come a long way since then and
> > > > the inode shrinker is now actively subverting its caching strategy.
> > > 
> > > Two things come to mind when looking at this:
> > > - highmem
> > > - NUMA
> > > 
> > > IIRC one of the reasons reclaim is done in this way is
> > > because a page cache page in one area of memory (highmem,
> > > or a NUMA node) can end up pinning inode slab memory in
> > > another memory area (normal zone, other NUMA node).
> > 
> > That's a good point, highmem does ring a bell now that you mention it.
> 
> Yup, that's why this mechanism exists.  Here:
> 
> https://marc.info/?l=git-commits-head&m=103646757213266&w=2

Ah, thanks for digging that up, I did not know that.

> > If we still care, I think this could be solved by doing something
> > similar to what we do with buffer_heads_over_limit: allow a lowmem
> > allocation to reclaim page cache inside the highmem zone if the bhs
> > (or inodes in this case) have accumulated excessively.
> 
> Well, reclaiming highmem pagecache at random would be a painful way to
> reclaim lowmem inodes.  Better to pick an inode then shoot down all its
> pagecache.  Perhaps we could take its pagecache's aging into account.

That reminds me of trying to correlate inode pages in reclaim to batch
the cache tree lock, slab page objects in the shrinker to free whole
pages etc. We never managed to actually do that. :-)

> Testing this will be a challenge, but the issue was real - a 7GB
> highmem machine isn't crazy and I expect the inode has become larger
> since those days.

Since the cache purging code was written for highmem scenarios, how
about making it specific to CONFIG_HIGHMEM at least?

That way we improve the situation for the more common setups, without
regressing highmem configurations. And if somebody wanted to improve
the CONFIG_HIGHMEM behavior as well, they could still do so.

Something like the below delta on top of my patch?

---
 fs/inode.c         | 44 ++++++++++++++++++++++++++++++++++++++++----
 include/linux/fs.h |  5 +++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 575b780fa9bb..45b2abd4fef6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -454,6 +454,18 @@ bool inode_add_lru(struct inode *inode)
 	return true;
 }
 
+/*
+ * Usually, inodes become reclaimable when they are no longer
+ * referenced and their page cache has been reclaimed. The following
+ * API allows the VM to communicate cache population state to the VFS.
+ *
+ * However, on CONFIG_HIGHMEM we can't wait for the page cache to go
+ * away: cache pages allocated in a large highmem zone could pin
+ * struct inode memory allocated in relatively small lowmem zones. So
+ * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime.
+ */
+
+#ifndef CONFIG_HIGHMEM
 /**
  * inode_pages_set - mark the inode as holding page cache
  * @inode: the inode whose first cache page was just added
@@ -512,6 +524,7 @@ void inode_pages_clear(struct inode *inode)
 
 	spin_unlock(&inode->i_lock);
 }
+#endif /* !CONFIG_HIGHMEM */
 
 /**
  * inode_sb_list_add - add inode to the superblock list of inodes
@@ -826,16 +839,39 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	}
 
 	/*
-	 * Populated inodes shouldn't be on the shrinker LRU, but they
-	 * can be briefly visible when a new page is added to an inode
-	 * that was already linked but inode_pages_set() hasn't run
-	 * yet to move them off.
+	 * Usually, populated inodes shouldn't be on the shrinker LRU,
+	 * but they can be briefly visible when a new page is added to
+	 * an inode that was already linked but inode_pages_set()
+	 * hasn't run yet to move them off.
+	 *
+	 * The other exception is on HIGHMEM systems: highmem cache
+	 * can pin lowmem struct inodes, and we might be in dire
+	 * straits in the lower zones. Purge cache to free the inode.
 	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
+#ifdef CONFIG_HIGHMEM
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(lru_lock);
+		if (remove_inode_buffers(inode)) {
+			unsigned long reap;
+			reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
+			if (current_is_kswapd())
+				__count_vm_events(KSWAPD_INODESTEAL, reap);
+			else
+				__count_vm_events(PGINODESTEAL, reap);
+			if (current->reclaim_state)
+				current->reclaim_state->reclaimed_slab += reap;
+		}
+		iput(inode);
+		spin_lock(lru_lock);
+		return LRU_RETRY;
+#else
 		list_lru_isolate(lru, &inode->i_lru);
 		spin_unlock(&inode->i_lock);
 		this_cpu_dec(nr_unused);
 		return LRU_REMOVED;
+#endif
 	}
 
 	WARN_ON(inode->i_state & I_NEW);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a98d9dee39f4..abdb3fd3432f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3106,8 +3106,13 @@ static inline void remove_inode_hash(struct inode *inode)
 		__remove_inode_hash(inode);
 }
 
+#ifndef CONFIG_HIGHMEM
 extern void inode_pages_set(struct inode *inode);
 extern void inode_pages_clear(struct inode *inode);
+#else
+static inline void inode_pages_set(struct inode *inode) {}
+static inline void inode_pages_clear(struct inode *inode) {}
+#endif
 
 extern void inode_sb_list_add(struct inode *inode);
 
-- 
2.24.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12 12:25 ` Yafang Shao
@ 2020-02-12 16:42   ` Johannes Weiner
  2020-02-13  1:47     ` Yafang Shao
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-02-12 16:42 UTC (permalink / raw)
  To: Yafang Shao
  Cc: linux-fsdevel, Linux MM, LKML, Dave Chinner, Michal Hocko,
	Roman Gushchin, Andrew Morton, Linus Torvalds, Al Viro,
	Kernel Team

On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote:
> On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Another variant of this problem was recently observed, where the
> > kernel violates cgroups' memory.low protection settings and reclaims
> > page cache way beyond the configured thresholds. It was followed by a
> > proposal of a modified form of the reverted commit above, that
> > implements memory.low-sensitive shrinker skipping over populated
> > inodes on the LRU [1]. However, this proposal continues to run the
> > risk of attracting disproportionate reclaim pressure to a pool of
> > still-used inodes,
> 
> Hi Johannes,
> 
> If you really think that is a risk, what about the below additional patch
> to fix this risk?
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 80dddbc..61862d9 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
>                 goto out;
> 
>         cgroup_size = mem_cgroup_size(memcg);
> -       if (inode->i_data.nrpages + protection >= cgroup_size)
> +       if (inode->i_data.nrpages)
>                 reclaimable = false;
> 
>  out:
> 
> With this additional patch, we skip all inodes in this memcg until all
> its page cache pages are reclaimed.

Well that's something we've tried and had to revert because it caused
issues in slab reclaim. See the History part of my changelog.

> > while not addressing the more generic reclaim
> > inversion problem outside of a very specific cgroup application.
> >
> 
> But I have a different understanding.  This method works like a
> knob. If you really care about your workingset (data), you should
> turn it on (i.e. by using memcg protection to protect them), while
> if you don't care about your workingset (data) then you'd better
> turn it off. That would be more flexible.  Regarding your case in the
> commit log, why not protect your linux git tree with memcg
> protection ?

I can't imagine a scenario where I *wouldn't* care about my
workingset, though. Why should it be opt-in, not the default?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12 16:35       ` Johannes Weiner
@ 2020-02-12 18:26         ` Andrew Morton
  2020-02-12 18:52           ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2020-02-12 18:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, linux-fsdevel, linux-mm, linux-kernel,
	Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin,
	Linus Torvalds, Al Viro, kernel-team

On Wed, 12 Feb 2020 11:35:40 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Since the cache purging code was written for highmem scenarios, how
> about making it specific to CONFIG_HIGHMEM at least?

Why do I have memories of suggesting this a couple of weeks ago ;)

> That way we improve the situation for the more common setups, without
> regressing highmem configurations. And if somebody wanted to improve
> the CONFIG_HIGHMEM behavior as well, they could still do so.
> 
> Somethig like the below delta on top of my patch?

Does it need to be that complicated?  What's wrong with

--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -761,6 +761,10 @@ static enum lru_status inode_lru_isolate
 		return LRU_ROTATE;
 	}
 
+#ifdef CONFIG_HIGHMEM
+	/*
+	 * lengthy blah
+	 */
 	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
@@ -779,6 +783,7 @@ static enum lru_status inode_lru_isolate
 		spin_lock(lru_lock);
 		return LRU_RETRY;
 	}
+#endif
 
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
_

Whatever we do will need plenty of testing.  It wouldn't surprise me
if there are people who unknowingly benefit from this code on
64-bit machines.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12 18:26         ` Andrew Morton
@ 2020-02-12 18:52           ` Johannes Weiner
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-02-12 18:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, linux-fsdevel, linux-mm, linux-kernel,
	Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin,
	Linus Torvalds, Al Viro, kernel-team

On Wed, Feb 12, 2020 at 10:26:45AM -0800, Andrew Morton wrote:
> On Wed, 12 Feb 2020 11:35:40 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Since the cache purging code was written for highmem scenarios, how
> > about making it specific to CONFIG_HIGHMEM at least?
> 
> Why do I have memories of suggesting this a couple of weeks ago ;)

Sorry, you did. I went back and found your email now. It completely
slipped my mind after that thread went off into another direction.

> > That way we improve the situation for the more common setups, without
> > regressing highmem configurations. And if somebody wanted to improve
> > the CONFIG_HIGHMEM behavior as well, they could still do so.
> > 
> > Something like the below delta on top of my patch?
> 
> Does it need to be that complicated?  What's wrong with
> 
> --- a/fs/inode.c~a
> +++ a/fs/inode.c
> @@ -761,6 +761,10 @@ static enum lru_status inode_lru_isolate
>  		return LRU_ROTATE;
>  	}
>  
> +#ifdef CONFIG_HIGHMEM
> +	/*
> +	 * lengthy blah
> +	 */
>  	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> @@ -779,6 +783,7 @@ static enum lru_status inode_lru_isolate
>  		spin_lock(lru_lock);
>  		return LRU_RETRY;
>  	}
> +#endif

Pages can show up here even under !CONFIG_HIGHMEM. Because of the lock
order to maintain LRU state (i_lock -> xa_lock), when the page cache
inserts new pages it doesn't unlink the inode from the LRU atomically,
and the shrinker might get here before inode_pages_set(). In that case
we need the shrinker to punt the inode off the LRU (the #else branch).

>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state |= I_FREEING;
> _
> 
> Whatever we do will need plenty of testing.  It wouldn't surprise me
> if there are people who unknowingly benefit from this code on
> 64-bit machines.

If we agree this is the way to go, I can put the patch into our tree
and gather data from the Facebook fleet before we merge it.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12 16:42   ` Johannes Weiner
@ 2020-02-13  1:47     ` Yafang Shao
  2020-02-13 13:46       ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Yafang Shao @ 2020-02-13  1:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-fsdevel, Linux MM, LKML, Dave Chinner, Michal Hocko,
	Roman Gushchin, Andrew Morton, Linus Torvalds, Al Viro,
	Kernel Team

On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote:
> > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > Another variant of this problem was recently observed, where the
> > > kernel violates cgroups' memory.low protection settings and reclaims
> > > page cache way beyond the configured thresholds. It was followed by a
> > > proposal of a modified form of the reverted commit above, that
> > > implements memory.low-sensitive shrinker skipping over populated
> > > inodes on the LRU [1]. However, this proposal continues to run the
> > > risk of attracting disproportionate reclaim pressure to a pool of
> > > still-used inodes,
> >
> > Hi Johannes,
> >
> > If you really think that is a risk, what about the below additional patch
> > to fix this risk?
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 80dddbc..61862d9 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
> >                 goto out;
> >
> >         cgroup_size = mem_cgroup_size(memcg);
> > -       if (inode->i_data.nrpages + protection >= cgroup_size)
> > +       if (inode->i_data.nrpages)
> >                 reclaimable = false;
> >
> >  out:
> >
> > With this additional patch, we skip all inodes in this memcg until all
> > its page cache pages are reclaimed.
>
> Well that's something we've tried and had to revert because it caused
> issues in slab reclaim. See the History part of my changelog.
>

You misunderstood it.
The reverted patch skips all inodes in the system, while this patch
only works when you turn on memcg.{min, low} protection.
IOW, it is not the default behavior: it only works when you want it,
and it only affects your targeted memcg rather than the whole system.

> > > while not addressing the more generic reclaim
> > > inversion problem outside of a very specific cgroup application.
> > >
> >
> > But I have a different understanding.  This method works like a
> > knob. If you really care about your workingset (data), you should
> > turn it on (i.e. by using memcg protection to protect them), while
> > if you don't care about your workingset (data) then you'd better
> > turn it off. That would be more flexible.  Regarding your case in the
> > commit log, why not protect your linux git tree with memcg
> > protection?
>
> I can't imagine a scenario where I *wouldn't* care about my
> workingset, though. Why should it be opt-in, not the default?

Because the default behavior has caused the XFS performance hit.
(I haven't checked your patch carefully, so I don't know whether your
patch fixes it yet.)


Thanks

Yafang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12  8:50             ` Russell King - ARM Linux admin
@ 2020-02-13  9:50               ` Lucas Stach
  2020-02-13 16:52               ` Arnd Bergmann
  1 sibling, 0 replies; 35+ messages in thread
From: Lucas Stach @ 2020-02-13  9:50 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Linus Torvalds
  Cc: Michal Hocko, Rik van Riel, Catalin Marinas, kernel-team,
	Dave Chinner, Linux Kernel Mailing List, Linux-MM, Yafang Shao,
	Al Viro, Johannes Weiner, linux-fsdevel, Andrew Morton,
	Roman Gushchin, Linux ARM

On Mi, 2020-02-12 at 08:50 +0000, Russell King - ARM Linux admin wrote:
> On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
> > On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > What's the situation with highmem on ARM?
> > 
> > Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
> > needed it, and I was ranting at some people for repeating all the
> > mistakes Intel did.
> > 
> > But arm64 doesn't need it, and while 32-bit arm is obviously still
> > selling, I think that in many ways the switch-over to 64-bit has been
> > quicker on ARM than it was on x86. Partly because it happened later
> > (so all the 64-bit teething pains were dealt with), but largely
> > because everybody ended up actively discouraging 32-bit on the Android
> > side.
> > 
> > There were a couple of unfortunate early 32-bit arm server attempts,
> > but they were - predictably - complete garbage and nobody bought them.
> > They don't exist any more.
> > 
> > So at least my gut feel is that the arm people don't have any big
> > reason to push for maintaining HIGHMEM support either.
> > 
> > But I'm adding a couple of arm people and the arm list just in case
> > they have some input.
> > 
> > [ Obvious background for newly added people: we're talking about
> > making CONFIG_HIGHMEM a deprecated feature and saying that if you want
> > to run with lots of memory on a 32-bit kernel, you're doing legacy
> > stuff and can use a legacy kernel ]
> 
> Well, the recent 32-bit ARM systems generally have more than 1G
> of memory, so make use of highmem as a rule.  You're probably
> talking about crippling support for any 32-bit ARM system produced
> in the last 8 to 10 years.

I know of quite a few embedded ARMv7 systems that will need to be
supported for at least 10 years from now, with RAM sizes between 1GB
and even the full 4GB (well 3.75GB due to MMIO needing some address
space). Deprecating highmem would severely cripple our ability to
support those platforms with new kernels.

Regards,
Lucas


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-13  1:47     ` Yafang Shao
@ 2020-02-13 13:46       ` Johannes Weiner
  2020-02-14  2:02         ` Yafang Shao
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-02-13 13:46 UTC (permalink / raw)
  To: Yafang Shao
  Cc: linux-fsdevel, Linux MM, LKML, Dave Chinner, Michal Hocko,
	Roman Gushchin, Andrew Morton, Linus Torvalds, Al Viro,
	Kernel Team

On Thu, Feb 13, 2020 at 09:47:29AM +0800, Yafang Shao wrote:
> On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote:
> > > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > Another variant of this problem was recently observed, where the
> > > > kernel violates cgroups' memory.low protection settings and reclaims
> > > > page cache way beyond the configured thresholds. It was followed by a
> > > > proposal of a modified form of the reverted commit above, that
> > > > implements memory.low-sensitive shrinker skipping over populated
> > > > inodes on the LRU [1]. However, this proposal continues to run the
> > > > risk of attracting disproportionate reclaim pressure to a pool of
> > > > still-used inodes,
> > >
> > > Hi Johannes,
> > >
> > > If you really think that is a risk, what about the below additional patch
> > > to fix this risk?
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 80dddbc..61862d9 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
> > >                 goto out;
> > >
> > >         cgroup_size = mem_cgroup_size(memcg);
> > > -       if (inode->i_data.nrpages + protection >= cgroup_size)
> > > +       if (inode->i_data.nrpages)
> > >                 reclaimable = false;
> > >
> > >  out:
> > >
> > > With this additional patch, we skip all inodes in this memcg until all
> > > its page cache pages are reclaimed.
> >
> > Well that's something we've tried and had to revert because it caused
> > issues in slab reclaim. See the History part of my changelog.
> 
> You misunderstood it.
> The reverted patch skips all inodes in the system, while this patch
> only works when you turn on memcg.{min, low} protection.
> IOW, it is not the default behavior: it only works when you want it,
> and it only affects your targeted memcg rather than the whole system.

I understand perfectly well.

Keeping unreclaimable inodes on the shrinker LRU causes the shrinker
to build up excessive pressure on all VFS objects. This is a
bug. Making it cgroup-specific doesn't make it less of a bug, it just
means you only hit the bug when you use cgroup memory protection.

> > > > while not addressing the more generic reclaim
> > > > inversion problem outside of a very specific cgroup application.
> > > >
> > >
> > > But I have a different understanding.  This method works like a
> > > knob. If you really care about your workingset (data), you should
> > > turn it on (i.e. by using memcg protection to protect them), while
> > > if you don't care about your workingset (data) then you'd better
> > > turn it off. That would be more flexible.  Regarding your case in the
> > > commit log, why not protect your linux git tree with memcg
> > > protection?
> >
> > I can't imagine a scenario where I *wouldn't* care about my
> > workingset, though. Why should it be opt-in, not the default?
> 
> Because the default behavior has caused the XFS performance hit.

That means that with your proposal you cannot use cgroup memory
protection for workloads that run on xfs.

(And if I remember the bug report correctly, this wasn't just xfs. It
also caused metadata caches on other filesystems to get trashed. xfs
was just more pronounced because it does sync inode flushing from the
shrinker, adding write stalls to the mix of metadata cache misses.)

What I'm proposing is an implementation that protects hot page cache
without causing excessive shrinker pressure and rotations.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12  8:50             ` Russell King - ARM Linux admin
  2020-02-13  9:50               ` Lucas Stach
@ 2020-02-13 16:52               ` Arnd Bergmann
  2020-02-15 11:25                 ` Geert Uytterhoeven
  2020-02-26 18:04                 ` santosh.shilimkar
  1 sibling, 2 replies; 35+ messages in thread
From: Arnd Bergmann @ 2020-02-13 16:52 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Linus Torvalds, Michal Hocko, Rik van Riel, Catalin Marinas,
	kernel-team, Dave Chinner, Linux Kernel Mailing List, Linux-MM,
	Yafang Shao, Al Viro, Johannes Weiner, linux-fsdevel,
	Andrew Morton, Roman Gushchin, Linux ARM, Kishon Vijay Abraham I,
	Santosh Shilimkar

On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
> > On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > What's the situation with highmem on ARM?
> >
> > Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
> > needed it, and I was ranting at some people for repeating all the
> > mistakes Intel did.
> >
> > But arm64 doesn't need it, and while 32-bit arm is obviously still
> > selling, I think that in many ways the switch-over to 64-bit has been
> > quicker on ARM than it was on x86. Partly because it happened later
> > (so all the 64-bit teething pains were dealt with), but largely
> > because everybody ended up actively discouraging 32-bit on the Android
> > side.
> >
> > There were a couple of unfortunate early 32-bit arm server attempts,
> > but they were - predictably - complete garbage and nobody bought them.
> > They don't exist any more.

I'd generally agree with that; the systems with more than 4GB tended to
be high-end systems predating the Cortex-A53/A57 that quickly got
replaced once there were actual 64-bit parts, this would include axm5516
(replaced with x86-64 cores after sale to Intel), hip04 (replaced
with arm64), or ecx-2000 (Calxeda bankruptcy).

The one 32-bit SoC that I can think of that can actually drive lots of
RAM and is still actively marketed is TI Keystone-2/AM5K2.
The embedded AM5K2 is listed as supporting up to 8GB of RAM, but
the version in the HPE ProLiant m800 server could take up to 32GB (!).

I added Santosh and Kishon to Cc, they can probably comment on how
long they think users will upgrade kernels on these. I suspect these
devices can live for a very long time in things like wireless base stations,
but it's possible that they all run on old kernels anyway by now (and are
not worried about y2038).

> > So at least my gut feel is that the arm people don't have any big
> > reason to push for maintaining HIGHMEM support either.
> >
> > But I'm adding a couple of arm people and the arm list just in case
> > they have some input.
> >
> > [ Obvious background for newly added people: we're talking about
> > making CONFIG_HIGHMEM a deprecated feature and saying that if you want
> > to run with lots of memory on a 32-bit kernel, you're doing legacy
> > stuff and can use a legacy kernel ]
>
> Well, the recent 32-bit ARM systems generally have more than 1G
> of memory, so make use of highmem as a rule.  You're probably
> talking about crippling support for any 32-bit ARM system produced
> in the last 8 to 10 years.

What I'm observing in the newly added board support is that memory
configurations are actually going down, driven by component cost.
512MB is really cheap (~$4) these days with a single 256Mx16 DDR3
chip or two 128Mx16. Going beyond 1GB is where things get expensive
with either 4+ chips or LPDDR3/LPDDR4 memory.

For designs with 1GB, we're probably better off just using
CONFIG_VMSPLIT_3G_OPT (without LPAE) anyway, completely
avoiding highmem. That is particularly true on systems with a custom
kernel configuration.

2GB machines are less common, but are definitely important, e.g.
MT6580 based Android phones and some industrial embedded machines
that will live a long time. I've recently seen reports of odd behavior
with CONFIG_VMSPLIT_2G plus CONFIG_HIGHMEM and a 7:1
ratio of lowmem to highmem that apparently causes OOM despite lots
of lowmem being free. I suspect a lot of those workloads would still be
better off with a CONFIG_VMSPLIT_2G_OPT (1.75 GB user, 2GB
linear map). That config unfortunately has a few problems, too:
- nobody has implemented it
- it won't work with LPAE and therefore cannot support hardware
  that relies on high physical addresses for RAM or MMIO
  (those could run CONFIG_VMSPLIT_2G at the cost of wasting
  12.5% of RAM).
- any workload that requires the full 3GB of virtual address space won't
  work at all. This might be e.g. MAP_FIXED users, or build servers
  linking large binaries.
It will take a while to find out what kinds of workloads suffer the most
from a different vmsplit and what can be done to address that, but we
could start by changing the kernel defconfig and distro builds to see
who complains ;-)

I think 32-bit ARM machines with 3GB or more are getting very rare,
but some still exist:
- The Armada XP development board had a DIMM slot that could take
  large memory (possibly up to 8GB with LPAE). This never shipped as
  a commercial product, but distro build servers sometimes still run on
  this, or on the old Calxeda or Keystone server systems.
- a few early i.MX6 boards (e.g. HummingBoard) came with 4GB of
  RAM, though none of these seem to be available any more.
- High-end phones from 2013/2014 had 3GB LPDDR3 before getting
  obsoleted by 64-bit phones. Presumably none of these ever ran
  Linux-4.x or newer.
- My main laptop is a RK3288 based Chromebook with 4GB that just
  got updated to linux-4.19 by Google. Official updates apparently
  stop this summer, but it could easily run Debian later on.
- Some people run 32-bit kernels on a 64-bit Raspberry Pi 4 or on
  arm64 KVM with lots of RAM. These should probably all
  migrate to 64-bit kernels with compat user space anyway.
In theory these could also run on a VMSPLIT_4G_4G-like setup,
but I don't think anyone wants to go there. Deprecating highmem
definitely impacts any such users significantly, though staying on
an LTS kernel may be an option if there are only few of them.

           Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 17:55 [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
                   ` (2 preceding siblings ...)
  2020-02-12 12:25 ` Yafang Shao
@ 2020-02-13 18:34 ` " Johannes Weiner
  2020-02-14 16:53 ` [PATCH] " kbuild test robot
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-02-13 18:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel
  Cc: Dave Chinner, Yafang Shao, Michal Hocko, Roman Gushchin,
	Andrew Morton, Linus Torvalds, Al Viro, kernel-team

Changes since v1:
- retain inode-driven cache reclaim for CONFIG_HIGHMEM
- fix a compile bug in fs/dax.c (mapping->inode -> mapping->host)

I'm going to start wider-scale testing with this updated version.

---
From 6b76483fd64fd47e04eddbb2025e3d6a3b4aaec0 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 4 Feb 2020 18:12:48 -0500
Subject: [PATCH v2] vfs: keep inodes with page cache off the inode shrinker LRU

The VFS inode shrinker is currently allowed to reclaim inodes with
populated page cache. As a result it can drop gigabytes of hot and
active page cache on the floor without consulting the VM.

The reason for this goes back to highmem: page cache in the highmem
zones can pin struct inode objects in small lowmem zones and get the
whole system into trouble. As a result, the inode shrinker zaps the
page cache to free up the lowmem struct inodes.

Details: https://marc.info/?l=git-commits-head&m=103646757213266&w=2

But the cost of doing this isn't justifiable on the more prevalent
!CONFIG_HIGHMEM systems nowadays.

Consider for example how the VM would cache a source tree, such as the
Linux git tree. As large parts of the checked out files and the object
database are accessed repeatedly, the page cache holding this data
gets moved to the active list, where it's fully (and indefinitely)
insulated from one-off cache moving through the inactive list. But due
to the way users interact with the tree, no ongoing open file
descriptors into the source tree are maintained, and the inodes end up
on the shrinker LRU. A larger burst of one-off cache (find, updatedb,
etc.) can now drive the shrinkers to drop first the dentries and then
the inodes - inodes that contain the most valuable data currently held
by the page cache - while there is plenty of one-off cache that could
be reclaimed instead.

This may have been less of a concern when the VM itself didn't have
real workingset protection, and one-off cache would push out active
cache over time anyway. But we've come a long way since, and the inode
shrinker is now actively in conflict with the VM's caching strategy.

Previous proposals

As this keeps causing problems for people, there have been several
attempts to address this.

One recent attempt was to make the inode shrinker simply skip over
inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
inodes with many attached pages").

However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
don't reclaim inodes with many attached pages"") because it caused
excessive pressure build up on the VFS objects: Inodes that sit on the
shrinker LRU are attracting reclaim pressure away from the page cache
and toward the VFS. If we then permanently exempt sizable portions of
this pool from actually getting reclaimed when looked at, this
pressure accumulates as deferred work (a mechanism for *temporarily*
unreclaimable objects) until it causes mayhem in the VFS cache pools.

In the bug quoted in 69056ee6a8a3 in particular, the excessive
pressure drove the XFS shrinker into dirty objects, where it caused
synchronous, IO-bound stalls, even as there was plenty of clean page
cache that should have been reclaimed instead.

Another variant of this problem was recently observed, where the
kernel violates cgroups' memory.low protection settings and reclaims
page cache way beyond the configured thresholds. It was followed by a
proposal of a modified form of the reverted commit above, that
implements memory.low-sensitive shrinker skipping over populated
inodes on the LRU [1]. However, this proposal continues to run the
risk of attracting disproportionate reclaim pressure to a pool of
still-used inodes, while not addressing the more generic reclaim
inversion problem outside of a very specific cgroup application.

[1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/

Solution

To fix the reclaim inversion in the shrinker, without reintroducing
the problems associated with shrinker LRU rotations, this patch keeps
populated inodes off the LRUs entirely on !CONFIG_HIGHMEM systems.

Currently, inodes are kept off the shrinker LRU as long as they have
an elevated i_count, indicating an active user. Unfortunately, the
page cache cannot simply hold an i_count reference, because unlink()
*should* result in the inode being dropped and its cache invalidated.

Instead, this patch makes iput_final() consult the state of the page
cache and punt the LRU linking to the VM if the inode is still
populated; the VM in turn checks the inode state when it depopulates
the page cache, and adds the inode to the LRU if necessary.

This is not unlike what we do for dirty inodes, which are moved off
the LRU permanently until writeback completion puts them back on (iff
still unused). We can reuse the same code -- inode_add_lru() -- here.

This is also not unlike page reclaim, where the lower VM layer has to
negotiate state with the higher VFS layer. Follow existing precedent
and handle the inversion as much as possible on the VM side:

- introduce an I_PAGES flag that the VM maintains under the i_lock, so
  that any inode code holding that lock can check the page cache state
  without having to lock and inspect the struct address_space

- introduce inode_pages_set() and inode_pages_clear() to maintain the
  inode LRU state from the VM side, then update all cache mutators to
  use them when populating the first cache entry or clearing the last

With this, the concept of "inodesteal" - where the inode shrinker
drops page cache - is a thing of the past. The VM is in charge of the
page cache, the inode shrinker is in charge of freeing struct inode.

Footnotes

- For debuggability, add vmstat counters that track the number of
  times a new cache entry pulls a previously unused inode off the LRU
  (pginoderescue), as well as how many times existing cache deferred
  an LRU addition.

- Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
  cache. Not doing so has always been a bit strange, but since most
  people drop cache and metadata cache together, the inode shrinker
  would have taken care of them before - no more, so do it VM-side.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c                |   2 +-
 fs/dax.c                      |  14 +++++
 fs/drop_caches.c              |   2 +-
 fs/inode.c                    | 112 +++++++++++++++++++++++++++++++---
 fs/internal.h                 |   2 +-
 include/linux/fs.h            |  17 ++++++
 include/linux/pagemap.h       |   2 +-
 include/linux/vm_event_item.h |   3 +-
 mm/filemap.c                  |  39 +++++++++---
 mm/huge_memory.c              |   3 +-
 mm/truncate.c                 |  34 ++++++++---
 mm/vmscan.c                   |   6 +-
 mm/vmstat.c                   |   4 +-
 mm/workingset.c               |   4 ++
 14 files changed, 209 insertions(+), 35 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..46f67147ad20 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/dax.c b/fs/dax.c
index 35da144375a0..f68b71f81e5b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas,
 {
 	unsigned long index = xas->xa_index;
 	bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
+	int populated;
 	void *entry;
 
 retry:
+	populated = 0;
 	xas_lock_irq(xas);
 	entry = get_unlocked_entry(xas, order);
 
@@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas,
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, true);
 		mapping->nrexceptional--;
+		if (mapping_empty(mapping))
+			populated = -1;
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas,
 		dax_lock_entry(xas, entry);
 		if (xas_error(xas))
 			goto out_unlock;
+		if (mapping_empty(mapping))
+			populated++;
 		mapping->nrexceptional++;
 	}
 
 out_unlock:
 	xas_unlock_irq(xas);
+	if (populated == -1)
+		inode_pages_clear(mapping->host);
+	else if (populated == 1)
+		inode_pages_set(mapping->host);
 	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
 		goto retry;
 	if (xas->xa_node == XA_ERROR(-ENOMEM))
@@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
+	bool empty = false;
 	int ret = 0;
 	void *entry;
 
@@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	dax_disassociate_entry(entry, mapping, trunc);
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+	empty = mapping_empty(mapping);
 	ret = 1;
 out:
 	put_unlocked_entry(&xas, entry);
 	xas_unlock_irq(&xas);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	return ret;
 }
 
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index dc1a1d5d825b..a5e9e9053474 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		 * we need to reschedule to avoid softlockups.
 		 */
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0 && !need_resched())) {
+		    (mapping_empty(inode->i_mapping) && !need_resched())) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..45b2abd4fef6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -430,26 +430,101 @@ static void inode_lru_list_add(struct inode *inode)
 		inode->i_state |= I_REFERENCED;
 }
 
+static void inode_lru_list_del(struct inode *inode)
+{
+	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
+		this_cpu_dec(nr_unused);
+}
+
 /*
  * Add inode to LRU if needed (inode is unused and clean).
  *
  * Needs inode->i_lock held.
  */
-void inode_add_lru(struct inode *inode)
+bool inode_add_lru(struct inode *inode)
 {
-	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
-				I_FREEING | I_WILL_FREE)) &&
-	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
-		inode_lru_list_add(inode);
+	if (inode->i_state &
+	    (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES))
+		return false;
+	if (atomic_read(&inode->i_count))
+		return false;
+	if (!(inode->i_sb->s_flags & SB_ACTIVE))
+		return false;
+	inode_lru_list_add(inode);
+	return true;
 }
 
+/*
+ * Usually, inodes become reclaimable when they are no longer
+ * referenced and their page cache has been reclaimed. The following
+ * API allows the VM to communicate cache population state to the VFS.
+ *
+ * However, on CONFIG_HIGHMEM we can't wait for the page cache to go
+ * away: cache pages allocated in a large highmem zone could pin
+ * struct inode memory allocated in relatively small lowmem zones. So
+ * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime.
+ */
 
-static void inode_lru_list_del(struct inode *inode)
+#ifndef CONFIG_HIGHMEM
+/**
+ * inode_pages_set - mark the inode as holding page cache
+ * @inode: the inode whose first cache page was just added
+ *
+ * Tell the VFS that this inode has populated page cache and must not
+ * be reclaimed by the inode shrinker.
+ *
+ * The caller must hold the page lock of the just-added page: by
+ * pinning the page, the page cache cannot become depopulated, and we
+ * can safely set I_PAGES without a race check under the i_pages lock.
+ *
+ * This function acquires the i_lock.
+ */
+void inode_pages_set(struct inode *inode)
 {
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_PAGES)) {
+		inode->i_state |= I_PAGES;
+		if (!list_empty(&inode->i_lru)) {
+			count_vm_event(PGINODERESCUE);
+			inode_lru_list_del(inode);
+		}
+	}
+	spin_unlock(&inode->i_lock);
+}
 
-	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
-		this_cpu_dec(nr_unused);
+/**
+ * inode_pages_clear - mark the inode as not holding page cache
+ * @inode: the inode whose last cache page was just removed
+ *
+ * Tell the VFS that the inode no longer holds page cache and that its
+ * lifetime is to be handed over to the inode shrinker LRU.
+ *
+ * This function acquires the i_lock and the i_pages lock.
+ */
+void inode_pages_clear(struct inode *inode)
+{
+	struct address_space *mapping = &inode->i_data;
+	bool add_to_lru = false;
+	unsigned long flags;
+
+	spin_lock(&inode->i_lock);
+
+	xa_lock_irqsave(&mapping->i_pages, flags);
+	if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) {
+		inode->i_state &= ~I_PAGES;
+		add_to_lru = true;
+	}
+	xa_unlock_irqrestore(&mapping->i_pages, flags);
+
+	if (add_to_lru) {
+		WARN_ON_ONCE(!list_empty(&inode->i_lru));
+		if (inode_add_lru(inode))
+			__count_vm_event(PGINODEDELAYED);
+	}
+
+	spin_unlock(&inode->i_lock);
 }
+#endif /* CONFIG_HIGHMEM */
 
 /**
  * inode_sb_list_add - add inode to the superblock list of inodes
@@ -742,6 +817,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+	WARN_ON_ONCE(inode->i_state & I_PAGES);
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
@@ -761,7 +838,18 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_ROTATE;
 	}
 
-	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+	/*
+	 * Usually, populated inodes shouldn't be on the shrinker LRU,
+	 * but they can be briefly visible when a new page is added to
+	 * an inode that was already linked but inode_pages_set()
+	 * hasn't run yet to move them off.
+	 *
+	 * The other exception is on HIGHMEM systems: highmem cache
+	 * can pin lowmem struct inodes, and we might be in dire
+	 * straits in the lower zones. Purge cache to free the inode.
+	 */
+	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
+#ifdef CONFIG_HIGHMEM
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(lru_lock);
@@ -778,6 +866,12 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 		iput(inode);
 		spin_lock(lru_lock);
 		return LRU_RETRY;
+#else
+		list_lru_isolate(lru, &inode->i_lru);
+		spin_unlock(&inode->i_lock);
+		this_cpu_dec(nr_unused);
+		return LRU_REMOVED;
+#endif
 	}
 
 	WARN_ON(inode->i_state & I_NEW);
diff --git a/fs/internal.h b/fs/internal.h
index f3f280b952a3..4a9dc77e817b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -139,7 +139,7 @@ extern int vfs_open(const struct path *, struct file *);
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
-extern void inode_add_lru(struct inode *inode);
+extern bool inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..abdb3fd3432f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -585,6 +585,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
 	atomic_inc(&mapping->i_mmap_writable);
 }
 
+static inline bool mapping_empty(struct address_space *mapping)
+{
+	return mapping->nrpages + mapping->nrexceptional == 0;
+}
+
 /*
  * Use sequence counter to get consistent i_size on 32-bit processors.
  */
@@ -2150,6 +2155,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_CREATING		New object's inode in the middle of setting up.
  *
+ * I_PAGES		Inode is holding page cache that needs to get reclaimed
+ *			first before the inode can go onto the shrinker LRU.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2172,6 +2180,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_WB_SWITCH		(1 << 13)
 #define I_OVL_INUSE		(1 << 14)
 #define I_CREATING		(1 << 15)
+#define I_PAGES			(1 << 16)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
@@ -3097,6 +3106,14 @@ static inline void remove_inode_hash(struct inode *inode)
 		__remove_inode_hash(inode);
 }
 
+#ifndef CONFIG_HIGHMEM
+extern void inode_pages_set(struct inode *inode);
+extern void inode_pages_clear(struct inode *inode);
+#else
+static inline void inode_pages_set(struct inode *inode) {}
+static inline void inode_pages_clear(struct inode *inode) {}
+#endif
+
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ccb14b6a16b5..ae4d90bd0873 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -609,7 +609,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page, void *shadow);
+extern bool __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..f31026ccf590 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
-		PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
+		SLABS_SCANNED,
+		PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..fcc24b3b3540 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,8 +116,8 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
-static void page_cache_delete(struct address_space *mapping,
-				   struct page *page, void *shadow)
+static bool __must_check page_cache_delete(struct address_space *mapping,
+					   struct page *page, void *shadow)
 {
 	XA_STATE(xas, &mapping->i_pages, page->index);
 	unsigned int nr = 1;
@@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping,
 		smp_wmb();
 	}
 	mapping->nrpages -= nr;
+
+	return mapping_empty(mapping);
 }
 
 static void unaccount_page_cache_page(struct address_space *mapping,
@@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping,
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the i_pages lock.
+ *
+ * If this returns true, the caller must call inode_pages_clear()
+ * after dropping the i_pages lock.
  */
-void __delete_from_page_cache(struct page *page, void *shadow)
+bool __must_check __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 
 	unaccount_page_cache_page(mapping, page);
-	page_cache_delete(mapping, page, shadow);
+	return page_cache_delete(mapping, page, shadow);
 }
 
 static void page_cache_free_page(struct address_space *mapping,
@@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
+	bool empty;
 
 	BUG_ON(!PageLocked(page));
 	xa_lock_irqsave(&mapping->i_pages, flags);
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	page_cache_free_page(mapping, page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
@@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache);
  *
  * The function expects the i_pages lock to be held.
  */
-static void page_cache_delete_batch(struct address_space *mapping,
-			     struct pagevec *pvec)
+static bool __must_check page_cache_delete_batch(struct address_space *mapping,
+						 struct pagevec *pvec)
 {
 	XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
 	int total_pages = 0;
@@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping,
 		total_pages++;
 	}
 	mapping->nrpages -= total_pages;
+
+	return mapping_empty(mapping);
 }
 
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec)
 {
 	int i;
+	bool empty;
 	unsigned long flags;
 
 	if (!pagevec_count(pvec))
@@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping,
 
 		unaccount_page_cache_page(mapping, pvec->pages[i]);
 	}
-	page_cache_delete_batch(mapping, pvec);
+	empty = page_cache_delete_batch(mapping, pvec);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	for (i = 0; i < pagevec_count(pvec); i++)
 		page_cache_free_page(mapping, pvec->pages[i]);
 }
@@ -831,9 +846,10 @@ static int __add_to_page_cache_locked(struct page *page,
 				      void **shadowp)
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
+	int error;
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
-	int error;
+	bool populated = false;
 	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -860,6 +876,7 @@ static int __add_to_page_cache_locked(struct page *page,
 		if (xas_error(&xas))
 			goto unlock;
 
+		populated = mapping_empty(mapping);
 		if (xa_is_value(old)) {
 			mapping->nrexceptional--;
 			if (shadowp)
@@ -880,6 +897,10 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
+
+	if (populated)
+		inode_pages_set(mapping->host);
+
 	return 0;
 error:
 	page->mapping = NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..8b3e33a52d93 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2535,7 +2535,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
-			__delete_from_page_cache(head + i, NULL);
+			/* We know we're not removing the last page */
+			(void)__delete_from_page_cache(head + i, NULL);
 			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
 				shmem_uncharge(head->mapping->host, 1);
 			put_page(head + i);
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..8fb6c2f762bc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -31,24 +31,31 @@
  * itself locked.  These unlocked entries need verification under the tree
  * lock.
  */
-static inline void __clear_shadow_entry(struct address_space *mapping,
-				pgoff_t index, void *entry)
+static bool __must_check __clear_shadow_entry(struct address_space *mapping,
+					      pgoff_t index, void *entry)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 
 	xas_set_update(&xas, workingset_update_node);
 	if (xas_load(&xas) != entry)
-		return;
+		return 0;
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+
+	return mapping_empty(mapping);
 }
 
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
 			       void *entry)
 {
+	bool empty;
+
 	xa_lock_irq(&mapping->i_pages);
-	__clear_shadow_entry(mapping, index, entry);
+	empty = __clear_shadow_entry(mapping, index, entry);
 	xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
 }
 
 /*
@@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 				pgoff_t end)
 {
 	int i, j;
-	bool dax, lock;
+	bool dax, lock, empty = false;
 
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
@@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 			continue;
 		}
 
-		__clear_shadow_entry(mapping, index, page);
+		if (__clear_shadow_entry(mapping, index, page))
+			empty = true;
 	}
 
 	if (lock)
 		xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	pvec->nr = j;
 }
 
@@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	pgoff_t		index;
 	int		i;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	/* Offsets within partial pages */
@@ -636,6 +648,7 @@ static int
 invalidate_complete_page2(struct address_space *mapping, struct page *page)
 {
 	unsigned long flags;
+	bool empty;
 
 	if (page->mapping != mapping)
 		return 0;
@@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
 
@@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int ret2 = 0;
 	int did_range_unmap = 0;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	pagevec_init(&pvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05eb9efec07..c82e9831003f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
+		int empty;
 
 		freepage = mapping->a_ops->freepage;
 		/*
@@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		if (reclaimed && page_is_file_cache(page) &&
 		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(page, target_memcg);
-		__delete_from_page_cache(page, shadow);
+		empty = __delete_from_page_cache(page, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+		if (empty)
+			inode_pages_clear(mapping->host);
+
 		if (freepage != NULL)
 			freepage(page);
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..ae802253f71c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1203,9 +1203,11 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
 #endif
-	"pginodesteal",
 	"slabs_scanned",
+	"pginodesteal",
 	"kswapd_inodesteal",
+	"pginoderescue",
+	"pginodedelayed",
 	"kswapd_low_wmark_hit_quickly",
 	"kswapd_high_wmark_hit_quickly",
 	"pageoutrun",
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b76ced..7ce9c74ebfdb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	struct xa_node *node = container_of(item, struct xa_node, private_list);
 	XA_STATE(xas, node->array, 0);
 	struct address_space *mapping;
+	bool empty = false;
 	int ret;
 
 	/*
@@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	if (WARN_ON_ONCE(node->count != node->nr_values))
 		goto out_invalid;
 	mapping->nrexceptional -= node->nr_values;
+	empty = mapping_empty(mapping);
 	xas.xa_node = xa_parent_locked(&mapping->i_pages, node);
 	xas.xa_offset = node->offset;
 	xas.xa_shift = node->shift + XA_CHUNK_SHIFT;
@@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();
-- 
2.24.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-13 13:46       ` Johannes Weiner
@ 2020-02-14  2:02         ` Yafang Shao
  0 siblings, 0 replies; 35+ messages in thread
From: Yafang Shao @ 2020-02-14  2:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-fsdevel, Linux MM, LKML, Dave Chinner, Michal Hocko,
	Roman Gushchin, Andrew Morton, Linus Torvalds, Al Viro,
	Kernel Team

On Thu, Feb 13, 2020 at 9:46 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Feb 13, 2020 at 09:47:29AM +0800, Yafang Shao wrote:
> > On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote:
> > > > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > Another variant of this problem was recently observed, where the
> > > > > kernel violates cgroups' memory.low protection settings and reclaims
> > > > > page cache way beyond the configured thresholds. It was followed by a
> > > > > proposal of a modified form of the reverted commit above, that
> > > > > implements memory.low-sensitive shrinker skipping over populated
> > > > > inodes on the LRU [1]. However, this proposal continues to run the
> > > > > risk of attracting disproportionate reclaim pressure to a pool of
> > > > > still-used inodes,
> > > >
> > > > Hi Johannes,
> > > >
> > > > If you really think that is a risk, what about the below additional
> > > > patch to fix this risk?
> > > >
> > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > index 80dddbc..61862d9 100644
> > > > --- a/fs/inode.c
> > > > +++ b/fs/inode.c
> > > > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
> > > >                 goto out;
> > > >
> > > >         cgroup_size = mem_cgroup_size(memcg);
> > > > -       if (inode->i_data.nrpages + protection >= cgroup_size)
> > > > +       if (inode->i_data.nrpages)
> > > >                 reclaimable = false;
> > > >
> > > >  out:
> > > >
> > > > With this additional patch, we skip all inodes in this memcg until all
> > > > its page cache pages are reclaimed.
> > >
> > > Well that's something we've tried and had to revert because it caused
> > > issues in slab reclaim. See the History part of my changelog.
> >
> > You misunderstood it.
> > The reverted patch skips all inodes in the system, while this patch
> > only works when you turn on memcg.{min, low} protection.
> > IOW, that is not the default behavior; it only works when you want
> > it and only affects your targeted memcg rather than the whole system.
>
> I understand perfectly well.
>
> Keeping unreclaimable inodes on the shrinker LRU causes the shrinker
> to build up excessive pressure on all VFS objects. This is a
> bug. Making it cgroup-specific doesn't make it less of a bug, it just
> means you only hit the bug when you use cgroup memory protection.
>

What I mean to fix is really a cgroup-specific issue, which may be
different from the one you're meaning to fix.
(I will explain below.)
Considering the excessive pressure the protected inodes may put on
the shrinker, the protected page cache pages put much more
pressure on the reclaimer. If you mean to remove the protected inodes
from the shrinker LRU, why not remove the protected page cache pages
from the page cache LRU as well? What I really mean is, that is how
the memcg protection works.

> > > > > while not addressing the more generic reclaim
> > > > > inversion problem outside of a very specific cgroup application.
> > > > >
> > > >
> > > > But I have a different understanding.  This method works like a
> > > > knob. If you really care about your workingset (data), you should
> > > > turn it on (i.e. by using memcg protection to protect them), while
> > > > if you don't care about your workingset (data) then you'd better
> > > > turn it off. That would be more flexible.  Regaring your case in the
> > > > commit log, why not protect your linux git tree with memcg
> > > > protection ?
> > >
> > > I can't imagine a scenario where I *wouldn't* care about my
> > > workingset, though. Why should it be opt-in, not the default?
> >
> > Because the default behavior has caused the XFS performance hit.
>
> That means that with your proposal you cannot use cgroup memory
> protection for workloads that run on xfs.
>

Well, if you set memory.min to protect your workload inside a specific
memcg, it means that you already know this memory can't be used by
workloads outside the memcg. That means the performance of the
workloads outside the memcg may not be as good as before. Then you
should adjust your SLA, migrate this protected memcg to another host,
or just kill this protected memcg.
IOW, the result is *expected*.

> (And if I remember the bug report correctly, this wasn't just xfs. It
> also caused metadata caches on other filesystems to get trashed. xfs
> was just more pronounced because it does sync inode flushing from the
> shrinker, adding write stalls to the mix of metadata cache misses.)
>
> What I'm proposing is an implementation that protects hot page cache
> without causing excessive shrinker pressure and rotations.

That's the difference between your issue and mine.
You're trying to fix an issue around hot page cache, but what I
want to fix may be cold page cache, and it really is a memcg-protection-
specific issue.
Because memcg protection can protect all page cache pages, even if
the page cache pages are cold and the inodes are cold (at the tail of
the list LRU) as well.  That is one of the reasons why memcg
protection exists. (I know you are the author of memcg protection, but
I have to clarify what it is.)

Regarding your issue around the hot page cache pages, I have another
question. If the page cache pages are hot, why are the inodes of these
page cache pages cold (at the tail of the list LRU)?  Per my
understanding, if the page cache pages are hot, their inodes should
be hot (not at the tail of the list LRU) as well. That should
be how the LRU works.

Well, that doesn't mean I object to your patch.  What I really want to
clarify is that our issues are really different.

Thanks
Yafang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 17:55 [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
                   ` (3 preceding siblings ...)
  2020-02-13 18:34 ` [PATCH v2] " Johannes Weiner
@ 2020-02-14 16:53 ` " kbuild test robot
  2020-02-14 21:30 ` kbuild test robot
  2020-02-14 21:30 ` [PATCH] vfs: fix boolreturn.cocci warnings kbuild test robot
  6 siblings, 0 replies; 35+ messages in thread
From: kbuild test robot @ 2020-02-14 16:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, linux-fsdevel, linux-mm, linux-kernel, Dave Chinner,
	Yafang Shao, Michal Hocko, Roman Gushchin, Andrew Morton,
	Linus Torvalds, Al Viro, kernel-team

[-- Attachment #1: Type: text/plain, Size: 6051 bytes --]

Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on vfs/for-next]
[also build test ERROR on linux/master linus/master v5.6-rc1 next-20200213]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-LRU/20200214-083756
base:   https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-next
config: m68k-allmodconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   fs/dax.c: In function 'grab_mapping_entry':
>> fs/dax.c:556:28: error: 'struct address_space' has no member named 'inode'
      inode_pages_clear(mapping->inode);
                               ^~
   fs/dax.c:558:26: error: 'struct address_space' has no member named 'inode'
      inode_pages_set(mapping->inode);
                             ^~

vim +556 fs/dax.c

   446	
   447	/*
   448	 * Find page cache entry at given index. If it is a DAX entry, return it
   449	 * with the entry locked. If the page cache doesn't contain an entry at
   450	 * that index, add a locked empty entry.
   451	 *
   452	 * When requesting an entry with size DAX_PMD, grab_mapping_entry() will
   453	 * either return that locked entry or will return VM_FAULT_FALLBACK.
   454	 * This will happen if there are any PTE entries within the PMD range
   455	 * that we are requesting.
   456	 *
   457	 * We always favor PTE entries over PMD entries. There isn't a flow where we
   458	 * evict PTE entries in order to 'upgrade' them to a PMD entry.  A PMD
   459	 * insertion will fail if it finds any PTE entries already in the tree, and a
   460	 * PTE insertion will cause an existing PMD entry to be unmapped and
   461	 * downgraded to PTE entries.  This happens for both PMD zero pages as
   462	 * well as PMD empty entries.
   463	 *
   464	 * The exception to this downgrade path is for PMD entries that have
   465	 * real storage backing them.  We will leave these real PMD entries in
   466	 * the tree, and PTE writes will simply dirty the entire PMD entry.
   467	 *
   468	 * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
   469	 * persistent memory the benefit is doubtful. We can add that later if we can
   470	 * show it helps.
   471	 *
   472	 * On error, this function does not return an ERR_PTR.  Instead it returns
   473	 * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
   474	 * overlap with xarray value entries.
   475	 */
   476	static void *grab_mapping_entry(struct xa_state *xas,
   477			struct address_space *mapping, unsigned int order)
   478	{
   479		unsigned long index = xas->xa_index;
   480		bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
   481		int populated;
   482		void *entry;
   483	
   484	retry:
   485		populated = 0;
   486		xas_lock_irq(xas);
   487		entry = get_unlocked_entry(xas, order);
   488	
   489		if (entry) {
   490			if (dax_is_conflict(entry))
   491				goto fallback;
   492			if (!xa_is_value(entry)) {
   493				xas_set_err(xas, EIO);
   494				goto out_unlock;
   495			}
   496	
   497			if (order == 0) {
   498				if (dax_is_pmd_entry(entry) &&
   499				    (dax_is_zero_entry(entry) ||
   500				     dax_is_empty_entry(entry))) {
   501					pmd_downgrade = true;
   502				}
   503			}
   504		}
   505	
   506		if (pmd_downgrade) {
   507			/*
   508			 * Make sure 'entry' remains valid while we drop
   509			 * the i_pages lock.
   510			 */
   511			dax_lock_entry(xas, entry);
   512	
   513			/*
   514			 * Besides huge zero pages the only other thing that gets
   515			 * downgraded are empty entries which don't need to be
   516			 * unmapped.
   517			 */
   518			if (dax_is_zero_entry(entry)) {
   519				xas_unlock_irq(xas);
   520				unmap_mapping_pages(mapping,
   521						xas->xa_index & ~PG_PMD_COLOUR,
   522						PG_PMD_NR, false);
   523				xas_reset(xas);
   524				xas_lock_irq(xas);
   525			}
   526	
   527			dax_disassociate_entry(entry, mapping, false);
   528			xas_store(xas, NULL);	/* undo the PMD join */
   529			dax_wake_entry(xas, entry, true);
   530			mapping->nrexceptional--;
   531			if (mapping_empty(mapping))
   532				populated = -1;
   533			entry = NULL;
   534			xas_set(xas, index);
   535		}
   536	
   537		if (entry) {
   538			dax_lock_entry(xas, entry);
   539		} else {
   540			unsigned long flags = DAX_EMPTY;
   541	
   542			if (order > 0)
   543				flags |= DAX_PMD;
   544			entry = dax_make_entry(pfn_to_pfn_t(0), flags);
   545			dax_lock_entry(xas, entry);
   546			if (xas_error(xas))
   547				goto out_unlock;
   548			if (mapping_empty(mapping))
   549				populated++;
   550			mapping->nrexceptional++;
   551		}
   552	
   553	out_unlock:
   554		xas_unlock_irq(xas);
   555		if (populated == -1)
 > 556			inode_pages_clear(mapping->inode);
   557		else if (populated == 1)
   558			inode_pages_set(mapping->inode);
   559		if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
   560			goto retry;
   561		if (xas->xa_node == XA_ERROR(-ENOMEM))
   562			return xa_mk_internal(VM_FAULT_OOM);
   563		if (xas_error(xas))
   564			return xa_mk_internal(VM_FAULT_SIGBUS);
   565		return entry;
   566	fallback:
   567		xas_unlock_irq(xas);
   568		return xa_mk_internal(VM_FAULT_FALLBACK);
   569	}
   570	

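The failing lines reference a `struct address_space` member that does not exist; everywhere else in the patch the owning inode is reached via `mapping->host`. The fix would presumably be (hunk header illustrative):

```diff
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -553,9 +553,9 @@ out_unlock:
 	xas_unlock_irq(xas);
 	if (populated == -1)
-		inode_pages_clear(mapping->inode);
+		inode_pages_clear(mapping->host);
 	else if (populated == 1)
-		inode_pages_set(mapping->inode);
+		inode_pages_set(mapping->host);
```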
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 51877 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-11 17:55 [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
                   ` (4 preceding siblings ...)
  2020-02-14 16:53 ` [PATCH] " kbuild test robot
@ 2020-02-14 21:30 ` kbuild test robot
  2020-02-14 21:30 ` [PATCH] vfs: fix boolreturn.cocci warnings kbuild test robot
  6 siblings, 0 replies; 35+ messages in thread
From: kbuild test robot @ 2020-02-14 21:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, linux-fsdevel, linux-mm, linux-kernel, Dave Chinner,
	Yafang Shao, Michal Hocko, Roman Gushchin, Andrew Morton,
	Linus Torvalds, Al Viro, kernel-team

Hi Johannes,

I love your patch! Perhaps something to improve:

[auto build test WARNING on vfs/for-next]
[also build test WARNING on linux/master linus/master v5.6-rc1 next-20200214]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-LRU/20200214-083756
base:   https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-next

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>


coccinelle warnings: (new ones prefixed by >>)

>> mm/truncate.c:41:9-10: WARNING: return of 0/1 in function '__clear_shadow_entry' with return type bool

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] vfs: fix boolreturn.cocci warnings
  2020-02-11 17:55 [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
                   ` (5 preceding siblings ...)
  2020-02-14 21:30 ` kbuild test robot
@ 2020-02-14 21:30 ` kbuild test robot
  6 siblings, 0 replies; 35+ messages in thread
From: kbuild test robot @ 2020-02-14 21:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, linux-fsdevel, linux-mm, linux-kernel, Dave Chinner,
	Yafang Shao, Michal Hocko, Roman Gushchin, Andrew Morton,
	Linus Torvalds, Al Viro, kernel-team

From: kbuild test robot <lkp@intel.com>

mm/truncate.c:41:9-10: WARNING: return of 0/1 in function '__clear_shadow_entry' with return type bool

 Return statements in functions returning bool should use
 true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

Fixes: 98e2565539a0 ("vfs: keep inodes with page cache off the inode shrinker LRU")
CC: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: kbuild test robot <lkp@intel.com>
---

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-LRU/20200214-083756
base:   https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-next

 truncate.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -38,7 +38,7 @@ static bool __must_check __clear_shadow_
 
 	xas_set_update(&xas, workingset_update_node);
 	if (xas_load(&xas) != entry)
-		return 0;
+		return false;
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-13 16:52               ` Arnd Bergmann
@ 2020-02-15 11:25                 ` Geert Uytterhoeven
  2020-02-15 16:59                   ` Arnd Bergmann
  2020-02-26 18:04                 ` santosh.shilimkar
  1 sibling, 1 reply; 35+ messages in thread
From: Geert Uytterhoeven @ 2020-02-15 11:25 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Russell King - ARM Linux admin, Linus Torvalds, Michal Hocko,
	Rik van Riel, Catalin Marinas, kernel-team, Dave Chinner,
	Linux Kernel Mailing List, Linux-MM, Yafang Shao, Al Viro,
	Johannes Weiner, linux-fsdevel, Andrew Morton, Roman Gushchin,
	Linux ARM, Kishon Vijay Abraham I, Santosh Shilimkar,
	Chris Paterson, cip-dev

Hi Arnd,

On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote:
> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
> > > So at least my gut feel is that the arm people don't have any big
> > > reason to push for maintaining HIGHMEM support either.
> > >
> > > But I'm adding a couple of arm people and the arm list just in case
> > > they have some input.
> > >
> > > [ Obvious background for newly added people: we're talking about
> > > making CONFIG_HIGHMEM a deprecated feature and saying that if you want
> > > to run with lots of memory on a 32-bit kernel, you're doing legacy
> > > stuff and can use a legacy kernel ]
> >
> > Well, the recent 32-bit ARM systems generally have more than 1G
> > of memory, so make use of highmem as a rule.  You're probably
> > talking about crippling support for any 32-bit ARM system produced
> > in the last 8 to 10 years.
>
> What I'm observing in the newly added board support is that memory
> configurations are actually going down, driven by component cost.
> 512MB is really cheap (~$4) these days with a single 256Mx16 DDR3
> chip or two 128Mx16. Going beyond 1GB is where things get expensive
> with either 4+ chips or LPDDR3/LPDDR4 memory.
>
> For designs with 1GB, we're probably better off just using
> CONFIG_VMSPLIT_3G_OPT (without LPAE) anyway, completely
> avoiding highmem. That is particularly true on systems with a custom
> kernel configuration.
>
> 2GB machines are less common, but are definitely important, e.g.
> MT6580 based Android phones and some industrial embedded machines
> that will live a long time. I've recently seen reports of odd behavior
> with CONFIG_VMSPLIT_2G plus CONFIG_HIGHMEM and a 7:1
> ratio of lowmem to highmem that apparently causes OOM despite lots
> of lowmem being free. I suspect a lot of those workloads would still be
> better off with a CONFIG_VMSPLIT_2G_OPT (1.75 GB user, 2GB
> linear map). That config unfortunately has a few problems, too:
> - nobody has implemented it
> - it won't work with LPAE and therefore cannot support hardware
>   that relies on high physical addresses for RAM or MMIO
>   (those could run CONFIG_VMSPLIT_2G at the cost of wasting
>   12.5% of RAM).
> - any workload that requires the full 3GB of virtual address space won't
>   work at all. This might be e.g. MAP_FIXED users, or build servers
>   linking large binaries.
> It will take a while to find out what kinds of workloads suffer the most
> from a different vmsplit and what can be done to address that, but we
> could start by changing the kernel defconfig and distro builds to see
> who complains ;-)
>
> I think 32-bit ARM machines with 3GB or more are getting very rare,
> but some still exist:
> - The Armada XP development board had a DIMM slot that could take
>   large memory (possibly up to 8GB with LPAE). This never shipped as
>   a commercial product, but distro build servers sometimes still run on
>   this, or on the old Calxeda or Keystone server systems.
> - a few early i.MX6 boards (e.g. HummingBoard) came with 4GB of
>   RAM, though none of these seem to be available any more.
> - High-end phones from 2013/2014 had 3GB LPDDR3 before getting
>   obsoleted by 64-bit phones. Presumably none of these ever ran
>   Linux-4.x or newer.
> - My main laptop is a RK3288 based Chromebook with 4GB that just
>   got updated to linux-4.19 by Google. Official updates apparently
>   stop this summer, but it could easily run Debian later on.
> - Some people run 32-bit kernels on a 64-bit Raspberry Pi 4 or on
>   arm64 KVM with lots of RAM. These should probably all
>   migrate to 64-bit kernels with compat user space anyway.
> In theory these could also run on a VMSPLIT_4G_4G-like setup,
> but I don't think anyone wants to go there. Deprecating highmem
> definitely impacts any such users significantly, though staying on
> an LTS kernel may be an option if there are only few of them.

The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
for 1 GiB or 2 GiB configurations) in two parts, one below and one above
the 32-bit physical limit.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-15 11:25                 ` Geert Uytterhoeven
@ 2020-02-15 16:59                   ` Arnd Bergmann
  2020-02-16  9:44                     ` Geert Uytterhoeven
  0 siblings, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2020-02-15 16:59 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Russell King - ARM Linux admin, Linus Torvalds, Michal Hocko,
	Rik van Riel, Catalin Marinas, kernel-team, Dave Chinner,
	Linux Kernel Mailing List, Linux-MM, Yafang Shao, Al Viro,
	Johannes Weiner, linux-fsdevel, Andrew Morton, Roman Gushchin,
	Linux ARM, Kishon Vijay Abraham I, Santosh Shilimkar,
	Chris Paterson, cip-dev

On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
>
> The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
> for 1 GiB or 2 GiB configurations) in two parts, one below and one above
> the 32-bit physical limit.

Good to know. I think there are several other chips that have dual-channel
DDR3 and thus /can/ support this configuration, but this rarely happens.
Are you aware of commercial products that use a 4GB configuration, aside from
the reference board?

For TI AM54x, there is apparently a variant of the DragonBox Pyra with 4G,
which is said to be shipping in the near future, see
https://en.wikipedia.org/wiki/DragonBox_Pyra

     Arnd


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-15 16:59                   ` Arnd Bergmann
@ 2020-02-16  9:44                     ` Geert Uytterhoeven
  2020-02-16 19:54                       ` Chris Paterson
  0 siblings, 1 reply; 35+ messages in thread
From: Geert Uytterhoeven @ 2020-02-16  9:44 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Russell King - ARM Linux admin, Linus Torvalds, Michal Hocko,
	Rik van Riel, Catalin Marinas, kernel-team, Dave Chinner,
	Linux Kernel Mailing List, Linux-MM, Yafang Shao, Al Viro,
	Johannes Weiner, linux-fsdevel, Andrew Morton, Roman Gushchin,
	Linux ARM, Kishon Vijay Abraham I, Santosh Shilimkar,
	Chris Paterson, cip-dev

Hi Arnd,

On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
> On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
> <geert@linux-m68k.org> wrote:
> > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > <linux@armlinux.org.uk> wrote:
> >
> > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
> > for 1 GiB or 2 GiB configurations) in two parts, one below and one above
> > the 32-bit physical limit.
>
> Good to know. I think there are several other chips that have dual-channel
> DDR3 and thus /can/ support this configuration, but this rarely happens.
> Are you aware of commercial products that use a 4GB configuration, aside from
> the reference board?

Unfortunately I don't know.
Chris Paterson might know.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* RE: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-16  9:44                     ` Geert Uytterhoeven
@ 2020-02-16 19:54                       ` Chris Paterson
  2020-02-16 20:38                         ` Arnd Bergmann
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Paterson @ 2020-02-16 19:54 UTC (permalink / raw)
  To: Geert Uytterhoeven, Arnd Bergmann
  Cc: Russell King - ARM Linux admin, Linus Torvalds, Michal Hocko,
	Rik van Riel, Catalin Marinas, kernel-team, Dave Chinner,
	Linux Kernel Mailing List, Linux-MM, Yafang Shao, Al Viro,
	Johannes Weiner, linux-fsdevel, Andrew Morton, Roman Gushchin,
	Linux ARM, Kishon Vijay Abraham I, Santosh Shilimkar, cip-dev

Hello Arnd, Geert,

> From: Geert Uytterhoeven <geert@linux-m68k.org>
> Sent: 16 February 2020 09:45
> To: Arnd Bergmann <arnd@arndb.de>
> 
> Hi Arnd,
> 
> On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
> > <geert@linux-m68k.org> wrote:
> > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de>
> wrote:
> > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > > <linux@armlinux.org.uk> wrote:
> > >
> > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
> > > for 1 GiB or 2 GiB configurations) in two parts, one below and one above
> > > the 32-bit physical limit.

Yep. One example is r8a7743-iwg20m.dtsi.

> >
> > Good to know. I think there are several other chips that have dual-channel
> > DDR3 and thus /can/ support this configuration, but this rarely happens.
> > Are you aware of commercial products that use a 4GB configuration, aside
> from
> > the reference board?

iWave Systems make a range of SOM modules using the RZ/G1 SoCs.
I believe there are options for some of these to use 4 GB, although 1 or 2 GB is used in the boards we've upstreamed support for.

There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1, but I'm not sure of the details.

Kind regards, Chris

> 
> Unfortunately I don't know.
> Chris Paterson might know.
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-
> m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-16 19:54                       ` Chris Paterson
@ 2020-02-16 20:38                         ` Arnd Bergmann
  2020-02-20 14:35                           ` Chris Paterson
  0 siblings, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2020-02-16 20:38 UTC (permalink / raw)
  To: Chris Paterson
  Cc: Geert Uytterhoeven, Russell King - ARM Linux admin,
	Linus Torvalds, Michal Hocko, Rik van Riel, Catalin Marinas,
	kernel-team, Dave Chinner, Linux Kernel Mailing List, Linux-MM,
	Yafang Shao, Al Viro, Johannes Weiner, linux-fsdevel,
	Andrew Morton, Roman Gushchin, Linux ARM, Kishon Vijay Abraham I,
	Santosh Shilimkar, cip-dev

On Sun, Feb 16, 2020 at 8:54 PM Chris Paterson
<Chris.Paterson2@renesas.com> wrote:
>
> Hello Arnd, Geert,
>
> > From: Geert Uytterhoeven <geert@linux-m68k.org>
> > Sent: 16 February 2020 09:45
> > To: Arnd Bergmann <arnd@arndb.de>
> >
> > Hi Arnd,
> >
> > On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
> > > <geert@linux-m68k.org> wrote:
> > > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de>
> > wrote:
> > > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
> > > > for 1 GiB or 2 GiB configurations) in two parts, one below and one above
> > > > the 32-bit physical limit.
>
> Yep. One example is r8a7743-iwg20m.dtsi.

This one has 2x512MB, with half above the 4GiB limit. This means it needs
LPAE to address high physical addresses (which is fine), but it does not need
highmem if one uses an appropriate CONFIG_VMSPLIT_* option.

> > > Good to know. I think there are several other chips that have dual-channel
> > > DDR3 and thus /can/ support this configuration, but this rarely happens.
> > > Are you aware of commercial products that use a 4GB configuration, aside
> > from
> > > the reference board?
>
> iWave Systems make a range of SOM modules using the RZ/G1 SoCs.
> I believe there are options for some of these to use 4 GB, although 1 or 2 GB is
> used in the boards we've upstreamed support for.
>
> There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1,
> but I'm not sure of the details.

Both iWave and Emtrion only seem to list boards with 2GB or less on their
websites today (with up to 15 year availability). My guess is that they had
the same problem as everyone else in finding the right memory chips in
the required quantities and/or long-term availability. iWave lists "By default
1GB DDR3 and 4GB eMMC only supported. Contact iWave for memory
expansion support." on some boards, but that doesn't mean they ever
shipped a 4GB configuration.

       Arnd
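The LPAE point above reduces to physical address width: classic ARM short page-table descriptors carry 32-bit physical addresses, while LPAE long descriptors extend that to 40 bits. A tiny sketch (the bank address is hypothetical, not the actual r8a7743 memory map):

```python
PHYS_LIMIT_SHORT = 1 << 32  # 4GiB: classic short-descriptor limit
PHYS_LIMIT_LPAE = 1 << 40   # 1TiB: LPAE long-descriptor limit

bank_above_4g = 0x2_0000_0000  # hypothetical 512MB RAM bank above 4GiB

# Unreachable with short descriptors, but well within the LPAE range.
assert bank_above_4g >= PHYS_LIMIT_SHORT
assert bank_above_4g + (512 << 20) <= PHYS_LIMIT_LPAE
```

Highmem, by contrast, is about the 4GB of *virtual* space, which LPAE does not enlarge; that is why a board like this needs LPAE but not necessarily CONFIG_HIGHMEM.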


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-12  0:28       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2020-02-12  8:09         ` Michal Hocko
@ 2020-02-17 13:31         ` Pavel Machek
  3 siblings, 0 replies; 35+ messages in thread
From: Pavel Machek @ 2020-02-17 13:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, linux-fsdevel,
	Linux-MM, Linux Kernel Mailing List, Dave Chinner, Yafang Shao,
	Michal Hocko, Roman Gushchin, Al Viro, kernel-team


Hi!

> > Testing this will be a challenge, but the issue was real - a 7GB
> > highmem machine isn't crazy and I expect the inode has become larger
> > since those days.
> 
> Hmm. I would say that in the intervening years a 7GB highmem machine has
> indeed become crazy.
> 
> It used to be something we kind of supported.
> 
> But we really should consider HIGHMEM to be something that is on the
> deprecation list. In this day and age, there is no excuse for running
> a 32-bit kernel with lots of physical memory.
> 
> And if you really want to do that, and have some legacy hardware with
> a legacy use case, maybe you should be using a legacy kernel.
> 
> I'd personally be perfectly happy to start removing HIGHMEM support again.

A 7GB HIGHMEM machine may be unusual, but AFAICT HIGHMEM is needed for
configurations like a 1GB x86 machine, and definitely for a 3GB x86
machine.

32-bit machines with 1.5 to 4GB of RAM are still pretty common, and
dropping HIGHMEM support will limit them to 0.8GB of RAM and probably
make them unusable even for simple web browsing. I have two such
machines here, please don't break them :-).

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

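Pavel's 0.8GB figure follows from the vmsplit arithmetic discussed earlier in the thread; a rough sketch (assuming the common 128MiB vmalloc reserve, and ignoring the fixmap/pkmap holes that shave off a bit more):

```python
GIB = 1 << 30
MIB = 1 << 20

def lowmem_estimate(kernel_space_gib, vmalloc_reserve_mib=128):
    """Rough lowmem ceiling: the kernel's share of the 4GiB virtual
    address space minus the vmalloc reserve."""
    return kernel_space_gib * GIB - vmalloc_reserve_mib * MIB

# CONFIG_VMSPLIT_3G (3G user / 1G kernel): the classic ~896MiB lowmem limit
print(lowmem_estimate(1) // MIB)  # prints: 896
# CONFIG_VMSPLIT_2G (2G user / 2G kernel): roughly 1920MiB of lowmem
print(lowmem_estimate(2) // MIB)  # prints: 1920
```

Without HIGHMEM, RAM beyond that estimate is simply not addressable by the kernel, which is why the 1GB and 3GB machines mentioned above still need it under the default 3G/1G split.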


* RE: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-16 20:38                         ` Arnd Bergmann
@ 2020-02-20 14:35                           ` Chris Paterson
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Paterson @ 2020-02-20 14:35 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Geert Uytterhoeven, Russell King - ARM Linux admin,
	Linus Torvalds, Michal Hocko, Rik van Riel, Catalin Marinas,
	kernel-team, Dave Chinner, Linux Kernel Mailing List, Linux-MM,
	Yafang Shao, Al Viro, Johannes Weiner, linux-fsdevel,
	Andrew Morton, Roman Gushchin, Linux ARM, Kishon Vijay Abraham I,
	Santosh Shilimkar, cip-dev

Hello,

> From: Arnd Bergmann <arnd@arndb.de>
> Sent: 16 February 2020 20:38
> 
> On Sun, Feb 16, 2020 at 8:54 PM Chris Paterson
> <Chris.Paterson2@renesas.com> wrote:
> >
> > Hello Arnd, Geert,
> >
> > > From: Geert Uytterhoeven <geert@linux-m68k.org>
> > > Sent: 16 February 2020 09:45
> > > To: Arnd Bergmann <arnd@arndb.de>
> > >
> > > Hi Arnd,
> > >
> > > On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
> > > > <geert@linux-m68k.org> wrote:
> > > > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de>
> > > wrote:
> > > > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > > > > <linux@armlinux.org.uk> wrote:
> > > > >
> > > > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split
> (even
> > > > > for 1 GiB or 2 GiB configurations) in two parts, one below and one
> above
> > > > > the 32-bit physical limit.
> >
> > Yep. One example is r8a7743-iwg20m.dtsi.
> 
> This one has 2x512MB, with half above the 4GiB limit. This means it needs
> LPAE to address high physical addresses (which is fine), but it does not need
> highmem if one uses an appropriate CONFIG_VMSPLIT_* option.
> 
> > > > Good to know. I think there are several other chips that have dual-
> channel
> > > > DDR3 and thus /can/ support this configuration, but this rarely happens.
> > > > Are you aware of commercial products that use a 4GB configuration,
> aside
> > > from
> > > > the reference board?
> >
> > iWave Systems make a range of SOM modules using the RZ/G1 SoCs.
> > I believe there are options for some of these to use 4 GB, although 1 or 2
> GB is
> > used in the boards we've upstreamed support for.
> >
> > There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1,
> > but I'm not sure of the details.
> 
> Both iWave and Emtrion only seem to list boards with 2GB or less on their
> websites today (with up to 15 year availability). My guess is that they had
> the same problem as everyone else in finding the right memory chips in
> the required quantities and/or long-term availability. iWave lists "By default
> 1GB DDR3 and 4GB eMMC only supported. Contact iWave for memory
> expansion support." on some boards, but that doesn't mean they ever
> shipped a 4GB configuration.

I probably should have been clearer before - I actually have a couple of iWave RZ/G1M SOM modules with 4GB DDR on them.
However I've never booted them nor do I know what the memory mapping is.

Kind regards, Chris

> 
>        Arnd


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-13 16:52               ` Arnd Bergmann
  2020-02-15 11:25                 ` Geert Uytterhoeven
@ 2020-02-26 18:04                 ` santosh.shilimkar
  2020-02-26 21:01                   ` Arnd Bergmann
  1 sibling, 1 reply; 35+ messages in thread
From: santosh.shilimkar @ 2020-02-26 18:04 UTC (permalink / raw)
  To: Arnd Bergmann, Russell King - ARM Linux admin
  Cc: Linus Torvalds, Michal Hocko, Rik van Riel, Catalin Marinas,
	kernel-team, Dave Chinner, Linux Kernel Mailing List, Linux-MM,
	Yafang Shao, Al Viro, Johannes Weiner, linux-fsdevel,
	Andrew Morton, Roman Gushchin, Linux ARM, Kishon Vijay Abraham I,
	Santosh Shilimkar

On 2/13/20 8:52 AM, Arnd Bergmann wrote:
> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
>>
>> On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
>>> On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>>
>>>> What's the situation with highmem on ARM?
>>>
>>> Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
>>> needed it, and I was ranting at some people for repeating all the
>>> mistakes Intel did.
>>>
>>> But arm64 doesn't need it, and while 32-bit arm is obviously still
>>> selling, I think that in many ways the switch-over to 64-bit has been
>>> quicker on ARM than it was on x86. Partly because it happened later
>>> (so all the 64-bit teething pains were dealt with), but largely
>>> because everybody ended up actively discouraging 32-bit on the Android
>>> side.
>>>
>>> There were a couple of unfortunate early 32-bit arm server attempts,
>>> but they were - predictably - complete garbage and nobody bought them.
>>> They don't exist any more.
> 
> I'd generally agree with that, the systems with more than 4GB tended to
> be high-end systems predating the Cortex-A53/A57 that quickly got
> replaced once there were actual 64-bit parts, this would include axm5516
> (replaced with x86-64 cores after sale to Intel), hip04 (replaced
> with arm64), or ecx-2000 (Calxeda bankruptcy).
> 
> The one 32-bit SoC that I can think of that can actually drive lots of
> RAM and is still actively marketed is TI Keystone-2/AM5K2.
> The embedded AM5K2 is listed as supporting up to 8GB of RAM, but
> the version in the HPE ProLiant m800 server could take up to 32GB (!).
> 
> I added Santosh and Kishon to Cc, they can probably comment on how
> long they think users will upgrade kernels on these. I suspect these
> devices can live for a very long time in things like wireless base stations,
> but it's possible that they all run on old kernels anyway by now (and are
> not worried about y2038).
> 
>>> So at least my gut feel is that the arm people don't have any big
>>> reason to push for maintaining HIGHMEM support either.
>>>
>>> But I'm adding a couple of arm people and the arm list just in case
>>> they have some input.
The Keystone generations of SoCs have been used in different areas, and
they will remain in use for a long time unless the market says otherwise.

Apart from just the split of lowmem and highmem, one of the peculiar
things about the Keystone family of SoCs is that the DDR is addressable
from two address ranges. The lower address range is actually the
non-cached range, and the higher range is the cacheable one.

So apart from LPAE, another change I needed to make back then was to
boot Linux from the lower range with bootstrap MMU tables and then
re-create the MMU tables early (at MMU init) to use the higher range
for the entire memory so that the L3 cache can be used.

AFAIK, all the SoCs derived from the Keystone architecture inherently
use this feature.

Regards,
Santosh


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-26 18:04                 ` santosh.shilimkar
@ 2020-02-26 21:01                   ` Arnd Bergmann
  2020-02-26 21:11                     ` santosh.shilimkar
  0 siblings, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2020-02-26 21:01 UTC (permalink / raw)
  To: Santosh Shilimkar
  Cc: Russell King - ARM Linux admin, Linus Torvalds, Michal Hocko,
	Rik van Riel, Catalin Marinas, kernel-team, Dave Chinner,
	Linux Kernel Mailing List, Linux-MM, Yafang Shao, Al Viro,
	Johannes Weiner, linux-fsdevel, Andrew Morton, Roman Gushchin,
	Linux ARM, Kishon Vijay Abraham I, Santosh Shilimkar

On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote:
>
> On 2/13/20 8:52 AM, Arnd Bergmann wrote:
> > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
>
> The Keystone generations of SOCs have been used in different areas and
> they will be used for long unless says otherwise.
>
> Apart from just split of lowmem and highmem, one of the peculiar thing
> with the Keystone family of SOCs is the DDR is addressable from two
> addressing ranges. The lowmem address range is actually non-cached
> range and the higher range is the cacheable.

I'm aware of Keystone's special physical memory layout, but this is
actually irrelevant for the discussion about highmem here, which is
only about the way we map all or part of the available physical
memory into the 4GB of virtual address space.

The far more important question is how much memory any users
(in particular the subset that are going to update their kernels
several years from now) actually have installed. Keystone-II is
one of the rare 32-bit chips with fairly wide memory interfaces,
having two 72-bit (with ECC) channels rather than the usual one
or two channels of 32-bit DDR3. This means a relatively cheap
4GB configuration using eight 256Mx16 chips is possible, or
even an 8GB one using sixteen or eighteen 512Mx8.

Do you have an estimate on how common these 4GB and 8GB
configurations are in practice outside of the TI evaluation
board?

       Arnd
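The channel arithmetic above checks out; as a quick sketch (chip organisations are written depth x width, with the depth in units of 2^20 locations):

```python
DEPTH_UNIT = 1 << 20  # the "256M" in 256Mx16 means 256 * 2^20 locations

def chip_bytes(depth_m, width_bits):
    """Capacity in bytes of one DRAM chip organised as depth x width."""
    return depth_m * DEPTH_UNIT * width_bits // 8

# eight 256Mx16 DDR3 chips (512MiB each) -> a 4GB configuration
assert 8 * chip_bytes(256, 16) == 4 << 30
# sixteen 512Mx8 chips (also 512MiB each) -> an 8GB configuration
assert 16 * chip_bytes(512, 8) == 8 << 30
print(8 * chip_bytes(256, 16) >> 30, 16 * chip_bytes(512, 8) >> 30)  # prints: 4 8
```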


* Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
  2020-02-26 21:01                   ` Arnd Bergmann
@ 2020-02-26 21:11                     ` santosh.shilimkar
  0 siblings, 0 replies; 35+ messages in thread
From: santosh.shilimkar @ 2020-02-26 21:11 UTC (permalink / raw)
  To: Arnd Bergmann, Nishanth Menon, Tero Kristo
  Cc: Russell King - ARM Linux admin, Linus Torvalds, Michal Hocko,
	Rik van Riel, Catalin Marinas, kernel-team, Dave Chinner,
	Linux Kernel Mailing List, Linux-MM, Yafang Shao, Al Viro,
	Johannes Weiner, linux-fsdevel, Andrew Morton, Roman Gushchin,
	Linux ARM, Kishon Vijay Abraham I, Santosh Shilimkar

+Nishant, Tero

On 2/26/20 1:01 PM, Arnd Bergmann wrote:
> On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote:
>>
>> On 2/13/20 8:52 AM, Arnd Bergmann wrote:
>>> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
>>> <linux@armlinux.org.uk> wrote:
>>
>> The Keystone generations of SOCs have been used in different areas and
>> they will be used for long unless says otherwise.
>>
>> Apart from just split of lowmem and highmem, one of the peculiar thing
>> with the Keystone family of SOCs is the DDR is addressable from two
>> addressing ranges. The lowmem address range is actually non-cached
>> range and the higher range is the cacheable.
> 
> I'm aware of Keystone's special physical memory layout, but this is
> actually irrelevant for the discussion about highmem here, which is
> only about the way we map all or part of the available physical
> memory into the 4GB of virtual address space.
> 
> The far more important question is how much memory any users
> (in particular the subset that are going to update their kernels
> several years from now) actually have installed. Keystone-II is
> one of the rare 32-bit chips with fairly wide memory interfaces,
> having two 72-bit (with ECC) channels rather than the usual one
> or two channels of 32-bit DDR3. This means a relatively cheap
> 4GB configuration using eight 256Mx16 chips is possible, or
> even an 8GB one using sixteen or eighteen 512Mx8.
> 
> Do you have an estimate on how common these 4GB and 8GB
> configurations are in practice outside of the TI evaluation
> board?
> 
From my TI memories, many K2 customers were going to install
more than 2G of memory. I don't remember 8G, but 4G was the dominant
one AFAIR. Will let Nishant/Tero give the latest on this.

regards,
Santosh


Thread overview: 35+ messages
2020-02-11 17:55 [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
2020-02-11 18:20 ` Johannes Weiner
2020-02-11 19:05 ` Rik van Riel
2020-02-11 19:31   ` Johannes Weiner
2020-02-11 23:44     ` Andrew Morton
2020-02-12  0:28       ` Linus Torvalds
2020-02-12  0:47         ` Andrew Morton
2020-02-12  1:03           ` Linus Torvalds
2020-02-12  8:50             ` Russell King - ARM Linux admin
2020-02-13  9:50               ` Lucas Stach
2020-02-13 16:52               ` Arnd Bergmann
2020-02-15 11:25                 ` Geert Uytterhoeven
2020-02-15 16:59                   ` Arnd Bergmann
2020-02-16  9:44                     ` Geert Uytterhoeven
2020-02-16 19:54                       ` Chris Paterson
2020-02-16 20:38                         ` Arnd Bergmann
2020-02-20 14:35                           ` Chris Paterson
2020-02-26 18:04                 ` santosh.shilimkar
2020-02-26 21:01                   ` Arnd Bergmann
2020-02-26 21:11                     ` santosh.shilimkar
2020-02-12  3:58         ` Matthew Wilcox
2020-02-12  8:09         ` Michal Hocko
2020-02-17 13:31         ` Pavel Machek
2020-02-12 16:35       ` Johannes Weiner
2020-02-12 18:26         ` Andrew Morton
2020-02-12 18:52           ` Johannes Weiner
2020-02-12 12:25 ` Yafang Shao
2020-02-12 16:42   ` Johannes Weiner
2020-02-13  1:47     ` Yafang Shao
2020-02-13 13:46       ` Johannes Weiner
2020-02-14  2:02         ` Yafang Shao
2020-02-13 18:34 ` [PATCH v2] " Johannes Weiner
2020-02-14 16:53 ` [PATCH] " kbuild test robot
2020-02-14 21:30 ` kbuild test robot
2020-02-14 21:30 ` [PATCH] vfs: fix boolreturn.cocci warnings kbuild test robot
