From: Andrew Morton <akpm@linux-foundation.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>, Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH 4/4] vfs: keep inodes with page cache off the inode shrinker LRU
Date: Mon, 14 Jun 2021 14:59:12 -0700
Message-ID: <20210614145912.feb751df928f38476048ec15@linux-foundation.org>
In-Reply-To: <20210614211904.14420-4-hannes@cmpxchg.org>

On Mon, 14 Jun 2021 17:19:04 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Historically (pre-2.5), the inode shrinker used to reclaim only empty
> inodes and skip over those that still contained page cache. This
> caused problems on highmem hosts: struct inode objects could fill the
> lowmem zones before the cache was reclaimed in the highmem zones.
> 
> To address this, the inode shrinker started to strip page cache to
> facilitate reclaiming lowmem. However, this comes with its own set of
> problems: the shrinkers may drop actively used page cache just because
> the inodes are not currently open or dirty - think working with a
> large git tree. It further doesn't respect cgroup memory protection
> settings and can cause priority inversions between containers.
> 
> Nowadays, the page cache also holds non-resident info for evicted
> cache pages in order to detect refaults. We've come to rely heavily on
> this data inside reclaim for protecting the cache workingset and
> driving swap behavior. We also use it to quantify and report workload
> health through psi. The latter in turn is used for fleet health
> monitoring, as well as driving automated memory sizing of workloads
> and containers, proactive reclaim and memory offloading schemes.
> 
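For context on the "non-resident info": when a cache page is evicted, its
slot in the mapping's xarray isn't simply cleared but replaced with a small
shadow value recording when it left, so a later access to the same offset
can be detected as a refault. A fragmentary sketch using the existing
workingset entry points (the wrapper names here are made up, and locking
and page state handling are omitted):

/* Eviction: leave a shadow entry behind in the page's slot. */
static void evict_leaving_shadow(struct page *page, struct mem_cgroup *memcg)
{
	void *shadow = workingset_eviction(page, memcg);

	/* Caller holds the i_pages lock; the slot now holds a value entry. */
	__delete_from_page_cache(page, shadow);
}

/* Later fault at the same offset: feed the shadow back to reclaim/psi. */
static void note_possible_refault(struct page *new_page, void *entry)
{
	if (xa_is_value(entry))		/* a shadow, not a present page */
		workingset_refault(new_page, entry);
}
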
> The consequence of dropping page cache prematurely is that we're
> seeing subtle and not-so-subtle failures in all of the above-mentioned
> scenarios, with the workload generally entering unexpected thrashing
> states while losing the ability to reliably detect it.
> 
> To fix this, at least on non-highmem systems, simply going back to
> rotating populated inodes on the LRU isn't feasible. We've tried
> (commit a76cf1a474d7 ("mm: don't reclaim inodes with many attached
> pages")) and failed (commit 69056ee6a8a3 ("Revert "mm: don't reclaim
> inodes with many attached pages"")). The issue is mostly that
> shrinker pools attract
> pressure based on their size, and when objects get skipped the
> shrinkers remember this as deferred reclaim work. This accumulates
> excessive pressure on the remaining inodes, and we can quickly eat
> into heavily used ones, or dirty ones that require IO to reclaim, even
> when there may still be plenty of cold, clean cache around.
> 
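A rough model of that deferral behavior (purely illustrative, not the
actual mm/vmscan.c code): the scan target is proportional to the pool
size, and whatever a run skips is carried over and added on top of the
next one, which is how pressure snowballs onto the inodes that remain
on the LRU.

/*
 * Toy model of size-proportional shrinker pressure with deferred work.
 * Names and arithmetic are simplified; only the feedback loop matters.
 */
struct shrinker_model {
	unsigned long nr_deferred;	/* skipped work carried over */
};

static unsigned long scan_target(struct shrinker_model *s,
				 unsigned long freeable, /* objects on the LRU */
				 int priority)		 /* reclaim priority */
{
	unsigned long delta = freeable >> priority; /* bigger pool, more pressure */

	return s->nr_deferred + delta;
}

static void account_run(struct shrinker_model *s,
			unsigned long target, unsigned long scanned)
{
	/* Anything not scanned this time piles up for the next run. */
	s->nr_deferred = target > scanned ? target - scanned : 0;
}
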
> Instead, this patch keeps populated inodes off the inode LRU in the
> first place - just like an open file or dirty state would. An
> otherwise clean and unused inode then gets queued when the last cache
> entry disappears. This solves the problem without reintroducing the
> reclaim issues, and generally is a bit more scalable than having to
> wade through potentially hundreds of thousands of busy inodes.
> 
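A minimal sketch of that gating, modeled on the existing
inode_add_lru()/inode_lru_list_add() helpers in fs/inode.c and the
mapping_empty() primitive (the actual patch may differ in detail):

/*
 * Sketch: an inode only goes onto the shrinker LRU if it is clean,
 * unreferenced, on an active sb, AND its page cache - including shadow
 * entries - is empty. A populated inode stays off the list, just like
 * an open or dirty one would.
 */
static void inode_add_lru(struct inode *inode)
{
	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
				I_FREEING | I_WILL_FREE)) &&
	    !atomic_read(&inode->i_count) &&
	    inode->i_sb->s_flags & SB_ACTIVE &&
	    mapping_empty(&inode->i_data))	/* the new condition */
		inode_lru_list_add(inode);
}
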
> Locking is a bit tricky because the locks protecting the inode state
> (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
> irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
> serialized through i_lock, taken before the i_pages lock, to make sure
> depopulated inodes are queued reliably. Additions may race with
> deletions, but we'll check again in the shrinker. If additions race
> with the shrinker itself, we're protected by the i_lock: if
> find_inode() or iput() win, the shrinker will bail on the elevated
> i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
> inode, it will set I_FREEING and inhibit further igets(), which will
> cause the other side to create a new instance of the inode instead.
> 
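To make the lock ordering concrete, the deletion path described above
would look roughly like this (illustrative only: the wrapper name is made
up, and it assumes inode_add_lru() from the sketch above is made visible
to mm/ code):

/*
 * i_lock nests OUTSIDE the irq-safe i_pages lock, so the "last cache
 * entry gone -> queue the inode for the shrinker" transition is
 * serialized against find_inode()/iput() and against the shrinker.
 */
static void delete_page_and_maybe_queue_inode(struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;

	spin_lock(&inode->i_lock);		/* inode state + LRU add */
	xa_lock_irq(&mapping->i_pages);		/* irq-safe page cache lock */
	__delete_from_page_cache(page, NULL);	/* existing mm/filemap.c helper */
	xa_unlock_irq(&mapping->i_pages);

	/* Depopulated and otherwise idle? Now it may go on the LRU. */
	if (mapping_empty(mapping))
		inode_add_lru(inode);
	spin_unlock(&inode->i_lock);
}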

And what hitherto unexpected problems will this one cause, sigh.

How exhaustively has this approach been tested?

Thread overview: 17+ messages
2021-06-14 21:19 [PATCH 1/4] mm: remove irqsave/restore locking from contexts with irqs enabled Johannes Weiner
2021-06-14 21:19 ` [PATCH 2/4] fs: drop_caches: fix skipping over shadow cache inodes Johannes Weiner
2021-06-14 22:31   ` Roman Gushchin
2021-06-14 21:19 ` [PATCH 3/4] fs: inode: count invalidated shadow pages in pginodesteal Johannes Weiner
2021-06-14 21:19 ` [PATCH 4/4] vfs: keep inodes with page cache off the inode shrinker LRU Johannes Weiner
2021-06-14 21:59   ` Andrew Morton [this message]
2021-06-14 22:41     ` Johannes Weiner
2021-06-15  6:26   ` Dave Chinner
2021-06-15 18:50     ` Johannes Weiner
2021-06-16  1:20       ` Dave Chinner
2021-06-16  4:54         ` Johannes Weiner
2021-06-17  0:49           ` Dave Chinner
2021-06-18 16:45             ` Johannes Weiner
2021-06-28 18:58               ` Johannes Weiner
2021-06-17  1:30           ` Andrew Morton
2021-06-18 17:09             ` Johannes Weiner
2021-06-28 17:12         ` Johannes Weiner
