Linux-Fsdevel Archive on lore.kernel.org
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: linux-fsdevel@vger.kernel.org, Linux MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Michal Hocko <mhocko@suse.com>, Roman Gushchin <guro@fb.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
Date: Wed, 13 May 2020 09:00:02 -0400
Message-ID: <20200513130002.GC488426@cmpxchg.org> (raw)
In-Reply-To: <CALOAHbAZ0eUmrBGt=J0cJZzPmDtPKpfMK0jrUNa0Z_-JfDLoXA@mail.gmail.com>

On Wed, May 13, 2020 at 09:32:58AM +0800, Yafang Shao wrote:
> On Wed, May 13, 2020 at 5:29 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote:
> > > The VFS inode shrinker is currently allowed to reclaim inodes with
> > > populated page cache. As a result it can drop gigabytes of hot and
> > > active page cache on the floor without consulting the VM (recorded as
> > > "inodesteal" events in /proc/vmstat).
> >
> > I'm sending a rebased version of this patch.
> >
> > We've been running with this change in the Facebook fleet since
> > February with no ill side effects observed.
> >
> > However, I just spent several hours chasing a mysterious reclaim
> > problem that turned out to be this bug again on an unpatched system.
> >
> > In the scenario I was debugging, the problem wasn't that we were
> > losing cache, but that we were losing the non-resident information for
> > previously evicted cache.
> >
> > I understood the file set enough to know it was thrashing like crazy,
> > but it didn't register as refaults to the kernel. Without detecting
> > the refaults, reclaim wouldn't start swapping to relieve the
> > struggling cache (plenty of cold anon memory around). It also meant
> > the IO delays of those refaults didn't contribute to memory pressure
> > in psi, which made userspace blind to the situation as well.
> >
> > The first aspect means we can get stuck in pathological thrashing, the
> > second means userspace OOM detection breaks and we can leave servers
> > (or Android devices, for that matter) hopelessly livelocked.
> >
> > New patch attached below. I hope we can get this fixed in 5.8, it's
> > really quite a big hole in our cache management strategy.
> >
> > ---
> > From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Tue, 11 Feb 2020 12:55:07 -0500
> > Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
> >
> > The VFS inode shrinker is currently allowed to reclaim cold inodes
> > with populated page cache. This behavior goes back to CONFIG_HIGHMEM
> > setups, which required the ability to drop page cache in large highmem
> > zones to free up struct inodes in comparatively tiny lowmem zones.
> >
> > However, it has significant side effects that are hard to justify on
> > systems without highmem:
> >
> > - It can drop gigabytes of hot and active page cache on the floor
> > without consulting the VM (recorded as "inodesteal" events in
> > /proc/vmstat). Such an "aging inversion", with unreferenced inodes
> > holding hot cache, easily happens in practice: for example, a git tree
> > whose objects are accessed frequently but for which no open file
> > descriptors are maintained throughout.
> >
> 
> Hi Johannes,
> 
> I think it is reasonable to keep inodes with _active_ page cache off
> the inode shrinker LRU, but I'm not sure whether it is proper to also
> keep inodes with _only_ inactive page cache off the inode LRU list.
> As I understand it, if an inode has only inactive page cache, then
> invalidating all of that inactive page cache could save the
> reclaimer's time; in other words, it may improve performance in this case.

The shrinker doesn't know whether pages are active or inactive.

There is a PageActive() flag, but that's a sampled state that's only
up to date when page reclaim is running. All the active pages could be
stale and get deactivated on the next scan; all the inactive pages
could have page table references that would get them activated on the
next reclaim run, etc.

You'd have to duplicate aspects of page reclaim itself to be sure
you're axing the right pages.
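
As a toy illustration of why the flag alone can mislead (a deliberately
simplified model, not kernel code; the Page class, reclaim_scan() and
looks_cold() are hypothetical names invented for this sketch):

```python
from dataclasses import dataclass


@dataclass
class Page:
    active: bool      # list membership as of the last reclaim scan
    referenced: bool  # accesses since that scan, invisible to the shrinker


def reclaim_scan(pages):
    """Toy scan: promote referenced pages, demote the rest, reset the bits."""
    for page in pages:
        page.active = page.referenced
        page.referenced = False


def looks_cold(pages):
    """What a shrinker could conclude from the sampled flag alone."""
    return not any(page.active for page in pages)


# An inode whose pages were idle at the last scan but are being read again.
inode_pages = [Page(active=False, referenced=True) for _ in range(4)]

before = looks_cold(inode_pages)  # the flag says "only inactive cache"
reclaim_scan(inode_pages)
after = looks_cold(inode_pages)   # the next scan would have activated them
print(before, after)
```

In this model the inode looks droppable right up until the next scan
runs, at which point every one of its pages turns out to be in use.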

It also wouldn't be a reliable optimization. This only happens when
there is a disconnect between the inode and the cache lifetime, which
is true for some situations but not others.
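
To check whether a system is hitting the "inodesteal" events mentioned
in the patch description, one option is to diff two snapshots of
/proc/vmstat. The sketch below assumes the counters appear under the
names pginodesteal and kswapd_inodesteal; the helper names and the
sample numbers are made up for illustration, and on a live system the
two strings would come from reading /proc/vmstat at two points in time.

```python
def parse_vmstat(text):
    """Parse "name value" lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def inodesteal_delta(before, after):
    """Growth of the inode-steal counters between two snapshots."""
    names = ("pginodesteal", "kswapd_inodesteal")
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}


# Made-up snapshots; on a real machine use open("/proc/vmstat").read().
SAMPLE_BEFORE = "pginodesteal 10\nkswapd_inodesteal 200\n"
SAMPLE_AFTER = "pginodesteal 10\nkswapd_inodesteal 5200\n"

delta = inodesteal_delta(parse_vmstat(SAMPLE_BEFORE),
                         parse_vmstat(SAMPLE_AFTER))
print(delta)
```

A steadily growing delta under memory pressure would suggest that page
cache is being dropped along with reclaimed inodes.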

