From: Johannes Weiner <firstname.lastname@example.org>
To: Yafang Shao <email@example.com>
Cc: firstname.lastname@example.org, Linux MM <email@example.com>,
    LKML <firstname.lastname@example.org>, Dave Chinner <email@example.com>,
    Michal Hocko <firstname.lastname@example.org>,
    Roman Gushchin <email@example.com>,
    Andrew Morton <firstname.lastname@example.org>,
    Linus Torvalds <email@example.com>,
    Al Viro <firstname.lastname@example.org>, Kernel Team <email@example.com>
Subject: Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
Date: Wed, 13 May 2020 09:00:02 -0400
Message-ID: <20200513130002.GC488426@cmpxchg.org>
In-Reply-To: <CALOAHbAZ0eUmrBGt=J0cJZzPmDtPKpfMK0jrUNa0Z_-JfDLoXA@mail.gmail.com>

On Wed, May 13, 2020 at 09:32:58AM +0800, Yafang Shao wrote:
> On Wed, May 13, 2020 at 5:29 AM Johannes Weiner <firstname.lastname@example.org> wrote:
> >
> > On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote:
> > > The VFS inode shrinker is currently allowed to reclaim inodes with
> > > populated page cache. As a result it can drop gigabytes of hot and
> > > active page cache on the floor without consulting the VM (recorded as
> > > "inodesteal" events in /proc/vmstat).
> >
> > I'm sending a rebased version of this patch.
> >
> > We've been running with this change in the Facebook fleet since
> > February with no ill side effects observed.
> >
> > However, I just spent several hours chasing a mysterious reclaim
> > problem that turned out to be this bug again on an unpatched system.
> >
> > In the scenario I was debugging, the problem wasn't that we were
> > losing cache, but that we were losing the non-resident information for
> > previously evicted cache.
> >
> > I understood the file set enough to know it was thrashing like crazy,
> > but it didn't register as refaults to the kernel. Without detecting
> > the refaults, reclaim wouldn't start swapping to relieve the
> > struggling cache (plenty of cold anon memory around). It also meant
> > the IO delays of those refaults didn't contribute to memory pressure
> > in psi, which made userspace blind to the situation as well.
> >
> > The first aspect means we can get stuck in pathological thrashing, the
> > second means userspace OOM detection breaks and we can leave servers
> > (or Android devices, for that matter) hopelessly livelocked.
> >
> > New patch attached below. I hope we can get this fixed in 5.8, it's
> > really quite a big hole in our cache management strategy.
> >
> > ---
> > From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <email@example.com>
> > Date: Tue, 11 Feb 2020 12:55:07 -0500
> > Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
> >
> > The VFS inode shrinker is currently allowed to reclaim cold inodes
> > with populated page cache. This behavior goes back to CONFIG_HIGHMEM
> > setups, which required the ability to drop page cache in large highem
> > zones to free up struct inodes in comparatively tiny lowmem zones.
> >
> > However, it has significant side effects that are hard to justify on
> > systems without highmem:
> >
> > - It can drop gigabytes of hot and active page cache on the floor
> > without consulting the VM (recorded as "inodesteal" events in
> > /proc/vmstat). Such an "aging inversion" between unreferenced inodes
> > holding hot cache easily happens in practice: for example, a git tree
> > whose objects are accessed frequently but no open file descriptors are
> > maintained throughout.
> >
>
> Hi Johannes,
>
> I think it is reasonable to keep inodes with _active_ page cache off
> the inode shrinker LRU, but I'm not sure whether it is proper to keep
> the inodes with _only_ inactive page cache off the inode list lru
> neither. Per my understanding, if the inode has only inactive page
> cache, then invalidate all these inactive page cache could save the
> reclaimer's time, IOW, it may improve the performance in this case.
The shrinker doesn't know whether pages are active or inactive.

There is a PageActive() flag, but that's a sampled state that's only
up to date when page reclaim is running. All the active pages could be
stale and get deactivated on the next scan; all the inactive pages
could have page table references that would get them activated on the
next reclaim run, etc. You'd have to duplicate aspects of page reclaim
itself to be sure you're axing the right pages.

It also wouldn't be a reliable optimization. This only happens when
there is a disconnect between the inode and the cache lifetime, which
is true for some situations but not others.
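
To make the point about the sampled flag concrete, here is a rough toy
model in Python. It is not kernel code: the Page class, scan() and the
two helper functions are hypothetical names invented for this sketch,
and the rule "referenced since the last scan means active" is a
deliberate simplification of what page reclaim really does. It only
shows that a flag refreshed solely during a scan can point the wrong
way in both directions between scans.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Stand-in for the PageActive() flag; only scan() ever updates it.
    active: bool
    # Stand-in for a page table reference made since the last scan.
    referenced: bool


def scan(pages):
    """Toy reclaim pass: the only place where `active` is refreshed."""
    for page in pages:
        # Simplifying assumption: one reference is enough to be active.
        page.active = page.referenced
        page.referenced = False


def looks_cold(pages):
    """What a shrinker could check on its own: the stored flags only."""
    return not any(page.active for page in pages)


def is_cold_after_scan(pages):
    """What reclaim concludes once it has actually rescanned the pages."""
    scan(pages)
    return not any(page.active for page in pages)


# Inode A: flags say "all inactive", but every page was touched since.
inode_a = [Page(active=False, referenced=True) for _ in range(4)]
# Inode B: flags say "all active", but nothing has touched the pages.
inode_b = [Page(active=True, referenced=False) for _ in range(4)]

stale_a = looks_cold(inode_a)
stale_b = looks_cold(inode_b)
fresh_a = is_cold_after_scan(inode_a)
fresh_b = is_cold_after_scan(inode_b)

print("inode A: flags say cold =", stale_a, "/ after scan cold =", fresh_a)
print("inode B: flags say cold =", stale_b, "/ after scan cold =", fresh_b)
```

In this model, a shrinker trusting the stored flags would drop inode A's
cache even though it is in use, and would spare inode B's cache even
though it has gone cold. Getting the right answer requires running the
scan, which is the duplication of page reclaim mentioned above.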