Date: Mon, 14 Jun 2021 14:59:12 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin, Tejun Heo, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 kernel-team@fb.com, Dave Chinner
Subject: Re: [PATCH 4/4] vfs: keep inodes with page cache off the inode shrinker LRU
Message-Id: <20210614145912.feb751df928f38476048ec15@linux-foundation.org>
In-Reply-To: <20210614211904.14420-4-hannes@cmpxchg.org>
References: <20210614211904.14420-1-hannes@cmpxchg.org>
 <20210614211904.14420-4-hannes@cmpxchg.org>

On Mon, 14 Jun 2021 17:19:04 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Historically (pre-2.5), the inode shrinker used to reclaim only empty
> inodes and skip over those that still contained page cache. This
> caused problems on highmem hosts: struct inode could fill lowmem
> zones before the cache was getting reclaimed in the highmem zones.
>
> To address this, the inode shrinker started to strip page cache to
> facilitate reclaiming lowmem. However, this comes with its own set of
> problems: the shrinkers may drop actively used page cache just
> because the inodes are not currently open or dirty - think working
> with a large git tree. It further doesn't respect cgroup memory
> protection settings and can cause priority inversions between
> containers.
>
> Nowadays, the page cache also holds non-resident info for evicted
> cache pages in order to detect refaults. We've come to rely heavily
> on this data inside reclaim for protecting the cache workingset and
> driving swap behavior. We also use it to quantify and report
> workload health through psi. The latter in turn is used for fleet
> health monitoring, as well as for driving automated memory sizing of
> workloads and containers, proactive reclaim, and memory offloading
> schemes.
>
> The consequence of dropping page cache prematurely is that we're
> seeing subtle and not-so-subtle failures in all of the
> above-mentioned scenarios, with the workload generally entering
> unexpected thrashing states while losing the ability to reliably
> detect it.
>
> To fix this on non-highmem systems at least, going back to rotating
> inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
> ("mm: don't reclaim inodes with many attached pages")) and failed
> (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
> attached pages"")). The issue is mostly that shrinker pools attract
> pressure based on their size, and when objects get skipped, the
> shrinkers remember this as deferred reclaim work. This accumulates
> excessive pressure on the remaining inodes, and we can quickly eat
> into heavily used ones, or dirty ones that require IO to reclaim,
> while there is potentially still plenty of cold, clean cache around.
>
> Instead, this patch keeps populated inodes off the inode LRU in the
> first place - just like an open file or dirty state would. An
> otherwise clean and unused inode then gets queued when the last
> cache entry disappears. This solves the problem without
> reintroducing the reclaim issues, and generally is a bit more
> scalable than having to wade through potentially hundreds of
> thousands of busy inodes.
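My mental model of that last paragraph, as a simplified sketch rather
than the actual diff - page_cache_delete_and_queue() and
inode_lru_eligible() are stand-in names, not the patch's real hooks:

	/* Hypothetical helper, not from the patch: an inode is LRU
	 * material only once it has no users, no dirty/freeing state,
	 * and no page cache left in its mapping. */
	static bool inode_lru_eligible(struct inode *inode)
	{
		return !atomic_read(&inode->i_count) &&
		       !(inode->i_state &
			 (I_DIRTY_ALL | I_FREEING | I_WILL_FREE)) &&
		       mapping_empty(&inode->i_data);
	}

	/* Sketch of the deletion path: i_lock is taken before the
	 * irq-safe i_pages lock (see the locking notes below). */
	static void page_cache_delete_and_queue(struct address_space *mapping,
						struct page *page)
	{
		struct inode *inode = mapping->host;

		spin_lock(&inode->i_lock);
		xa_lock_irq(&mapping->i_pages);
		__delete_from_page_cache(page, NULL);
		xa_unlock_irq(&mapping->i_pages);

		/* The last cache entry just disappeared: an otherwise
		 * clean and unused inode becomes visible to the
		 * shrinker only now. */
		if (mapping_empty(mapping) && inode_lru_eligible(inode))
			inode_add_lru(inode);
		spin_unlock(&inode->i_lock);
	}

Because the "mapping went empty" test and the LRU add happen under
i_lock, they form a single atomic transition as far as a concurrent
iput() or shrinker is concerned.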
> Locking is a bit tricky because the locks protecting the inode state
> (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
> irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
> serialized through i_lock, taken before the i_pages lock, to make
> sure depopulated inodes are queued reliably. Additions may race with
> deletions, but we'll check again in the shrinker. If additions race
> with the shrinker itself, we're protected by the i_lock: if
> find_inode() or iput() wins, the shrinker will bail on the elevated
> i_count or I_REFERENCED; if the shrinker wins and goes ahead with
> the inode, it will set I_FREEING and inhibit further igets(), which
> will cause the other side to create a new instance of the inode
> instead.

And what hitherto unexpected problems will this one cause, sigh. How
exhaustively has this approach been tested?
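For the record, my reading of the race handling described above,
again as a rough sketch rather than the real code - the actual
isolate callback will have more housekeeping (nr_unused accounting,
buffer/cache invalidation, retry handling):

	/* Sketch: the LRU walk revalidates each inode under i_lock,
	 * since LRU additions may have raced with page cache
	 * insertions or with find_inode()/iput(). */
	static enum lru_status inode_lru_isolate(struct list_head *item,
			struct list_lru_one *lru, spinlock_t *lru_lock,
			void *arg)
	{
		struct inode *inode = container_of(item, struct inode, i_lru);

		if (!spin_trylock(&inode->i_lock))
			return LRU_SKIP;

		/* find_inode() or iput() won the race: take the inode
		 * off the LRU and leave it alone. */
		if (atomic_read(&inode->i_count) ||
		    (inode->i_state & I_REFERENCED)) {
			list_lru_isolate(lru, &inode->i_lru);
			spin_unlock(&inode->i_lock);
			return LRU_REMOVED;
		}

		/* A racing cache insertion repopulated the mapping:
		 * skip it; it gets requeued when emptied again. */
		if (!mapping_empty(&inode->i_data)) {
			spin_unlock(&inode->i_lock);
			return LRU_ROTATE;
		}

		/* The shrinker won: I_FREEING inhibits further
		 * igets(), so racing lookups build a new instance. */
		inode->i_state |= I_FREEING;
		list_lru_isolate(lru, &inode->i_lru);
		spin_unlock(&inode->i_lock);
		return LRU_REMOVED;
	}

The key property is that every decision is made under i_lock, so
whichever side gets there first determines whether the inode
survives.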