From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.7 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 63FE6CA90AF for ; Wed, 13 May 2020 13:00:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 38F4624932 for ; Wed, 13 May 2020 13:00:25 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=cmpxchg-org.20150623.gappssmtp.com header.i=@cmpxchg-org.20150623.gappssmtp.com header.b="ueBZ7Xbb" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729437AbgEMNAY (ORCPT ); Wed, 13 May 2020 09:00:24 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50806 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727033AbgEMNAX (ORCPT ); Wed, 13 May 2020 09:00:23 -0400 Received: from mail-qk1-x744.google.com (mail-qk1-x744.google.com [IPv6:2607:f8b0:4864:20::744]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4CEC9C061A0C for ; Wed, 13 May 2020 06:00:23 -0700 (PDT) Received: by mail-qk1-x744.google.com with SMTP id y22so4522652qki.3 for ; Wed, 13 May 2020 06:00:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=+cW70ChImxuqV2RBsl6w7NV4YFDYwLH2CWpZPYn8bwc=; b=ueBZ7XbbZpEcGa6ncc1N99qMbGVjWNDNqkGunZjHci2h023ddhN8dJF1xJmuaONvN+ 9TslYt/mBrZfFRMUkZb5agEyvyh5UhClFCX8DxTW9eEizKKEOfRycRQMTr4rMIjDot/o 2S1/8fFrfFxOfEbPhC5ryqLCeT+ahFmll6lbagvvUuCl/dCAQvux7Imp3edG0zObITXY FQdfjaibsbo0zy2Y98ouPjVoqSnHj+N+OSzDvhSoWupA8W4KCX+NwYzCuc9mnmYL8jEO OEnsjfGAuj6B/HOEA/3Mo835BKZDDu9hEyQNf++9enhJKbUQWDpFnZcS+rTSXyWFFyXA 5VQg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=+cW70ChImxuqV2RBsl6w7NV4YFDYwLH2CWpZPYn8bwc=; b=SyHScNkUXO/R31m78wFeuouNkw3BrtKQpi8lZcO7KNpGcLY2KuMEq7/Q3wIoOF7oQ4 DOFYEYGMKXQNFIu20d3fy2rThRMwgP8decD5SxtGNVxfHCI8IlBjiyQyGqR6/gaYvVeO YSAV6kHQhnh1l45WHfJ5i6IcQLBaBY+wZdYgHDWeVVwIset0XCWnVp4kZrPywBV5c5UT U5gVlo5NtNmdXyBFtGUd4nSBqX85NN/FOSQbEJW6tLDvIEuiIFTXqdLt9b2bYbNEWTkf nUB83GR68mLWHInaWwNJbNNgdnmFGWeoheyanC9VA3JOn6JrbdwQWYgd4LbG7t3vHeW2 D3ew== X-Gm-Message-State: AGi0PubCzhW6acuS1oBgcR6CoQLMudy6C9X+70zZYc8YISLq8NVLyGAS 1P2Pt8n9DuPExN8OY7FAZv5INg== X-Google-Smtp-Source: APiQypKwvGZuJBxRqVUHNYuvoj8QpY7uTR3OAQ4b8juIRpGo91ZqMW8ow8DkU+0C6ZNMsRFoLm9hcg== X-Received: by 2002:a37:84b:: with SMTP id 72mr15538452qki.252.1589374821767; Wed, 13 May 2020 06:00:21 -0700 (PDT) Received: from localhost ([2620:10d:c091:480::1:2627]) by smtp.gmail.com with ESMTPSA id t202sm13798497qke.97.2020.05.13.06.00.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 May 2020 06:00:20 -0700 (PDT) Date: Wed, 13 May 2020 09:00:02 -0400 From: Johannes Weiner To: Yafang Shao Cc: linux-fsdevel@vger.kernel.org, Linux MM , LKML , Dave Chinner , Michal Hocko , Roman Gushchin , Andrew Morton , Linus Torvalds , Al Viro , Kernel Team Subject: Re: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU Message-ID: <20200513130002.GC488426@cmpxchg.org> References: <20200211175507.178100-1-hannes@cmpxchg.org> <20200512212936.GA450429@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, May 13, 2020 at 09:32:58AM +0800, Yafang Shao wrote: > On Wed, May 13, 2020 at 5:29 AM Johannes Weiner wrote: > > > > On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote: > > > The VFS inode shrinker is currently allowed to reclaim inodes with > > > populated page cache. As a result it can drop gigabytes of hot and > > > active page cache on the floor without consulting the VM (recorded as > > > "inodesteal" events in /proc/vmstat). > > > > I'm sending a rebased version of this patch. > > > > We've been running with this change in the Facebook fleet since > > February with no ill side effects observed. > > > > However, I just spent several hours chasing a mysterious reclaim > > problem that turned out to be this bug again on an unpatched system. > > > > In the scenario I was debugging, the problem wasn't that we were > > losing cache, but that we were losing the non-resident information for > > previously evicted cache. > > > > I understood the file set enough to know it was thrashing like crazy, > > but it didn't register as refaults to the kernel. Without detecting > > the refaults, reclaim wouldn't start swapping to relieve the > > struggling cache (plenty of cold anon memory around). It also meant > > the IO delays of those refaults didn't contribute to memory pressure > > in psi, which made userspace blind to the situation as well. > > > > The first aspect means we can get stuck in pathological thrashing, the > > second means userspace OOM detection breaks and we can leave servers > > (or Android devices, for that matter) hopelessly livelocked. > > > > New patch attached below. I hope we can get this fixed in 5.8, it's > > really quite a big hole in our cache management strategy. > > > > --- > > From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001 > > From: Johannes Weiner > > Date: Tue, 11 Feb 2020 12:55:07 -0500 > > Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU > > > > The VFS inode shrinker is currently allowed to reclaim cold inodes > > with populated page cache. This behavior goes back to CONFIG_HIGHMEM > > setups, which required the ability to drop page cache in large highem > > zones to free up struct inodes in comparatively tiny lowmem zones. > > > > However, it has significant side effects that are hard to justify on > > systems without highmem: > > > > - It can drop gigabytes of hot and active page cache on the floor > > without consulting the VM (recorded as "inodesteal" events in > > /proc/vmstat). Such an "aging inversion" between unreferenced inodes > > holding hot cache easily happens in practice: for example, a git tree > > whose objects are accessed frequently but no open file descriptors are > > maintained throughout. > > > > Hi Johannes, > > I think it is reasonable to keep inodes with _active_ page cache off > the inode shrinker LRU, but I'm not sure whether it is proper to keep > the inodes with _only_ inactive page cache off the inode list lru > neither. Per my understanding, if the inode has only inactive page > cache, then invalidate all these inactive page cache could save the > reclaimer's time, IOW, it may improve the performance in this case. The shrinker doesn't know whether pages are active or inactive. There is a PageActive() flag, but that's a sampled state that's only uptodate when page reclaim is running. All the active pages could be stale and getting deactivated on the next scan; all the inactive pages could have page table references that would get them activated on the next reclaim run etc. You'd have to duplicate aspects of page reclaim itself to be sure you're axing the right pages. It also wouldn't be a reliable optimization. This only happens when there is a disconnect between the inode and the cache life time, which is true for some situations but not others.