From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 19134C433DF for ; Thu, 15 Oct 2020 21:20:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AD27D2076A for ; Thu, 15 Oct 2020 21:20:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726452AbgJOVUZ (ORCPT ); Thu, 15 Oct 2020 17:20:25 -0400 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:38315 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726018AbgJOVUZ (ORCPT ); Thu, 15 Oct 2020 17:20:25 -0400 Received: from dread.disaster.area (pa49-179-6-140.pa.nsw.optusnet.com.au [49.179.6.140]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 33B0758C6BE; Fri, 16 Oct 2020 08:20:23 +1100 (AEDT) Received: from dave by dread.disaster.area with local (Exim 4.92.3) (envelope-from ) id 1kTAfe-000uqv-3d; Fri, 16 Oct 2020 08:20:22 +1100 Date: Fri, 16 Oct 2020 08:20:22 +1100 From: Dave Chinner To: "Darrick J. Wong" Cc: linux-xfs@vger.kernel.org Subject: Re: [PATCH 23/27] libxfs: use PSI information to detect memory pressure Message-ID: <20201015212022.GI7391@dread.disaster.area> References: <20201015072155.1631135-1-david@fromorbit.com> <20201015072155.1631135-24-david@fromorbit.com> <20201015175611.GY9832@magnolia> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20201015175611.GY9832@magnolia> X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=uDU3YIYVKEaHT0eX+MXYOQ==:117 a=uDU3YIYVKEaHT0eX+MXYOQ==:17 a=kj9zAlcOel0A:10 a=afefHYAZSVUA:10 a=7-415B0cAAAA:8 a=OR8kNt25WPBK8Lc6AbQA:9 a=CjuIK1q_8ugA:10 a=biEYGPWJfzWAr4FL6Ov7:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Thu, Oct 15, 2020 at 10:56:11AM -0700, Darrick J. Wong wrote: > On Thu, Oct 15, 2020 at 06:21:51PM +1100, Dave Chinner wrote: > > @@ -74,6 +196,8 @@ xfs_buftarg_alloc( > > btp->bt_mount = mp; > > btp->bt_fd = libxfs_device_to_fd(bdev); > > btp->bt_bdev = bdev; > > + btp->bt_psi_fd = -1; > > + btp->bt_exiting = false; > > > > if (xfs_buftarg_setsize_early(btp)) > > goto error_free; > > @@ -84,8 +208,13 @@ xfs_buftarg_alloc( > > if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL)) > > goto error_lru; > > > > + if (xfs_buftarg_mempressue_init(btp)) > > So what happens if PSI isn't enabled or procfs isn't mounted yet? > xfs_repair just ... fails? That seems disappointing, particularly if > the admin is trying to fix a dead root fs from the initramfs premount > shell and /proc isn't set up yet. Yes, right now it just fails. I'm more interested right now in getting the new infrastructure working such that the kernel buffer cache "just works" when there's more metadata than RAM to cache it in. > Hmm, looks like Debian actually /does/ set up procfs nowadays. Still, > if we're going to add a hard requirement on CONFIG_PSI=y and > CONFIG_PSI_DEFAULT_DISABLED=n, we need to advertise this kind of loudly. > > (Personally, I thought that if there's no pressure stall information, > we'd just fall back to not having a shrinker and daring the system to > OOM us like it does now...) Well, the existing buffer cache does have a shrinker mechanism - it will shake the cache down when it is full to free up old buffers. That's what all the MRU lists and buffer priority stuff in the repair prefetch code is all about. repair tries to bound the maximum size of the buffer cache and prevent OOM that way. If it calculates that the memory requirement is larger than RAM, that's when it gets into OOM trouble because we still allow it to use lots of memory and then just hope... I kind of want to get away from all those messy static heuristics. I'd much prefer that we do dynamic cache growth detection and size calculations in repair and determine if we should purge the cache at the end of each AG or retain it in RAM. i.e. if ((per ag cache size * no. of AGs) > 75% RAM) then purge the AG cache when the phase scan is done. This way we run with minimal caching (just what is needed for prefetching to be efficient) when it is likely we can't fit all the metadata in RAM, and otherwise we behave like we currently do. That sort of setup will go a long way to avoiding OOM kill and the need for actual memory shrinkers to activate. This mode could be activated if the PSI infomration is not there, hence might also solve most of the rescue situation problems. Cheers, Dave. -- Dave Chinner david@fromorbit.com