Date: Wed, 9 Feb 2022 19:53:37 -0700
From: Yu Zhao
To: Johannes Weiner
Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen, Aneesh Kumar,
	Barry Song <21cnbao@gmail.com>, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	page-reclaim@google.com, x86@kernel.org, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh
Subject: Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
References: <20220208081902.3550911-1-yuzhao@google.com>
	<20220208081902.3550911-6-yuzhao@google.com>

On Tue, Feb 08, 2022 at 11:50:09AM -0500, Johannes Weiner wrote:
> On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:
> > To avoid confusions, the terms "promotion" and "demotion" will be
> > applied to the multigenerational LRU, as a new convention; the terms
> > "activation" and "deactivation" will be applied to the active/inactive
> > LRU, as usual.
> >
> > The aging produces young generations. Given an lruvec, it increments
> > max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> > promotes hot pages to the youngest generation when it finds them
> > accessed through page tables; the demotion of cold pages happens
> > consequently when it increments max_seq. Since the aging is only
> > interested in hot pages, its complexity is O(nr_hot_pages). Promotion
> > in the aging path doesn't require any LRU list operations, only the
> > updates of the gen counter and lrugen->nr_pages[]; demotion, unless
> > as the result of the increment of max_seq, requires LRU list
> > operations, e.g., lru_deactivate_fn().
>
> I'm having trouble with this changelog.
> It opens with a footnote and
> summarizes certain aspects of the implementation whose importance to
> the reader aren't entirely clear at this time.
>
> It would be better to start with a high-level overview of the problem
> and how this algorithm solves it. How the reclaim algorithm needs to
> find the page that is most suitable for eviction and to signal when
> it's time to give up and OOM. Then explain how grouping pages into
> multiple generations accomplishes that - in particular compared to the
> current two use-once/use-many lists.

Hi Johannes,

Thanks for reviewing!

I suspect the information you are looking for is already in the
patchset but scattered across a few places. Could you please glance at
the following pieces and let me know
  1. whether they cover some of the points you asked for
  2. and if so, whether there is a better order/place to present them?

The previous patch gives a quick overview of the architecture:
https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/

  Evictable pages are divided into multiple generations for each lruvec.
  The youngest generation number is stored in lrugen->max_seq for both
  anon and file types as they're aged on an equal footing. The oldest
  generation numbers are stored in lrugen->min_seq[] separately for anon
  and file types as clean file pages can be evicted regardless of swap
  constraints. These three variables are monotonically increasing.

  Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
  in order to fit into the gen counter in folio->flags. Each truncated
  generation number is an index to lrugen->lists[]. The sliding window
  technique is used to track at least MIN_NR_GENS and at most
  MAX_NR_GENS generations. The gen counter stores (seq%MAX_NR_GENS)+1
  while a page is on one of lrugen->lists[]. Otherwise it stores 0.
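
To make the encoding above concrete, here is a minimal userspace sketch.
It is not code from the patchset: the helper names are hypothetical, and
MAX_NR_GENS is assumed to be 4, the NR_LRU_GENS default in the Kconfig
hunk of this patch.

```c
#include <assert.h>

/* Assumption: the Kconfig default for NR_LRU_GENS ("default 4"). */
#define MAX_NR_GENS 4UL

/* Hypothetical helper: truncated generation number, i.e. the index
 * into lrugen->lists[] for a given sequence number. */
static unsigned long gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/*
 * Hypothetical helper: the value kept in the gen counter in
 * folio->flags. 0 is reserved for "not on any of lrugen->lists[]",
 * which is why the stored value is the truncated number plus one.
 */
static unsigned long gen_counter(unsigned long seq, int on_list)
{
	return on_list ? gen_from_seq(seq) + 1 : 0;
}

/* Number of generations in the sliding window [min_seq, max_seq]. */
static unsigned long nr_gens(unsigned long min_seq, unsigned long max_seq)
{
	assert(min_seq <= max_seq);
	return max_seq - min_seq + 1;
}
```

Because max_seq and min_seq[] only ever increase, the window slides
forward and the truncated indices wrap around lrugen->lists[]; the
window never spans more than MAX_NR_GENS generations, so two live
generations never share an index.
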
  There are two conceptually independent processes (as in the
  manufacturing process): "the aging", which produces young generations,
  and "the eviction", which consumes old generations. They form a
  closed-loop system, i.e., "the page reclaim". Both processes can be
  invoked from userspace for the purposes of working set estimation and
  proactive reclaim. These features are required to optimize job
  scheduling (bin packing) in data centers. The variable size of the
  sliding window is designed for such use cases...

The design doc contains a few more details, and I'd be happy to
present it earlier if you think doing so would help:
https://lore.kernel.org/linux-mm/20220208081902.3550911-13-yuzhao@google.com/

> Explain the problem of MMU vs syscall references, and how tiering
> addresses this.

The previous patch also touched on this point:
https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/

  The protection of hot pages and the selection of cold pages are based
  on page access channels and patterns. There are two access channels:
  one through page tables and the other through file descriptors. The
  protection of the former channel is by design stronger because:
  1) The uncertainty in determining the access patterns of the former
     channel is higher due to the approximation of the accessed bit.
  2) The cost of evicting the former channel is higher due to the TLB
     flushes required and the likelihood of encountering the dirty bit.
  3) The penalty of underprotecting the former channel is higher because
     applications usually don't prepare themselves for major page faults
     like they do for blocked I/O. E.g., GUI applications commonly use
     dedicated I/O threads to avoid blocking the rendering threads.
  There are also two access patterns: one with temporal locality and the
  other without.
  For the reasons listed above, the former channel is assumed
  to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
  present, and the latter channel is assumed to follow the latter
  pattern unless outlying refaults have been observed.

> Explain the significance of refaults and how the algorithm responds to
> them. Not in terms of which running averages are updated, but in terms
> of user-visible behavior ("will start swapping (more)" etc.)

And this patch touched on how tiers would help:
  1) It removes the cost of activation in the buffered access path by
     inferring whether pages accessed multiple times through file
     descriptors are statistically hot and thus worth promoting in the
     eviction path.
  2) It takes pages accessed through page tables into account and avoids
     overprotecting pages accessed multiple times through file
     descriptors. (Pages accessed through page tables are in the first
     tier since N=0.)
  3) More tiers provide better protection for pages accessed more than
     twice through file descriptors, when under heavy buffered I/O
     workloads.

And the design doc:
https://lore.kernel.org/linux-mm/20220208081902.3550911-13-yuzhao@google.com/

  To select a type and a tier to evict from, it first compares min_seq[]
  to select the older type. If they are equal, it selects the type whose
  first tier has a lower refault percentage. The first tier contains
  single-use unmapped clean pages, which are the best bet.

> Express *intent*, how it's supposed to behave wrt workloads and memory
> pressure. The code itself will explain the how, its complexity etc.

Hmm... I'm not so sure about this part. It seems to me this is
equivalent to describing how it works.

> Most reviewers will understand the fundamental challenges of page
> reclaim. The difficulty is matching individual aspects of the problem
> space to your individual components and design choices you have made.
>
> Let us in on that thinking, please ;)

Agreed. I'm sure I haven't covered everything.
So I'm trying to figure out
what's important but missing/insufficient.

> > @@ -892,6 +892,50 @@ config ANON_VMA_NAME
> >  	  area from being merged with adjacent virtual memory areas due to the
> >  	  difference in their name.
> >
> > +# multigenerational LRU {
> > +config LRU_GEN
> > +	bool "Multigenerational LRU"
> > +	depends on MMU
> > +	# the following options can use up the spare bits in page flags
> > +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> > +	help
> > +	  A high performance LRU implementation for memory overcommit. See
> > +	  Documentation/admin-guide/mm/multigen_lru.rst and
> > +	  Documentation/vm/multigen_lru.rst for details.
>
> These files don't exist at this time, please introduce them before or
> when referencing them. If they document things introduced later in the
> patchset, please start with a minimal version of the file and update
> it as you extend the algorithm and add optimizations etc.
>
> It's really important to only reference previous patches, not later
> ones. This allows reviewers to read the patches linearly. Having to
> search for missing pieces in patches you haven't looked at yet is bad.

Okay, will remove this bit from this patch.

> > +config NR_LRU_GENS
> > +	int "Max number of generations"
> > +	depends on LRU_GEN
> > +	range 4 31
> > +	default 4
> > +	help
> > +	  Do not increase this value unless you plan to use working set
> > +	  estimation and proactive reclaim to optimize job scheduling in data
> > +	  centers.
> > +
> > +	  This option uses order_base_2(N+1) bits in page flags.
> > +
> > +config TIERS_PER_GEN
> > +	int "Number of tiers per generation"
> > +	depends on LRU_GEN
> > +	range 2 4
> > +	default 4
> > +	help
> > +	  Do not decrease this value unless you run out of spare bits in page
> > +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> > +
> > +	  This option uses N-2 bits in page flags.
>
> Linus had pointed out that we shouldn't ask these questions of the
> user.
> How do you pick numbers here? I'm familiar with workingset
> estimation and proactive reclaim usecases but I wouldn't know.
>
> Even if we removed the config option and hardcoded the number, this is
> a question for kernel developers: What does "4" mean? How would
> behavior differ if it were 3 or 5 instead? Presumably there is some
> sort of behavior gradient. "As you increase the number of
> generations/tiers, the user-visible behavior of the kernel will..."
> This should really be documented.
>
> I'd also reiterate Mel's point: Distribution kernels need to support
> the full spectrum of applications and production environments. Unless
> using non-defaults it's an extremely niche usecase (like compiling out
> BUG() calls) compile-time options are not the right choice. If we do
> need a tunable, it could make more sense to have a compile time upper
> limit (to determine page flag space) combined with a runtime knob?

I agree, and I think only time can answer all these questions :)

This effort is not in its final stage but at its very beginning. More
experiments and wider adoption are required to see how it's going to
evolve or where it leads. For now, there is just no way to tell whether
those values make sense for the majority or whether we need the runtime
knobs.

These are valid concerns, but TBH, I think they are minor ones because
most users need not worry about them -- this patchset has been used in
several downstream kernels and I haven't heard any complaints about
those options/values:
https://lore.kernel.org/linux-mm/20220208081902.3550911-1-yuzhao@google.com/

  1. Android ARCVM
  2. Arch Linux Zen
  3. Chrome OS
  4. Liquorix
  5. post-factum
  6. XanMod

Then why do we need these options? Because there are always exceptions,
as stated in the descriptions of those options. Sometimes we just can't
decide everything for users -- the answers lie in their use cases.

The bottom line is, if this starts bothering people or gets in
somebody's way, I'd be glad to revisit.
Fair enough? Thanks!
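
P.S. For reference, the page flag cost stated in the two Kconfig help
texts quoted above can be worked out with a small userspace sketch. The
function names are hypothetical, and order_base_2() here is only a
stand-in for the kernel macro of the same name, assumed to return
ceil(log2(n)).

```c
/* Userspace stand-in (assumption) for the kernel's order_base_2():
 * the smallest number of bits b such that (1 << b) >= n. */
static unsigned int order_base_2(unsigned long n)
{
	unsigned int bits = 0;

	while ((1UL << bits) < n)
		bits++;
	return bits;
}

/* NR_LRU_GENS help text: "uses order_base_2(N+1) bits in page flags".
 * The +1 leaves room for the value 0, meaning "not on a list". */
static unsigned int lru_gen_bits(unsigned long nr_gens)
{
	return order_base_2(nr_gens + 1);
}

/* TIERS_PER_GEN help text: "uses N-2 bits in page flags".
 * Valid for the Kconfig range 2..4 only. */
static unsigned int lru_tier_bits(unsigned int tiers_per_gen)
{
	return tiers_per_gen - 2;
}
```

With the defaults (4 generations, 4 tiers) this comes to 3 + 2 = 5 bits.
The maximum of 31 generations needs 5 bits for the gen counter alone,
and 2 tiers per generation need no extra bits.
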