From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 18492C433F5 for ; Tue, 12 Apr 2022 02:17:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245658AbiDLCTT (ORCPT ); Mon, 11 Apr 2022 22:19:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48798 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S244856AbiDLCTG (ORCPT ); Mon, 11 Apr 2022 22:19:06 -0400 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8E3EA33E12; Mon, 11 Apr 2022 19:16:50 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 3A7E2B81A03; Tue, 12 Apr 2022 02:16:49 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7BBF0C385A9; Tue, 12 Apr 2022 02:16:46 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linux-foundation.org; s=korg; t=1649729808; bh=vnuU0Vf5uUFZG5hzr0+pY/uKtAP80ZJBw7evrhTy9Ww=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=GMRsVBfP5Tznj8S0EBzpLTJ7kD4d1BJImxQDaroOwYYzhRccW5wXVfJ8vrSq5EvuE xs4wOEgfHdMOAImk859vK8lLKcW73v6C28jJARbn8SgTBM3smEZPoyKAs57ItPWna+ 1n6Q1bPX3uNaxHA2GQBvc0bv7AXEJpuBZVVo6J1Q= Date: Mon, 11 Apr 2022 19:16:45 -0700 From: Andrew Morton To: Yu Zhao Cc: Stephen Rothwell , linux-mm@kvack.org, Andi Kleen , Aneesh Kumar , Barry Song <21cnbao@gmail.com>, Catalin Marinas , Dave Hansen , Hillf Danton , Jens Axboe , Jesse Barnes , Johannes Weiner , Jonathan Corbet , Linus Torvalds , Matthew Wilcox , Mel Gorman , Michael Larabel , Michal Hocko , Mike Rapoport , Rik van Riel , Vlastimil Babka , Will Deacon , Ying Huang , linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, page-reclaim@google.com, x86@kernel.org, Brian Geffon , Jan Alexander Steffens , Oleksandr Natalenko , Steven Barrett , Suleiman Souhlal , Daniel Byrne , Donald Carr , =?ISO-8859-1?Q? "Holger_Hoffst=E4tte" ?= , Konstantin Kharlamov , Shuang Zhai , Sofia Trinh , Vaibhav Jain Subject: Re: [PATCH v10 14/14] mm: multi-gen LRU: design doc Message-Id: <20220411191645.91078752f29b82aea7ee508c@linux-foundation.org> In-Reply-To: <20220407031525.2368067-15-yuzhao@google.com> References: <20220407031525.2368067-1-yuzhao@google.com> <20220407031525.2368067-15-yuzhao@google.com> X-Mailer: Sylpheed 3.7.0 (GTK+ 2.24.33; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 6 Apr 2022 21:15:26 -0600 Yu Zhao wrote: > Add a design doc. > > > ... > > +Design overview > +=============== > +Objectives > +---------- > +The design objectives are: > + > +* Good representation of access recency > +* Try to profit from spatial locality > +* Fast paths to make obvious choices > +* Simple self-correcting heuristics > + > +The representation of access recency is at the core of all LRU > +implementations. In the multi-gen LRU, each generation represents a > +group of pages with similar access recency. Generations establish a > +common frame of reference and therefore help make better choices, > +e.g., between different memcgs on a computer or different computers in > +a data center (for job scheduling). Does MGLRU have any special treatment for used-once pages? > +Exploiting spatial locality improves efficiency when gathering the > +accessed bit. A rmap walk targets a single page and does not try to > +profit from discovering a young PTE. A page table walk can sweep all > +the young PTEs in an address space, but the address space can be too > +large to make a profit. The key is to optimize both methods and use > +them in combination. > + > +Fast paths reduce code complexity and runtime overhead. Unmapped pages > +do not require TLB flushes; clean pages do not require writeback. > +These facts are only helpful when other conditions, e.g., access > +recency, are similar. With generations as a common frame of reference, > +additional factors stand out. But obvious choices might not be good > +choices; thus self-correction is required. > + > +The benefits of simple self-correcting heuristics are self-evident. > +Again, with generations as a common frame of reference, this becomes > +attainable. Specifically, pages in the same generation can be > +categorized based on additional factors, and a feedback loop can > +statistically compare the refault percentages across those categories > +and infer which of them are better choices. > + > +Assumptions > +----------- > +The protection of hot pages and the selection of cold pages are based > +on page access channels and patterns. There are two access channels: > + > +* Accesses through page tables > +* Accesses through file descriptors > + > +The protection of the former channel is by design stronger because: > + > +1. The uncertainty in determining the access patterns of the former > + channel is higher due to the approximation of the accessed bit. > +2. The cost of evicting the former channel is higher due to the TLB > + flushes required and the likelihood of encountering the dirty bit. > +3. The penalty of underprotecting the former channel is higher because > + applications usually do not prepare themselves for major page > + faults like they do for blocked I/O. E.g., GUI applications > + commonly use dedicated I/O threads to avoid blocking the rendering > + threads. > + > +There are also two access patterns: > + > +* Accesses exhibiting temporal locality > +* Accesses not exhibiting temporal locality > + > +For the reasons listed above, the former channel is assumed to follow > +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is > +present, and the latter channel is assumed to follow the latter > +pattern unless outlying refaults have been observed. What about MADV_SEQUENTIAL? Or did we propogate that into the fd? > +Workflow overview > +================= > +Evictable pages are divided into multiple generations for each > +``lruvec``. The youngest generation number is stored in > +``lrugen->max_seq`` for both anon and file types as they are aged on > +an equal footing. The oldest generation numbers are stored in > +``lrugen->min_seq[]`` separately for anon and file types as clean file > +pages can be evicted regardless of swap constraints. These three > +variables are monotonically increasing. > + > > ... > > +Summary > +------- > +The multi-gen LRU can be disassembled into the following parts: > + > +* Generations > +* Page table walks > +* Rmap walks > +* Bloom filters > +* The PID controller > + > +The aging and the eviction is a producer-consumer model; specifically, > +the latter drives the former by the sliding window over generations. > +Within the aging, rmap walks drive page table walks by inserting hot > +densely populated page tables to the Bloom filters. Within the > +eviction, the PID controller uses refaults as the feedback to select > +types to evict and tiers to protect. It's cool to see a PID controller in there. How do we know that it converges well, that it doesn't exhibit instability, etc?