Date: Mon, 14 Feb 2022 12:28:56 +0200
From: Mike Rapoport
To: Yu Zhao
Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko, Andi Kleen,
 Aneesh Kumar, Barry Song <21cnbao@gmail.com>, Catalin Marinas,
 Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
 Linus Torvalds, Matthew Wilcox, Michael Larabel, Rik van Riel,
 Vlastimil Babka, Will Deacon, Ying Huang,
 linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, page-reclaim@google.com,
 x86@kernel.org, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
 Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
 Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, Sofia Trinh
Subject: Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
References: <20220208081902.3550911-1-yuzhao@google.com>
 <20220208081902.3550911-13-yuzhao@google.com>
In-Reply-To: <20220208081902.3550911-13-yuzhao@google.com>

Hi,

On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
>
> Signed-off-by: Yu Zhao
> Acked-by: Brian Geffon
> Acked-by: Jan Alexander Steffens (heftig)
> Acked-by: Oleksandr Natalenko
> Acked-by: Steven Barrett
> Acked-by: Suleiman Souhlal
> Tested-by: Daniel Byrne
> Tested-by: Donald Carr
> Tested-by: Holger Hoffstätte
> Tested-by: Konstantin Kharlamov
> Tested-by: Shuang Zhai
> Tested-by: Sofia Trinh
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++

Please consider splitting this patch into Documentation/admin-guide and
Documentation/vm parts. For now I only had time to review the admin-guide
part.

>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst
>
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..2cf5bae62036 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   multigen_lru
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..16a543c8b886
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,121 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigenerational LRU
> +=====================
> +
> +Quick start
> +===========

There is no explanation why one would want to use multigenerational LRU
until the next section. I think there should be an overview that explains
why users would want to enable multigenerational LRU.

> +Build configurations
> +--------------------
> +:Required: Set ``CONFIG_LRU_GEN=y``.

Maybe

  Set ``CONFIG_LRU_GEN=y`` to build the kernel with multigenerational LRU

> +
> +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> +  multigenerational LRU by default.
> +
> +Runtime configurations
> +----------------------
> +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> +  ``CONFIG_LRU_GEN_ENABLED=n``.
> +
> +This file accepts different values to enabled or disabled the
> +following features:

Maybe

  After multigenerational LRU is enabled, this file accepts different
  values to enable or disable the following features:

> +====== ========
> +Values Features
> +====== ========
> +0x0001 the multigenerational LRU

The multigenerational LRU what?

What will happen if I write 0x2 to this file?
Please consider splitting "enable" and "features" attributes.

> +0x0002 clear the accessed bit in leaf page table entries **in large
> +       batches**, when MMU sets it (e.g., on x86)

Is extra markup really needed here...

> +0x0004 clear the accessed bit in non-leaf page table entries **as
> +       well**, when MMU sets it (e.g., on x86)

... and here?

As for the descriptions, what is the user-visible effect of these features?
How are the different modes of clearing the accessed bit reflected in, say,
GUI responsiveness, database TPS, or probability of OOM?
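
To make the 0x2 question concrete, this is how I would expect a value to
decode if each row of the table is an independent bit. That is only my
reading of the table, not something the patch states, and the helper name
decode_lru_gen_enabled is made up for illustration:

```shell
#!/bin/sh
# Sketch only: map a value such as 0x0005 onto the rows of the quoted table,
# ASSUMING every bit can be toggled independently of the others.
# decode_lru_gen_enabled is a hypothetical helper, not part of the patch.
decode_lru_gen_enabled() {
    value=$(( $1 ))    # accepts decimal or 0x-prefixed hex
    if [ $(( value & 1 )) -ne 0 ]; then
        echo "0x0001 the multigenerational LRU"
    fi
    if [ $(( value & 2 )) -ne 0 ]; then
        echo "0x0002 batched clearing of the accessed bit in leaf PTEs"
    fi
    if [ $(( value & 4 )) -ne 0 ]; then
        echo "0x0004 clearing of the accessed bit in non-leaf PTEs"
    fi
    return 0
}

decode_lru_gen_enabled 0x0005
```

With 0x0005 this prints the 0x0001 and 0x0004 rows, matching the example in
the patch. With 0x0002 it prints only the second row, and the document does
not say whether that combination means anything while bit 0 is clear.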

> +[yYnN] apply to all the features above
> +====== ========
> +
> +E.g.,
> +::
> +
> +  echo y >/sys/kernel/mm/lru_gen/enabled
> +  cat /sys/kernel/mm/lru_gen/enabled
> +  0x0007
> +  echo 5 >/sys/kernel/mm/lru_gen/enabled
> +  cat /sys/kernel/mm/lru_gen/enabled
> +  0x0005
> +
> +Most users should enable or disable all the features unless some of
> +them have unforeseen side effects.
> +
> +Recipes
> +=======
> +Personal computers
> +------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multigenerational LRU offers thrashing prevention to
> +the majority of laptop and desktop users who don't have oomd.

I'd expect something like this paragraph in the overview.

> +
> +:Thrashing prevention: Write ``N`` to
> +  ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> +  ``N`` milliseconds from getting evicted. The OOM killer is triggered
> +  if this working set can't be kept in memory. Based on the average
> +  human detectable lag (~100ms), ``N=1000`` usually eliminates
> +  intolerable janks due to thrashing. Larger values like ``N=3000``
> +  make janks less noticeable at the risk of premature OOM kills.
> +
> +Data centers
> +------------
> +Data centers want to optimize job scheduling (bin packing) to improve
> +memory utilizations. Job schedulers need to estimate whether a server
> +can allocate a certain amount of memory for a new job, and this step
> +is known as working set estimation, which doesn't impact the existing
> +jobs running on this server. They also want to attempt freeing some
> +cold memory from the existing jobs, and this step is known as proactive
> +reclaim, which improves the chance of landing a new job successfully.

This paragraph also fits the overview.
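
Coming back to the thrashing prevention recipe: a short example next to the
text might help readers, in the same way the "enabled" file has one. A
minimal sketch, assuming the min_ttl_ms path from the quoted hunk;
set_min_ttl_ms and its optional second argument are made up for
illustration and are not part of the patch:

```shell
#!/bin/sh
# Sketch only: write the thrashing-prevention window N (in milliseconds) to
# the knob named in the quoted text. set_min_ttl_ms is a hypothetical helper;
# the optional second argument exists only so it can be tried on a scratch
# file instead of the real sysfs path.
set_min_ttl_ms() {
    ms=$1
    knob=${2:-/sys/kernel/mm/lru_gen/min_ttl_ms}
    case $ms in
        ''|*[!0-9]*)
            echo "min_ttl_ms must be a non-negative integer" >&2
            return 1
            ;;
    esac
    echo "$ms" >"$knob"
}
```

Run as root, `set_min_ttl_ms 1000` would request the one-second window the
text suggests for eliminating intolerable janks.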

> +
> +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> +  for working set estimation and proactive reclaim.

Please add a note that this is a build-time option.

> +
> +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following

Is the debugfs interface relevant only for data centers?

> +  format:
> +  ::
> +
> +    memcg memcg_id memcg_path
> +      node node_id
> +        min_gen birth_time anon_size file_size
> +        ...
> +        max_gen birth_time anon_size file_size
> +
> +  ``min_gen`` is the oldest generation number and ``max_gen`` is the
> +  youngest generation number. ``birth_time`` is in milliseconds.

It's unclear what the birth_time reference point is. Is it milliseconds
from system start, or is it measured some other way?

> +  ``anon_size`` and ``file_size`` are in pages. The youngest generation
> +  represents the group of the MRU pages and the oldest generation
> +  represents the group of the LRU pages. For working set estimation, a

Please spell out MRU and LRU fully.

> +  job scheduler writes to this file at a certain time interval to
> +  create new generations, and it ranks available servers based on the
> +  sizes of their cold memory defined by this time interval. For
> +  proactive reclaim, a job scheduler writes to this file before it
> +  tries to land a new job, and if it fails to materialize the cold
> +  memory without impacting the existing jobs, it retries on the next
> +  server according to the ranking result.

Is this knob only relevant for a job scheduler? Or can it be used in other
use cases as well?

> +
> +  This file accepts commands in the following subsections. Multiple
                               ^ described
> +  command lines are supported, so does concatenation with delimiters
> +  ``,`` and ``;``.
> +
> +  ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
> +  debugging.
> +
> +:Working set estimation: Write ``+ memcg_id node_id max_gen
> +  [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke
> +  the aging. It scans PTEs for hot pages and promotes them to the
> +  youngest generation ``max_gen``. Then it creates a new generation
> +  ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages
> +  when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead
> +  as well as the coverage when scanning PTEs.
> +
> +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
> +  [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
> +  eviction. It evicts generations less than or equal to ``min_gen``.
> +  ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
> +  ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
> +  ``nr_to_reclaim`` to limit the number of pages to evict.

I feel that /sys/kernel/debug/lru_gen is too overloaded.

> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 44365c4574a3..b48434300226 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
>     ksm
>     memory-model
>     mmu_notifier
> +   multigen_lru
>     numa
>     overcommit-accounting
>     page_migration

-- 
Sincerely yours,
Mike.