From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,
	USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id BA09AC433ED
	for <linux-kernel@archiver.kernel.org>; Thu, 20 May 2021 06:54:04 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 99A326108C
	for <linux-kernel@archiver.kernel.org>; Thu, 20 May 2021 06:54:04 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S230359AbhETGzY (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Thu, 20 May 2021 02:55:24 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37854 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229534AbhETGzX (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 20 May 2021 02:55:23 -0400
Received: from mail-qk1-x74a.google.com (mail-qk1-x74a.google.com [IPv6:2607:f8b0:4864:20::74a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2DB47C061574
        for <linux-kernel@vger.kernel.org>; Wed, 19 May 2021 23:54:01 -0700 (PDT)
Received: by mail-qk1-x74a.google.com with SMTP id z2-20020a3765020000b02903a5f51b1c74so684222qkb.7
        for <linux-kernel@vger.kernel.org>; Wed, 19 May 2021 23:54:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20161025;
        h=date:message-id:mime-version:subject:from:to:cc;
        bh=Y3hJAMwzbf34YQU8QX5BsSV2xoCmy36DYto5ZLIStJc=;
        b=r/V1aR1KHSQ2RwrGIEEbdDV0RqV+tdHJLBnCnPMLdI4quvTDua13dKOHpxS2Rc7bc4
         6ON9rpxOpEhBMPLS8798xqa4jQBTINTCKNlIi3TpaV8t/shwlViCb4Y9bZ4ng8VEsXp3
         H2s3DQbb47Iio7YrOnBahF4qBDJl2fkHL257Ao4wgzgG/ZCK2oy5dcipOFrEpQqPk5vO
         hhTC4Zr1DE3XI+Y+uTozfI8CoAtllv6qL31gAWcycyeN72teVQa9ilaeTdglxhCO9DVG
         BFkiZH+21Eo3M8PRz4OztnGgRtMvbgNnuUWZ68bnZkO4wMyL6mX2520HA9NQNkGSXLnP
         74Zg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:message-id:mime-version:subject:from:to:cc;
        bh=Y3hJAMwzbf34YQU8QX5BsSV2xoCmy36DYto5ZLIStJc=;
        b=h+lPKp7mQ6QF8fW3fzT7HQgoaLvOfkKtGjwFNvWMOi8UMz94CGWpTgC4tsEX0PenoK
         snCz9kDMIR35YO9Dlhz1Ci/04htNK9p+rnvGn7ri/Oin5fFeyVQ15qh33Bgut5m3SKR2
         imeFBLkWsXGtFd23XCBmjIcNrqZA0LxhIwoYCbrVWSq5H29Eo6C9ab0gmJ1oY0DCPOL/
         Fi8M2neMwLN09EebwZONh8AGuP0XiL0oSnAGDZAhaaAimfHrPBMMYCrxpjnaGxPG2hY0
         gvju/bIag6Ug8urHdAAGWsdLaNIsdrIKWlaL76FjcULVwdAARKQifiMMTwJ2JU5y5jMG
         OKRg==
X-Gm-Message-State: AOAM5322gu+Tvm1pCjTiKdWMNb3cz1Z6+VCfYHkB7vDvNRYItvu08gEA
        /W/WlY6Lc6/4O5nrreOspbq5n77XobE=
X-Google-Smtp-Source: ABdhPJy+4EmI1VvFDhlB3errX+0774OdClFY8nQyFqDe9Pqq8FOdLBnXamEbn+N9M1F/HG6sJ6Mw/n7qw/8=
X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:595d:62ee:f08:8e83])
 (user=yuzhao job=sendgmr) by 2002:a0c:e4cd:: with SMTP id g13mr3727631qvm.34.1621493640278;
 Wed, 19 May 2021 23:54:00 -0700 (PDT)
Date:   Thu, 20 May 2021 00:53:41 -0600
Message-Id: <20210520065355.2736558-1-yuzhao@google.com>
Mime-Version: 1.0
X-Mailer: git-send-email 2.31.1.751.gd2f1c929bd-goog
Subject: [PATCH v3 00/14] Multigenerational LRU Framework
From:   Yu Zhao <yuzhao@google.com>
To:     linux-mm@kvack.org
Cc:     Alex Shi <alexs@kernel.org>, Andi Kleen <ak@linux.intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Dave Chinner <david@fromorbit.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Donald Carr <sirspudd@gmail.com>,
        Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Jonathan Corbet <corbet@lwn.net>,
        Joonsoo Kim <iamjoonsoo.kim@lge.com>,
        Konstantin Kharlamov <hi-angel@yandex.ru>,
        Marcus Seyfarth <m.seyfarth@gmail.com>,
        Matthew Wilcox <willy@infradead.org>,
        Mel Gorman <mgorman@suse.de>,
        Miaohe Lin <linmiaohe@huawei.com>,
        Michael Larabel <michael@michaellarabel.com>,
        Michal Hocko <mhocko@suse.com>,
        Michel Lespinasse <michel@lespinasse.org>,
        Rik van Riel <riel@surriel.com>,
        Roman Gushchin <guro@fb.com>,
        Tim Chen <tim.c.chen@linux.intel.com>,
        Vlastimil Babka <vbabka@suse.cz>,
        Yang Shi <shy828301@gmail.com>,
        Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
        linux-kernel@vger.kernel.org, lkp@lists.01.org,
        page-reclaim@google.com, Yu Zhao <yuzhao@google.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

What's new in v3
================
1) Fixed a bug reported by the Arch Linux kernel team:
   https://github.com/zen-kernel/zen-kernel/issues/207
2) Rebased to v5.13-rc2.

Highlights from v2
==================
Konstantin Kharlamov <hi-angel@yandex.ru> reported:
  My success story: I have Archlinux with 8G RAM + zswap + swap. While
  developing, I have lots of apps opened such as multiple LSP-servers
  for different langs, chats, two browsers, etc. Usually, my system
  gets quickly to a point of SWAP-storms, where I have to kill
  LSP-servers, restart browsers to free memory, etc, otherwise the
  system lags heavily and is barely usable.
 
  1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
  patchset, and I started up by opening lots of apps to create memory
  pressure, and worked for a day like this. Till now I had *not a
  single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
  getting to the point of 3G in SWAP before without a single
  SWAP-storm.

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and
often makes poor choices about what to evict. We would like to offer
an alternative framework that is performant, versatile and
straightforward.

Repo
====
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/53/1253/1

Problems
========
Notion of active/inactive
-------------------------
Data centers need to predict whether a job can successfully land on a
machine without actually impacting the existing jobs. The granularity
of the active/inactive split is too coarse to be useful for job
schedulers to make such decisions. In addition, data centers need to
monitor their memory utilization for horizontal scaling. The
active/inactive split cannot give any insight into a pool of machines
because aggregating it across multiple machines without a common
frame of reference yields no meaningful results.

Phones and laptops need to make good choices about what to evict,
since they are more sensitive to major faults and power
consumption. Major faults can cause "janks" (slow UI renderings) and
negatively impact user experience. The selection between anon and file
types has been suboptimal because direct comparisons between them are
infeasible based on the notion of active/inactive. On phones and
laptops, executable pages are frequently evicted despite the fact that
there are many less recently used anon pages. Conversely, on
workstations building large projects, anon pages are occasionally
swapped out while page cache contains many less recently used pages.

Fundamentally, the notion of active/inactive has very limited ability
to measure temporal locality.

Incremental scans via rmap
--------------------------
Each incremental scan picks up where the last scan left off and
stops after it has found a handful of unreferenced pages. For
workloads using a large amount of anon memory, incremental scans lose
the advantage under sustained memory pressure due to high ratios of
the number of scanned pages to the number of reclaimed pages. On top
of this, the rmap has complex data structures. The combined effect is
typically a high amount of CPU usage in the reclaim path.

Simply put, incremental scans via rmap have no regard for spatial
locality.

Solutions
=========
Notion of generation numbers
----------------------------
The notion of generation numbers introduces a temporal dimension. Each
generation is a dot on the timeline and it includes all pages that
have been referenced since it was created.

Given an lruvec, scans of anon and file types and selections between
them are all based on direct comparisons of generation numbers, which
are simple and yet effective.

A larger number of pages can be spread out across a configurable
number of generations, which are associated with timestamps and
therefore aggregatable. This is specifically designed for data centers
that require working set estimation and proactive reclaim.

Differential scans via page tables
----------------------------------
Each differential scan discovers all pages that have been referenced
since the last scan. It walks the mm_struct list associated with an
lruvec to scan page tables of processes that have been scheduled since
the last scan. The cost of each differential scan is roughly
proportional to the number of referenced pages it discovers. Page
tables usually have good memory locality. The end result is generally
a significant reduction in CPU usage for workloads using a large
amount of anon memory.

For workloads that have extremely sparse page tables, it is still
possible to fall back to incremental scans via rmap.

Framework
=========
For each lruvec, evictable pages are divided into multiple
generations. The youngest generation number is stored in
lrugen->max_seq for both anon and file types as they are aged on an
equal footing. The oldest generation numbers are stored in
lrugen->min_seq[2] separately for anon and file types as clean file
pages can be evicted regardless of may_swap or may_writepage. These
three variables are monotonically increasing. Generation numbers are
truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into
page->flags. The sliding window technique is used to prevent truncated
generation numbers from overlapping. Each truncated generation number
is an index to
lrugen->lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]. Evictable
pages are added to the per-zone lists indexed by lrugen->max_seq or
lrugen->min_seq[2] (modulo MAX_NR_GENS), depending on their types.
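
The indexing described above can be sketched in standalone userspace
C. This is not the patchset's code: MAX_NR_GENS is set to 4 purely for
illustration, and the helper names (order_base_2_ul, gen_width,
gen_from_seq, window_ok) are hypothetical. order_base_2_ul()
reimplements ceil(log2(n)), which is what the kernel's order_base_2()
computes.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_GENS 4UL	/* illustrative value, an assumption */

/* ceil(log2(n)) for n >= 1, mirroring what order_base_2() computes. */
static unsigned int order_base_2_ul(unsigned long n)
{
	unsigned int bits = 0;

	assert(n >= 1);
	while ((1UL << bits) < n)
		bits++;
	return bits;
}

/* Bits needed in page->flags for a truncated generation number. */
static unsigned int gen_width(void)
{
	return order_base_2_ul(MAX_NR_GENS + 1);
}

/* Truncate a monotonically increasing sequence number to a list index. */
static unsigned long gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/*
 * Sliding window: truncated indices stay distinct as long as fewer
 * than MAX_NR_GENS sequence numbers separate min_seq from max_seq.
 */
static bool window_ok(unsigned long max_seq, unsigned long min_seq)
{
	return min_seq <= max_seq && max_seq - min_seq < MAX_NR_GENS;
}
```

With these assumed values, sequence numbers 4..7 map to indices 0..3
without collision, whereas a window spanning 4..8 would map both 4 and
8 to index 0.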

Each generation is then divided into multiple tiers. Tiers represent
levels of usage from file descriptors only. Pages accessed N times via
file descriptors belong to tier order_base_2(N). Each generation
contains at most MAX_NR_TIERS tiers, and they require additional
MAX_NR_TIERS-2 bits in page->flags. In contrast to moving across
generations which requires the lru lock for the list operations,
moving across tiers only involves an atomic operation on page->flags
and therefore has a negligible cost. A feedback loop modeled after the
PID controller monitors the refault rates across all tiers and decides
when to activate pages from which tiers in the reclaim path.
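
As a rough illustration of the tier mapping, the sketch below computes
order_base_2(N) for an access count N in userspace C. MAX_NR_TIERS is
set to 4 for illustration only, tier_from_accesses() is a hypothetical
name, and clamping to the last tier is an assumption drawn from "at
most MAX_NR_TIERS tiers"; the patchset may handle saturation
differently.

```c
#include <assert.h>

#define MAX_NR_TIERS 4U	/* illustrative value, an assumption */

/*
 * Map the number of accesses via file descriptors to a tier:
 * tier = ceil(log2(n)), clamped to MAX_NR_TIERS - 1 (assumed).
 */
static unsigned int tier_from_accesses(unsigned long n)
{
	unsigned int tier = 0;

	assert(n >= 1);
	while (tier < MAX_NR_TIERS - 1 && (1UL << tier) < n)
		tier++;
	return tier;
}
```

Under these assumptions, 1 access lands in tier 0, 2 in tier 1, 3 or 4
in tier 2, and 5 or more in tier 3.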

The framework comprises two conceptually independent components: the
aging and the eviction, which can be invoked separately from user
space for the purpose of working set estimation and proactive reclaim.

Aging
-----
The aging produces young generations. Given an lruvec, the aging scans
page tables for referenced pages of this lruvec. Upon finding one, the
aging updates its generation number to max_seq. After each round of
scan, the aging increments max_seq.

The aging maintains either a system-wide mm_struct list or per-memcg
mm_struct lists, and it only scans page tables of processes that have
been scheduled since the last scan.

The aging is due when both min_seq[2] values reach max_seq-1, assuming
both anon and file types are reclaimable.
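
The trigger condition can be written as a small predicate. This is a
userspace sketch, not the patchset's code: TYPE_ANON, TYPE_FILE and
aging_is_due() are hypothetical names, and the case where only one
type is reclaimable is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

enum { TYPE_ANON, TYPE_FILE, ANON_AND_FILE };	/* names assumed */

/*
 * True when both the anon and the file min_seq have reached
 * max_seq - 1. Written as min_seq + 1 >= max_seq so that the
 * comparison cannot underflow when max_seq is 0.
 */
static bool aging_is_due(unsigned long max_seq,
			 const unsigned long min_seq[ANON_AND_FILE])
{
	assert(min_seq[TYPE_ANON] <= max_seq);
	assert(min_seq[TYPE_FILE] <= max_seq);

	return min_seq[TYPE_ANON] + 1 >= max_seq &&
	       min_seq[TYPE_FILE] + 1 >= max_seq;
}
```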

Eviction
--------
The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either of min_seq[2].
It first tries to select a type based on the values of min_seq[2].
When anon and file types are both available from the same generation,
it selects the one that has a lower refault rate.

During a scan, the eviction sorts pages according to their new
generation numbers, if the aging has found them referenced. It also
moves pages from the tiers that have higher refault rates than tier 0
to the next generation.

When it finds all the per-zone lists of a selected type are empty, the
eviction increments min_seq[2] indexed by this selected type.
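
The type selection can be sketched as follows. This is a simplified
userspace model under stated assumptions: the type with the smaller
min_seq (the older generation) is preferred, refault rates are
compared as refaulted/evicted ratios, and a tie goes to the file type.
The struct, the enum names and select_type() are hypothetical and do
not model the feedback loop or the swappiness handling.

```c
#include <assert.h>

enum { TYPE_ANON, TYPE_FILE, ANON_AND_FILE };	/* names assumed */

/* Hypothetical per-type counters used to derive a refault rate. */
struct refault_stat {
	unsigned int refaulted;
	unsigned int evicted;
};

/* Return the type to evict from: TYPE_ANON or TYPE_FILE. */
static int select_type(const unsigned long min_seq[ANON_AND_FILE],
		       const struct refault_stat stat[ANON_AND_FILE])
{
	unsigned long long anon, file;

	assert(stat[TYPE_ANON].evicted > 0 && stat[TYPE_FILE].evicted > 0);

	/* Different generations: take the older one (assumed). */
	if (min_seq[TYPE_ANON] < min_seq[TYPE_FILE])
		return TYPE_ANON;
	if (min_seq[TYPE_FILE] < min_seq[TYPE_ANON])
		return TYPE_FILE;

	/*
	 * Same generation: pick the lower refault rate. Cross-multiply
	 * to compare the two ratios without floating point.
	 */
	anon = (unsigned long long)stat[TYPE_ANON].refaulted *
	       stat[TYPE_FILE].evicted;
	file = (unsigned long long)stat[TYPE_FILE].refaulted *
	       stat[TYPE_ANON].evicted;

	return anon < file ? TYPE_ANON : TYPE_FILE;
}
```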

Use cases
=========
High anon workloads
-------------------
Our real-world benchmark that browses popular websites in multiple
Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% less
PSI (full).

Without this patchset, the profile of kswapd looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

With this patchset, it looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at the 99th
percentile and the number of refaults is reduced by 7%. Both metrics
are important to phones and laptops as they are highly correlated with
user experience.

High page cache workloads
-------------------------
Tiers are specifically designed to improve the performance of page
cache under memory pressure. The fio/io_uring benchmark shows a 14%
increase in IOPS for random accesses in buffered I/O mode.

Without this patchset, the profile of fio/io_uring looks like:
  Children  Self   Symbol
  -----------------------------------
  12.03%    0.03%  __page_cache_alloc
   6.53%    0.83%  shrink_active_list
   2.53%    0.44%  mark_page_accessed

With this patchset, it looks like:
  Children  Self   Symbol
  -----------------------------------
   9.45%    0.03%  __page_cache_alloc
   0.52%    0.46%  mark_page_accessed

Working set estimation
----------------------
User space can invoke the aging by writing "+ memcg_id node_id gen
[swappiness]" to /sys/kernel/debug/lru_gen. This debugfs interface
also provides the birth time and the size of each generation.
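
A minimal sketch of building the aging command line in userspace C.
format_aging_cmd() is a hypothetical helper, and the convention that a
negative swappiness omits the optional field is an assumption of this
sketch, not part of the interface. The helper only formats the string;
it does not touch debugfs.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format "+ memcg_id node_id gen [swappiness]" into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int format_aging_cmd(char *buf, size_t len, unsigned int memcg_id,
			    unsigned int node_id, unsigned long gen,
			    int swappiness)
{
	int n;

	assert(buf && len > 0);

	if (swappiness < 0)
		n = snprintf(buf, len, "+ %u %u %lu",
			     memcg_id, node_id, gen);
	else
		n = snprintf(buf, len, "+ %u %u %lu %d",
			     memcg_id, node_id, gen, swappiness);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The resulting string would then be written to
/sys/kernel/debug/lru_gen, which requires debugfs to be mounted and
sufficient privileges.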

For example, given a pool of machines, a job scheduler periodically
invokes the aging to estimate the working set of each machine. It then
ranks the machines based on the sizes of their working sets and
selects the most suitable ones to land new jobs.

Proactive reclaim
-----------------
User space can invoke the eviction by writing "- memcg_id node_id gen
[swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen. Multiple
command lines are supported, as is concatenation with delimiters.
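
The eviction command has two optional trailing fields. The userspace
sketch below treats them as positional, so nr_to_reclaim can only be
given together with swappiness; that reading of the bracket notation,
the negative-value convention for omitting a field, and the name
format_evict_cmd() are all assumptions of this sketch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format "- memcg_id node_id gen [swappiness] [nr_to_reclaim]".
 * A negative swappiness or nr_to_reclaim omits that field.
 * Returns the string length, or -1 on invalid input or overflow.
 */
static int format_evict_cmd(char *buf, size_t len, unsigned int memcg_id,
			    unsigned int node_id, unsigned long gen,
			    int swappiness, long nr_to_reclaim)
{
	int n;

	assert(buf && len > 0);

	/* Positional fields: nr_to_reclaim needs swappiness first. */
	if (swappiness < 0 && nr_to_reclaim >= 0)
		return -1;

	if (swappiness < 0)
		n = snprintf(buf, len, "- %u %u %lu",
			     memcg_id, node_id, gen);
	else if (nr_to_reclaim < 0)
		n = snprintf(buf, len, "- %u %u %lu %d",
			     memcg_id, node_id, gen, swappiness);
	else
		n = snprintf(buf, len, "- %u %u %lu %d %ld",
			     memcg_id, node_id, gen, swappiness,
			     nr_to_reclaim);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```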

For example, a job scheduler can invoke the eviction if it anticipates
new jobs. The memory freed by proactive reclaim may help meet SLAs
when the new jobs actually land.

Yu Zhao (14):
  include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
    !CONFIG_MEMCG
  include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
  include/linux/cgroup.h: export cgroup_mutex
  mm, x86: support the access bit on non-leaf PMD entries
  mm/vmscan.c: refactor shrink_node()
  mm/workingset.c: refactor pack_shadow() and unpack_shadow()
  mm: multigenerational lru: groundwork
  mm: multigenerational lru: activation
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: aging
  mm: multigenerational lru: eviction
  mm: multigenerational lru: user interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst        |    1 +
 Documentation/vm/multigen_lru.rst |  143 ++
 arch/Kconfig                      |    9 +
 arch/x86/Kconfig                  |    1 +
 arch/x86/include/asm/pgtable.h    |    2 +-
 arch/x86/mm/pgtable.c             |    5 +-
 fs/exec.c                         |    2 +
 fs/fuse/dev.c                     |    3 +-
 include/linux/cgroup.h            |   15 +-
 include/linux/memcontrol.h        |    7 +-
 include/linux/mm.h                |    2 +
 include/linux/mm_inline.h         |  234 +++
 include/linux/mm_types.h          |  107 ++
 include/linux/mmzone.h            |  117 ++
 include/linux/nodemask.h          |    1 +
 include/linux/page-flags-layout.h |   19 +-
 include/linux/page-flags.h        |    4 +-
 include/linux/pgtable.h           |    4 +-
 include/linux/swap.h              |    4 +-
 kernel/bounds.c                   |    6 +
 kernel/events/uprobes.c           |    2 +-
 kernel/exit.c                     |    1 +
 kernel/fork.c                     |   10 +
 kernel/kthread.c                  |    1 +
 kernel/sched/core.c               |    2 +
 mm/Kconfig                        |   58 +
 mm/huge_memory.c                  |    5 +-
 mm/khugepaged.c                   |    2 +-
 mm/memcontrol.c                   |   28 +
 mm/memory.c                       |   10 +-
 mm/migrate.c                      |    2 +-
 mm/mm_init.c                      |    6 +-
 mm/mmzone.c                       |    2 +
 mm/rmap.c                         |    6 +
 mm/swap.c                         |   22 +-
 mm/swapfile.c                     |    6 +-
 mm/userfaultfd.c                  |    2 +-
 mm/vmscan.c                       | 2638 ++++++++++++++++++++++++++++-
 mm/workingset.c                   |  169 +-
 39 files changed, 3498 insertions(+), 160 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

-- 
2.31.1.751.gd2f1c929bd-goog


From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=z6PN=KP=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,
	USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id B2F29C43460
	for <linux-mm@archiver.kernel.org>; Thu, 20 May 2021 06:54:03 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 37D246108C
	for <linux-mm@archiver.kernel.org>; Thu, 20 May 2021 06:54:03 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 37D246108C
Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 70D646B006C; Thu, 20 May 2021 02:54:02 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 6BD496B006E; Thu, 20 May 2021 02:54:02 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 5367E6B0070; Thu, 20 May 2021 02:54:02 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0250.hostedemail.com [216.40.44.250])
	by kanga.kvack.org (Postfix) with ESMTP id 172F76B006C
	for <linux-mm@kvack.org>; Thu, 20 May 2021 02:54:02 -0400 (EDT)
Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id 97D866D8E
	for <linux-mm@kvack.org>; Thu, 20 May 2021 06:54:01 +0000 (UTC)
X-FDA: 78160694682.25.BFE297D
Received: from mail-qk1-f202.google.com (mail-qk1-f202.google.com [209.85.222.202])
	by imf16.hostedemail.com (Postfix) with ESMTP id 025FE8019116
	for <linux-mm@kvack.org>; Thu, 20 May 2021 06:53:59 +0000 (UTC)
Received: by mail-qk1-f202.google.com with SMTP id s4-20020a3790040000b02902fa7aa987e8so11715433qkd.14
        for <linux-mm@kvack.org>; Wed, 19 May 2021 23:54:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20161025;
        h=date:message-id:mime-version:subject:from:to:cc;
        bh=Y3hJAMwzbf34YQU8QX5BsSV2xoCmy36DYto5ZLIStJc=;
        b=r/V1aR1KHSQ2RwrGIEEbdDV0RqV+tdHJLBnCnPMLdI4quvTDua13dKOHpxS2Rc7bc4
         6ON9rpxOpEhBMPLS8798xqa4jQBTINTCKNlIi3TpaV8t/shwlViCb4Y9bZ4ng8VEsXp3
         H2s3DQbb47Iio7YrOnBahF4qBDJl2fkHL257Ao4wgzgG/ZCK2oy5dcipOFrEpQqPk5vO
         hhTC4Zr1DE3XI+Y+uTozfI8CoAtllv6qL31gAWcycyeN72teVQa9ilaeTdglxhCO9DVG
         BFkiZH+21Eo3M8PRz4OztnGgRtMvbgNnuUWZ68bnZkO4wMyL6mX2520HA9NQNkGSXLnP
         74Zg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:message-id:mime-version:subject:from:to:cc;
        bh=Y3hJAMwzbf34YQU8QX5BsSV2xoCmy36DYto5ZLIStJc=;
        b=K08q9EHdn+dCcjxsJL0gfLszsnpSXBu2RpJ2Hp9KS6PIweqLEU/off2x2zKQuG6tjX
         0tM3XdRDB3FPvv4IEVEjw6vjKyC8+h0IBNomS5ObDV5+uiJagYLnjho0sPpV4psMCaNF
         eiGj58gG2omgwoSI64AVi4hbpJFtGGnq2lpB6KtM4pQ4/lPmnrDWTEpoMe8WyfEk5Le1
         WOdPO1TL9PPwcBT8FTDww/j/4ngO5ZP1D/UiH2g7e+lPCPcUjIE6zKJ6KtymI1vd3tkz
         BxaaTn3nPosYQsY952zgMHwbKj38UAkuCEFejaqXPNvg1DppvqwkZx/EtoNEcXvq6jGM
         VpQg==
X-Gm-Message-State: AOAM533LYTnXNZMVvu6Cl10IA/3Zd/Y6a2FjW45nl+wspAL07Qofbb5n
	LFZ5BQHtgqAKAvwc4IEP4BVbrWesv5PwfwNRudcwObryEsYatqXnw7XBIanG6BT84M39PMfLBlE
	0btkI3/Thny5/M9edqqycu6VmvMPOfXPJqTyYEJfonrofvR5WgPP9JVqT
X-Google-Smtp-Source: ABdhPJy+4EmI1VvFDhlB3errX+0774OdClFY8nQyFqDe9Pqq8FOdLBnXamEbn+N9M1F/HG6sJ6Mw/n7qw/8=
X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:595d:62ee:f08:8e83])
 (user=yuzhao job=sendgmr) by 2002:a0c:e4cd:: with SMTP id g13mr3727631qvm.34.1621493640278;
 Wed, 19 May 2021 23:54:00 -0700 (PDT)
Date: Thu, 20 May 2021 00:53:41 -0600
Message-Id: <20210520065355.2736558-1-yuzhao@google.com>
Mime-Version: 1.0
X-Mailer: git-send-email 2.31.1.751.gd2f1c929bd-goog
Subject: [PATCH v3 00/14] Multigenerational LRU Framework
From: Yu Zhao <yuzhao@google.com>
To: linux-mm@kvack.org
Cc: Alex Shi <alexs@kernel.org>, Andi Kleen <ak@linux.intel.com>, 
	Andrew Morton <akpm@linux-foundation.org>, Dave Chinner <david@fromorbit.com>, 
	Dave Hansen <dave.hansen@linux.intel.com>, Donald Carr <sirspudd@gmail.com>, 
	Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>, 
	Jonathan Corbet <corbet@lwn.net>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, 
	Konstantin Kharlamov <hi-angel@yandex.ru>, Marcus Seyfarth <m.seyfarth@gmail.com>, 
	Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>, Miaohe Lin <linmiaohe@huawei.com>, 
	Michael Larabel <michael@michaellarabel.com>, Michal Hocko <mhocko@suse.com>, 
	Michel Lespinasse <michel@lespinasse.org>, Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>, 
	Tim Chen <tim.c.chen@linux.intel.com>, Vlastimil Babka <vbabka@suse.cz>, 
	Yang Shi <shy828301@gmail.com>, Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>, 
	linux-kernel@vger.kernel.org, lkp@lists.01.org, page-reclaim@google.com, 
	Yu Zhao <yuzhao@google.com>
Content-Type: text/plain; charset="UTF-8"
X-Rspamd-Queue-Id: 025FE8019116
Authentication-Results: imf16.hostedemail.com;
	dkim=pass header.d=google.com header.s=20161025 header.b="r/V1aR1K";
	dmarc=pass (policy=reject) header.from=google.com;
	spf=pass (imf16.hostedemail.com: domain of 3iAemYAYKCDsvrweXldlldib.Zljifkru-jjhsXZh.lod@flex--yuzhao.bounces.google.com designates 209.85.222.202 as permitted sender) smtp.mailfrom=3iAemYAYKCDsvrweXldlldib.Zljifkru-jjhsXZh.lod@flex--yuzhao.bounces.google.com
X-Rspamd-Server: rspam03
X-Stat-Signature: tf5wnyjhiyq7cgfdnk4sdpp6bff77bzi
X-HE-Tag: 1621493639-517233
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

What's new in v3
================
1) Fixed a bug reported by the Arch Linux kernel team:
   https://github.com/zen-kernel/zen-kernel/issues/207
2) Rebased to v5.13-rc2.

Highlights from v2
==================
Konstantin Kharlamov <hi-angel@yandex.ru> reported:
  My success story: I have Archlinux with 8G RAM + zswap + swap. While
  developing, I have lots of apps opened such as multiple LSP-servers
  for different langs, chats, two browsers, etc. Usually, my system
  gets quickly to a point of SWAP-storms, where I have to kill
  LSP-servers, restart browsers to free memory, etc, otherwise the
  system lags heavily and is barely usable.
 
  1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
  patchset, and I started up by opening lots of apps to create memory
  pressure, and worked for a day like this. Till now I had *not a
  single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
  getting to the point of 3G in SWAP before without a single
  SWAP-storm.

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and
often making poor choices about what to evict. We would like to offer
an alternative framework that is performant, versatile and
straightforward.

Repo
====
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/53/1253/1

Problems
========
Notion of active/inactive
-------------------------
Data centers need to predict whether a job can successfully land on a
machine without actually impacting the existing jobs. The granularity
of the active/inactive is too coarse to be useful for job schedulers
to make such decisions. In addition, data centers need to monitor
their memory utilization for horizontal scaling. The active/inactive
cannot give any insight into a pool of machines because aggregating
them across multiple machines without a common frame of reference
yields no meaningful results.

Phones and laptops need to make good choices about what to evict,
since they are more sensitive to the major faults and the power
consumption. Major faults can cause "janks" (slow UI renderings) and
negatively impact user experience. The selection between anon and file
types has been suboptimal because direct comparisons between them are
infeasible based on the notion of active/inactive. On phones and
laptops, executable pages are frequently evicted despite the fact that
there are many less recently used anon pages. Conversely, on
workstations building large projects, anon pages are occasionally
swapped out while page cache contains many less recently used pages.

Fundamentally, the notion of active/inactive has very limited ability
to measure temporal locality.

Incremental scans via rmap
--------------------------
Each incremental scan picks up at where the last scan left off and
stops after it has found a handful of unreferenced pages. For
workloads using a large amount of anon memory, incremental scans lose
the advantage under sustained memory pressure due to high ratios of
the number of scanned pages to the number of reclaimed pages. On top
of this, the rmap has complex data structures. And the combined
effects typically result in a high amount of CPU usage in the reclaim
path.

Simply put, incremental scans via rmap have no regard for spatial
locality.

Solutions
=========
Notion of generation numbers
----------------------------
The notion of generation numbers introduces a temporal dimension. Each
generation is a dot on the timeline and it includes all pages that
have been referenced since it was created.

Given an lruvec, scans of anon and file types and selections between
them are all based on direct comparisons of generation numbers, which
are simple and yet effective.

A larger number of pages can be spread out across a configurable
number of generations, which are associated with timestamps and
therefore aggregatable. This is specifically designed for data centers
that require working set estimation and proactive reclaim.

Differential scans via page tables
----------------------------------
Each differential scan discovers all pages that have been referenced
since the last scan. It walks the mm_struct list associated with an
lruvec to scan page tables of processes that have been scheduled since
the last scan. The cost of each differential scan is roughly
proportional to the number of referenced pages it discovers. Page
tables usually have good memory locality. The end result is generally
a significant reduction in CPU usage, for workloads using a large
amount of anon memory.

For workloads that have extremely sparse page tables, it is still
possible to fall back to incremental scans via rmap.

Framework
=========
For each lruvec, evictable pages are divided into multiple
generations. The youngest generation number is stored in
lrugen->max_seq for both anon and file types as they are aged on an
equal footing. The oldest generation numbers are stored in
lrugen->min_seq[2] separately for anon and file types as clean file
pages can be evicted regardless of may_swap or may_writepage. These
three variables are monotonically increasing. Generation numbers are
truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into
page->flags. The sliding window technique is used to prevent truncated
generation numbers from overlapping. Each truncated generation number
is an index to
lrugen->lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]. Evictable
pages are added to the per-zone lists indexed by lrugen->max_seq or
lrugen->min_seq[2] (modulo MAX_NR_GENS), depending on their types.

Each generation is then divided into multiple tiers. Tiers represent
levels of usage from file descriptors only. Pages accessed N times via
file descriptors belong to tier order_base_2(N). Each generation
contains at most MAX_NR_TIERS tiers, and they require additional
MAX_NR_TIERS-2 bits in page->flags. In contrast to moving across
generations which requires the lru lock for the list operations,
moving across tiers only involves an atomic operation on page->flags
and therefore has a negligible cost. A feedback loop modeled after the
PID controller monitors the refault rates across all tiers and decides
when to activate pages from which tiers in the reclaim path.

The framework comprises two conceptually independent components: the
aging and the eviction, which can be invoked separately from user
space for the purpose of working set estimation and proactive reclaim.

Aging
-----
The aging produces young generations. Given an lruvec, the aging scans
page tables for referenced pages of this lruvec. Upon finding one, the
aging updates its generation number to max_seq. After each round of
scan, the aging increments max_seq.

The aging maintains either a system-wide mm_struct list or per-memcg
mm_struct lists, and it only scans page tables of processes that have
been scheduled since the last scan.

The aging is due when both of min_seq[2] reaches max_seq-1, assuming
both anon and file types are reclaimable.

Eviction
--------
The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either of min_seq[2].
It first tries to select a type based on the values of min_seq[2].
When anon and file types are both available from the same generation,
it selects the one that has a lower refault rate.

During a scan, the eviction sorts pages according to their new
generation numbers, if the aging has found them referenced. It also
moves pages from the tiers that have higher refault rates than tier 0
to the next generation.

When it finds all the per-zone lists of a selected type are empty, the
eviction increments min_seq[2] indexed by this selected type.

Use cases
=========
High anon workloads
-------------------
Our real-world benchmark that browses popular websites in multiple
Chrome tabs shows 51% less CPU usage from kswapd and 52% less PSI
(full).

Without this patchset, the profile of kswapd looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

With this patchset, it looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at the 99th
percentile and the number of refaults is reduced by 7%. Both metrics
are important to phones and laptops as they are highly correlated
with user experience.

High page cache workloads
-------------------------
Tiers are specifically designed to improve the performance of page
cache under memory pressure. The fio/io_uring benchmark shows a 14%
increase in IOPS for random accesses in buffered I/O mode.

Without this patchset, the profile of fio/io_uring looks like:
  Children  Self   Symbol
  -----------------------------------
  12.03%    0.03%  __page_cache_alloc
   6.53%    0.83%  shrink_active_list
   2.53%    0.44%  mark_page_accessed

With this patchset, it looks like:
  Children  Self   Symbol
  -----------------------------------
   9.45%    0.03%  __page_cache_alloc
   0.52%    0.46%  mark_page_accessed

Working set estimation
----------------------
User space can invoke the aging by writing "+ memcg_id node_id gen
[swappiness]" to /sys/kernel/debug/lru_gen. This debugfs interface
also provides the birth time and the size of each generation.

For example, given a pool of machines, a job scheduler periodically
invokes the aging to estimate the working set of each machine. It
then ranks the machines by the sizes of their working sets and
selects the most suitable ones to land new jobs.
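
A scheduler-side helper for this use case might look like the Python
sketch below. The function names are hypothetical, the command format
is the one quoted above, and ranking the smallest working set first is
an assumption about what the scheduler prefers.

```python
def aging_command(memcg_id, node_id, gen, swappiness=None):
    """Format '+ memcg_id node_id gen [swappiness]' as a command line."""
    fields = ["+", str(memcg_id), str(node_id), str(gen)]
    if swappiness is not None:
        fields.append(str(swappiness))
    return " ".join(fields)


def rank_machines(working_set_bytes):
    """Order machine names from smallest to largest working set.

    working_set_bytes maps a machine name to its estimated working
    set size (how the estimate is obtained is outside this sketch).
    """
    return sorted(working_set_bytes, key=working_set_bytes.get)
```

The string returned by aging_command() would then be written to
/sys/kernel/debug/lru_gen, which typically requires root privileges
and a mounted debugfs.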

Proactive reclaim
-----------------
User space can invoke the eviction by writing "- memcg_id node_id gen
[swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen. Multiple
command lines are supported, as is concatenation with delimiters.

For example, a job scheduler can invoke the eviction if it anticipates
new jobs. The savings from proactive reclaim may help provide certain
SLA guarantees when new jobs actually land.
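
Eviction commands can be composed the same way. The Python sketch
below uses hypothetical helper names and only relies on the format
quoted above. It assumes that, because the optional fields are
positional, nr_to_reclaim cannot be given without swappiness, and it
joins several commands with newlines instead of guessing the delimiter
characters.

```python
def eviction_command(memcg_id, node_id, gen, swappiness=None,
                     nr_to_reclaim=None):
    """Format '- memcg_id node_id gen [swappiness] [nr_to_reclaim]'."""
    if nr_to_reclaim is not None and swappiness is None:
        # Assumption: the optional fields are positional.
        raise ValueError("nr_to_reclaim requires swappiness")
    fields = ["-", str(memcg_id), str(node_id), str(gen)]
    if swappiness is not None:
        fields.append(str(swappiness))
    if nr_to_reclaim is not None:
        fields.append(str(nr_to_reclaim))
    return " ".join(fields)


def batch(commands):
    """Join several commands, one per line, for a single write."""
    return "\n".join(commands) + "\n"
```

For instance, batch([eviction_command(1, 0, 2), eviction_command(2, 0,
2, 60, 1000)]) yields two command lines that could be written to
/sys/kernel/debug/lru_gen in one go.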

Yu Zhao (14):
  include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
    !CONFIG_MEMCG
  include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
  include/linux/cgroup.h: export cgroup_mutex
  mm, x86: support the access bit on non-leaf PMD entries
  mm/vmscan.c: refactor shrink_node()
  mm/workingset.c: refactor pack_shadow() and unpack_shadow()
  mm: multigenerational lru: groundwork
  mm: multigenerational lru: activation
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: aging
  mm: multigenerational lru: eviction
  mm: multigenerational lru: user interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst        |    1 +
 Documentation/vm/multigen_lru.rst |  143 ++
 arch/Kconfig                      |    9 +
 arch/x86/Kconfig                  |    1 +
 arch/x86/include/asm/pgtable.h    |    2 +-
 arch/x86/mm/pgtable.c             |    5 +-
 fs/exec.c                         |    2 +
 fs/fuse/dev.c                     |    3 +-
 include/linux/cgroup.h            |   15 +-
 include/linux/memcontrol.h        |    7 +-
 include/linux/mm.h                |    2 +
 include/linux/mm_inline.h         |  234 +++
 include/linux/mm_types.h          |  107 ++
 include/linux/mmzone.h            |  117 ++
 include/linux/nodemask.h          |    1 +
 include/linux/page-flags-layout.h |   19 +-
 include/linux/page-flags.h        |    4 +-
 include/linux/pgtable.h           |    4 +-
 include/linux/swap.h              |    4 +-
 kernel/bounds.c                   |    6 +
 kernel/events/uprobes.c           |    2 +-
 kernel/exit.c                     |    1 +
 kernel/fork.c                     |   10 +
 kernel/kthread.c                  |    1 +
 kernel/sched/core.c               |    2 +
 mm/Kconfig                        |   58 +
 mm/huge_memory.c                  |    5 +-
 mm/khugepaged.c                   |    2 +-
 mm/memcontrol.c                   |   28 +
 mm/memory.c                       |   10 +-
 mm/migrate.c                      |    2 +-
 mm/mm_init.c                      |    6 +-
 mm/mmzone.c                       |    2 +
 mm/rmap.c                         |    6 +
 mm/swap.c                         |   22 +-
 mm/swapfile.c                     |    6 +-
 mm/userfaultfd.c                  |    2 +-
 mm/vmscan.c                       | 2638 ++++++++++++++++++++++++++++-
 mm/workingset.c                   |  169 +-
 39 files changed, 3498 insertions(+), 160 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

-- 
2.31.1.751.gd2f1c929bd-goog

