From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.2 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8D664C433B4 for ; Tue, 13 Apr 2021 06:57:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5CED260FDB for ; Tue, 13 Apr 2021 06:57:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S244503AbhDMG6L (ORCPT ); Tue, 13 Apr 2021 02:58:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44250 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345123AbhDMG5Y (ORCPT ); Tue, 13 Apr 2021 02:57:24 -0400 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 14C06C061756 for ; Mon, 12 Apr 2021 23:57:05 -0700 (PDT) Received: by mail-yb1-xb49.google.com with SMTP id p75so9209574ybc.8 for ; Mon, 12 Apr 2021 23:57:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=fZsS4S+ppDN6vse6LQilTb+995ZpejDyoXEkWEzhPiI=; b=JPzEmLg8IXqkikE/b+k7FNKSdKIPd2lLmXlP9sfI87JvOkw09qdZ+KRrlaAD+a9Dhn 005sbjcbFZ0lFEPYPSKaDUzlN3hBr3DSo7pYAg76+SLl3Ga5vXEbxhKRzSwelQO0SjpX rhHL0KytAzNOPmRXNi0zkAQkCW4EAqyrBAkMJuC7dTB6jIRG6ER1dzInKps5oaOL1wQs HLIiBt2/Ahnea89fcjAFJPIS7nNG2lwTqqUVTkoanckNkavhBDYk0VsP07i7LdiYi9zN 
+LOuJNV+snejmLdfr2/3+aMXbxqjF2clhWnkNv/9X/ng5LI35tZxiwJOcncdT6c0vONU rPQA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=fZsS4S+ppDN6vse6LQilTb+995ZpejDyoXEkWEzhPiI=; b=Mmy7jkv8AlhXjPNjblEwvM3ZtDGk7NKvJ6rsLmF6f0BWgbZq1tIB6pdyHgFU312oCj y4lT+2OfaNXkHdc1m9GGWuWIiWBODWDms6SOZyoSt3DzZKzcdOzZvjUSS2YPZRhtMBP8 dB9FKMTZmwSiNzB4tdOneaAVzDRY5bshb8bACVfCaWFqtKUYRJ7IUedFh3omjJHSY8FV 6STGtMN3VWQZjRvtH7TufrAvCfWEWJ4oYHPhHmGG2DIS+7aQ6CbYgjel6Xiw7E9VkAg2 JoiFRDcRNv+ByQW+uYw+Z96cYJm5wf4hkkC+/iCib2vWT1vXRgZ7CRYsjyRwZmHJd2Jy fKJA== X-Gm-Message-State: AOAM532ohDzhQEIUgvNgG4R8COEdtptVwp/WFnYFKQYURGql6xBpawoF Y2GA+8fymXJP5OJ1UDw0RBDHBeXkM1Q= X-Google-Smtp-Source: ABdhPJzHOTHYLMuXC88wBZEF39dm7Sun3+0TVIBRLg85pDR3z2FX1I51OcfzuM68n03ioC4rVU3FQw4etPM= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:d02d:cccc:9ebe:9fe9]) (user=yuzhao job=sendgmr) by 2002:a25:e00f:: with SMTP id x15mr25695207ybg.85.1618297024186; Mon, 12 Apr 2021 23:57:04 -0700 (PDT) Date: Tue, 13 Apr 2021 00:56:33 -0600 In-Reply-To: <20210413065633.2782273-1-yuzhao@google.com> Message-Id: <20210413065633.2782273-17-yuzhao@google.com> Mime-Version: 1.0 References: <20210413065633.2782273-1-yuzhao@google.com> X-Mailer: git-send-email 2.31.1.295.g9ea45b61b8-goog Subject: [PATCH v2 16/16] mm: multigenerational lru: documentation From: Yu Zhao To: linux-mm@kvack.org Cc: Alex Shi , Andi Kleen , Andrew Morton , Benjamin Manes , Dave Chinner , Dave Hansen , Hillf Danton , Jens Axboe , Johannes Weiner , Jonathan Corbet , Joonsoo Kim , Matthew Wilcox , Mel Gorman , Miaohe Lin , Michael Larabel , Michal Hocko , Michel Lespinasse , Rik van Riel , Roman Gushchin , Rong Chen , SeongJae Park , Tim Chen , Vlastimil Babka , Yang Shi , Ying Huang , Zi Yan , linux-kernel@vger.kernel.org, lkp@lists.01.org, page-reclaim@google.com, Yu Zhao Content-Type: text/plain; charset="UTF-8" Precedence: 
bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao
---
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 192 ++++++++++++++++++++++++++++++
 2 files changed, 193 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
 
    swap_numa
    zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==================================
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..cf772aeca317
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,192 @@
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Options
+-------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support
+ a maximum of ``X`` generations.
+
+:Optional: Change ``CONFIG_TIERS_PER_GEN`` to a number ``Y`` to support
+ a maximum of ``Y`` tiers per generation.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+ default.
+
+Runtime Options
+---------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
+ feature was not turned on by default.
+
+:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N``
+ to spread pages out across ``N+1`` generations. ``N`` should be less
+ than ``X``. Larger values make the background aging more aggressive.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature.
+ This file has the following output:
+
+::
+
+  memcg memcg_id memcg_path
+    node node_id
+      min_gen birth_time anon_size file_size
+      ...
+      max_gen birth_time anon_size file_size
+
+Given a memcg and a node, ``min_gen`` is the oldest generation
+(number) and ``max_gen`` is the youngest. Birth time is in
+milliseconds. The sizes of anon and file types are in pages.
+
+Recipes
+-------
+:Android on ARMv8.1+: ``X=4``, ``N=0``
+
+:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
+ ``ARM64_HW_AFDBM``
+
+:Laptops running Chrome on x86_64: ``X=7``, ``N=2``
+
+:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
+ to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
+ generation ``max_gen`` and create the next generation ``max_gen+1``.
+ ``gen`` should be equal to ``max_gen``. A swap file and a non-zero
+ ``swappiness`` are required to scan the anon type. If swapping is
+ not desired, set ``vm.swappiness`` to ``0``.
+
+:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
+ [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
+ generations less than or equal to ``gen``. ``gen`` should be less
+ than ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active
+ generations and therefore protected from eviction. Use
+ ``nr_to_reclaim`` to limit the number of pages to be evicted.
+ Multiple command lines are supported, and so is concatenation with
+ delimiters ``,`` and ``;``.
+
+Framework
+=========
+For each ``lruvec``, evictable pages are divided into multiple
+generations. The youngest generation number is stored in ``max_seq``
+for both anon and file types as they are aged on an equal footing. The
+oldest generation numbers are stored in ``min_seq[2]`` separately for
+anon and file types as clean file pages can be evicted regardless of
+swap and write-back constraints. Generation numbers are truncated into
+``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists. Evictable pages are added to the per-zone lists indexed by
+``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``),
+depending on whether they are being faulted in.
+
+Each generation is then divided into multiple tiers. Tiers represent
+levels of usage from file descriptors only. Pages accessed ``N`` times
+via file descriptors belong to tier ``order_base_2(N)``. In contrast to
+moving across generations, which requires the LRU lock, moving across
+tiers only involves an atomic operation on ``page->flags`` and
+therefore has a negligible cost.
+
+The workflow comprises two conceptually independent functions: the
+aging and the eviction.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+scans page tables for referenced pages of this ``lruvec``. Upon
+finding one, the aging updates its generation number to ``max_seq``.
+After each round of scanning, the aging increments ``max_seq``.
+
+The aging maintains either a system-wide ``mm_struct`` list or
+per-memcg ``mm_struct`` lists, and it only scans page tables of
+processes that have been scheduled since the last scan. Since scans
+are differential with respect to referenced pages, the cost is roughly
+proportional to their number.
+
+The aging is due when both values of ``min_seq[2]`` reach
+``max_seq-1``, assuming both anon and file types are reclaimable.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans the pages on the per-zone lists indexed by either of
+``min_seq[2]``. It first tries to select a type based on the values of
+``min_seq[2]``. When anon and file types are both available from the
+same generation, it selects the one that has a lower refault rate.
+
+During a scan, the eviction sorts pages according to their generation
+numbers, if the aging has found them referenced. It also moves pages
+from the tiers that have higher refault rates than tier 0 to the next
+generation.
+
+When it finds that all the per-zone lists of a selected type are empty,
+the eviction increments ``min_seq[2]`` indexed by this selected type.
+
+Rationale
+=========
+Limitations of Current Implementation
+-------------------------------------
+Notion of Active/Inactive
+~~~~~~~~~~~~~~~~~~~~~~~~~
+For servers equipped with hundreds of gigabytes of memory, the
+granularity of the active/inactive notion is too coarse to be useful
+for job scheduling. False active/inactive rates are relatively high,
+and thus the assumed savings may not materialize.
+
+For phones and laptops, executable pages are frequently evicted
+despite the fact that there are many less recently used anon pages.
+Major faults on executable pages cause ``janks`` (slow UI renderings)
+and negatively impact user experience.
+
+For ``lruvec``\s from different memcgs or nodes, comparisons are
+impossible due to the lack of a common frame of reference.
+
+Incremental Scans via ``rmap``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each incremental scan picks up where the last scan left off and
+stops after it has found a handful of unreferenced pages. For
+workloads using a large amount of anon memory, incremental scans lose
+the advantage under sustained memory pressure due to high ratios of
+the number of scanned pages to the number of reclaimed pages. On top
+of that, the ``rmap`` has poor memory locality due to its complex data
+structures. The combined effects typically result in high CPU usage
+in the reclaim path.
+
+Benefits of Multigenerational LRU
+---------------------------------
+Notion of Generation Numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The notion of generation numbers introduces a quantitative approach to
+memory overcommit. A larger number of pages can be spread out across
+a configurable number of generations, and thus they have relatively
+low false active/inactive rates. Each generation includes all pages
+that have been referenced since the last generation.
+
+Given an ``lruvec``, scans and the selections between anon and file
+types are all based on generation numbers, which are simple and yet
+effective. For different ``lruvec``\s, comparisons are still possible
+based on birth times of generations.
+
+Differential Scans via Page Tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each differential scan discovers all pages that have been referenced
+since the last scan. Specifically, it walks the ``mm_struct`` list
+associated with an ``lruvec`` to scan page tables of processes that
+have been scheduled since the last scan. The cost of each differential
+scan is roughly proportional to the number of referenced pages it
+discovers. Unless address spaces are extremely sparse, page tables
+usually have better memory locality than the ``rmap``. The end result
+is generally a significant reduction in CPU usage for workloads
+using a large amount of anon memory.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table scanning.
+
+NUMA Optimization
+-----------------
+Support NUMA policies and per-node RSS counters.
-- 2.31.1.295.g9ea45b61b8-goog From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.2 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1C1FEC433B4 for ; Tue, 13 Apr 2021 06:57:17 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id B4EDF61278 for ; Tue, 13 Apr 2021 06:57:16 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org B4EDF61278 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id E97EF6B0087; Tue, 13 Apr 2021 02:57:05 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E49476B0088; Tue, 13 Apr 2021 02:57:05 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CC6ED6B0089; Tue, 13 Apr 2021 02:57:05 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0130.hostedemail.com [216.40.44.130]) by kanga.kvack.org (Postfix) with ESMTP id A96176B0087 for ; Tue, 13 Apr 2021 02:57:05 -0400 (EDT) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 732081E03 for ; Tue, 13 Apr 2021 06:57:05 +0000 (UTC) X-FDA: 78026436810.29.E983EEC Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) by imf07.hostedemail.com (Postfix) with ESMTP 
id D2D24A000395 for ; Tue, 13 Apr 2021 06:57:04 +0000 (UTC) Received: by mail-yb1-f201.google.com with SMTP id 131so15395392ybp.16 for ; Mon, 12 Apr 2021 23:57:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=fZsS4S+ppDN6vse6LQilTb+995ZpejDyoXEkWEzhPiI=; b=JPzEmLg8IXqkikE/b+k7FNKSdKIPd2lLmXlP9sfI87JvOkw09qdZ+KRrlaAD+a9Dhn 005sbjcbFZ0lFEPYPSKaDUzlN3hBr3DSo7pYAg76+SLl3Ga5vXEbxhKRzSwelQO0SjpX rhHL0KytAzNOPmRXNi0zkAQkCW4EAqyrBAkMJuC7dTB6jIRG6ER1dzInKps5oaOL1wQs HLIiBt2/Ahnea89fcjAFJPIS7nNG2lwTqqUVTkoanckNkavhBDYk0VsP07i7LdiYi9zN +LOuJNV+snejmLdfr2/3+aMXbxqjF2clhWnkNv/9X/ng5LI35tZxiwJOcncdT6c0vONU rPQA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=fZsS4S+ppDN6vse6LQilTb+995ZpejDyoXEkWEzhPiI=; b=HXcH4LVxZQyddAL3oVugQNFPcAomAkxkllIrfI1cCRm2oq/veXfq+H70E8ihOw1ssp hm9KIj0dH4kshsEBFtlgNIPJkd+vQT4yXXLiSzbYqDI2A3ZmpYzVRrdqqd0Y4KQqaUsG 3KP7Og8PYX5epnrL9U3qNvObl1uEqLEgpvTP7C275ZETta5Rje6LmNiyqULSiMOzAavA I0Ioa+i127Pp/iAINXE5jfknNRBUA1k83J2ITLe3gWfI3qWa7g9QEC0eDVX3M21Y1ido sXz0qYQXsSgULV4IGQ3C/OXSMKSehB+AwSKZgKdwyTugmuXLvOMtY645erUiszQM+/XL sfaA== X-Gm-Message-State: AOAM531xGSHQqUKrNtzyVhCUKI54EFyuX/uXSZqtefEwqTuKOQBAbRFI OOeF3KtMHIXGE6u0Ve7S1jFaDtNo3x7LINIIokImkwxFFvxCEks2P3M4c96cToe8MG9wusXUBU+ adlHfiDIHKDYGckeMxhbSIGcpJ4knR0OCPgbJKjxTS4nYF5XMNjCymd6V X-Google-Smtp-Source: ABdhPJzHOTHYLMuXC88wBZEF39dm7Sun3+0TVIBRLg85pDR3z2FX1I51OcfzuM68n03ioC4rVU3FQw4etPM= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:d02d:cccc:9ebe:9fe9]) (user=yuzhao job=sendgmr) by 2002:a25:e00f:: with SMTP id x15mr25695207ybg.85.1618297024186; Mon, 12 Apr 2021 23:57:04 -0700 (PDT) Date: Tue, 13 Apr 2021 00:56:33 -0600 In-Reply-To: <20210413065633.2782273-1-yuzhao@google.com> Message-Id: 
<20210413065633.2782273-17-yuzhao@google.com> Mime-Version: 1.0 References: <20210413065633.2782273-1-yuzhao@google.com> X-Mailer: git-send-email 2.31.1.295.g9ea45b61b8-goog Subject: [PATCH v2 16/16] mm: multigenerational lru: documentation From: Yu Zhao To: linux-mm@kvack.org Cc: Alex Shi , Andi Kleen , Andrew Morton , Benjamin Manes , Dave Chinner , Dave Hansen , Hillf Danton , Jens Axboe , Johannes Weiner , Jonathan Corbet , Joonsoo Kim , Matthew Wilcox , Mel Gorman , Miaohe Lin , Michael Larabel , Michal Hocko , Michel Lespinasse , Rik van Riel , Roman Gushchin , Rong Chen , SeongJae Park , Tim Chen , Vlastimil Babka , Yang Shi , Ying Huang , Zi Yan , linux-kernel@vger.kernel.org, lkp@lists.01.org, page-reclaim@google.com, Yu Zhao Content-Type: text/plain; charset="UTF-8" X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: D2D24A000395 X-Stat-Signature: 4pcgjtuuzkwm656ezk7g4a6cog5eajxu Received-SPF: none (flex--yuzhao.bounces.google.com>: No applicable sender policy available) receiver=imf07; identity=mailfrom; envelope-from="<3wEB1YAYKCCEVRWE7LDLLDIB.9LJIFKRU-JJHS79H.LOD@flex--yuzhao.bounces.google.com>"; helo=mail-yb1-f201.google.com; client-ip=209.85.219.201 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1618297024-926043 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add Documentation/vm/multigen_lru.rst. 
Signed-off-by: Yu Zhao --- Documentation/vm/index.rst | 1 + Documentation/vm/multigen_lru.rst | 192 ++++++++++++++++++++++++++++++ 2 files changed, 193 insertions(+) create mode 100644 Documentation/vm/multigen_lru.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index eff5fbd492d0..c353b3f55924 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -17,6 +17,7 @@ various features of the Linux memory management swap_numa zswap + multigen_lru Kernel developers MM documentation ================================== diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst new file mode 100644 index 000000000000..cf772aeca317 --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,192 @@ +===================== +Multigenerational LRU +===================== + +Quick Start +=========== +Build Options +------------- +:Required: Set ``CONFIG_LRU_GEN=y``. + +:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support + a maximum of ``X`` generations. + +:Optional: Change ``CONFIG_TIERS_PER_GEN`` to a number ``Y`` to support + a maximum of ``Y`` tiers per generation. + +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by + default. + +Runtime Options +--------------- +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the + feature was not turned on by default. + +:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N`` + to spread pages out across ``N+1`` generations. ``N`` should be less + than ``X``. Larger values make the background aging more aggressive. + +:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature. + This file has the following output: + +:: + + memcg memcg_id memcg_path + node node_id + min_gen birth_time anon_size file_size + ... + max_gen birth_time anon_size file_size + +Given a memcg and a node, ``min_gen`` is the oldest generation +(number) and ``max_gen`` is the youngest. Birth time is in +milliseconds. 
The sizes of anon and file types are in pages. + +Recipes +------- +:Android on ARMv8.1+: ``X=4``, ``N=0`` + +:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of + ``ARM64_HW_AFDBM`` + +:Laptops running Chrome on x86_64: ``X=7``, ``N=2`` + +:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]`` + to ``/sys/kernel/debug/lru_gen`` to account referenced pages to + generation ``max_gen`` and create the next generation ``max_gen+1``. + ``gen`` should be equal to ``max_gen``. A swap file and a non-zero + ``swappiness`` are required to scan anon type. If swapping is not + desired, set ``vm.swappiness`` to ``0``. + +:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness] + [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict + generations less than or equal to ``gen``. ``gen`` should be less + than ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active + generations and therefore protected from the eviction. Use + ``nr_to_reclaim`` to limit the number of pages to be evicted. + Multiple command lines are supported, so does concatenation with + delimiters ``,`` and ``;``. + +Framework +========= +For each ``lruvec``, evictable pages are divided into multiple +generations. The youngest generation number is stored in ``max_seq`` +for both anon and file types as they are aged on an equal footing. The +oldest generation numbers are stored in ``min_seq[2]`` separately for +anon and file types as clean file pages can be evicted regardless of +swap and write-back constraints. Generation numbers are truncated into +``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into +``page->flags``. The sliding window technique is used to prevent +truncated generation numbers from overlapping. Each truncated +generation number is an index to an array of per-type and per-zone +lists. 
Evictable pages are added to the per-zone lists indexed by +``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``), +depending on whether they are being faulted in. + +Each generation is then divided into multiple tiers. Tiers represent +levels of usage from file descriptors only. Pages accessed N times via +file descriptors belong to tier order_base_2(N). In contrast to moving +across generations which requires the lru lock, moving across tiers +only involves an atomic operation on ``page->flags`` and therefore has +a negligible cost. + +The workflow comprises two conceptually independent functions: the +aging and the eviction. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, the aging +scans page tables for referenced pages of this ``lruvec``. Upon +finding one, the aging updates its generation number to ``max_seq``. +After each round of scan, the aging increments ``max_seq``. + +The aging maintains either a system-wide ``mm_struct`` list or +per-memcg ``mm_struct`` lists, and it only scans page tables of +processes that have been scheduled since the last scan. Since scans +are differential with respect to referenced pages, the cost is roughly +proportional to their number. + +The aging is due when both of ``min_seq[2]`` reaches ``max_seq-1``, +assuming both anon and file types are reclaimable. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, the +eviction scans the pages on the per-zone lists indexed by either of +``min_seq[2]``. It first tries to select a type based on the values of +``min_seq[2]``. When anon and file types are both available from the +same generation, it selects the one that has a lower refault rate. + +During a scan, the eviction sorts pages according to their generation +numbers, if the aging has found them referenced. It also moves pages +from the tiers that have higher refault rates than tier 0 to the next +generation. 
+ +When it finds all the per-zone lists of a selected type are empty, the +eviction increments ``min_seq[2]`` indexed by this selected type. + +Rationale +========= +Limitations of Current Implementation +------------------------------------- +Notion of Active/Inactive +~~~~~~~~~~~~~~~~~~~~~~~~~ +For servers equipped with hundreds of gigabytes of memory, the +granularity of the active/inactive is too coarse to be useful for job +scheduling. False active/inactive rates are relatively high, and thus +the assumed savings may not materialize. + +For phones and laptops, executable pages are frequently evicted +despite the fact that there are many less recently used anon pages. +Major faults on executable pages cause ``janks`` (slow UI renderings) +and negatively impact user experience. + +For ``lruvec``\s from different memcgs or nodes, comparisons are +impossible due to the lack of a common frame of reference. + +Incremental Scans via ``rmap`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Each incremental scan picks up at where the last scan left off and +stops after it has found a handful of unreferenced pages. For +workloads using a large amount of anon memory, incremental scans lose +the advantage under sustained memory pressure due to high ratios of +the number of scanned pages to the number of reclaimed pages. On top +of that, the ``rmap`` has poor memory locality due to its complex data +structures. The combined effects typically result in a high amount of +CPU usage in the reclaim path. + +Benefits of Multigenerational LRU +--------------------------------- +Notion of Generation Numbers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The notion of generation numbers introduces a quantitative approach to +memory overcommit. A larger number of pages can be spread out across +configurable generations, and thus they have relatively low false +active/inactive rates. Each generation includes all pages that have +been referenced since the last generation. 
+ +Given an ``lruvec``, scans and the selections between anon and file +types are all based on generation numbers, which are simple and yet +effective. For different ``lruvec``\s, comparisons are still possible +based on birth times of generations. + +Differential Scans via Page Tables +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Each differential scan discovers all pages that have been referenced +since the last scan. Specifically, it walks the ``mm_struct`` list +associated with an ``lruvec`` to scan page tables of processes that +have been scheduled since the last scan. The cost of each differential +scan is roughly proportional to the number of referenced pages it +discovers. Unless address spaces are extremely sparse, page tables +usually have better memory locality than the ``rmap``. The end result +is generally a significant reduction in CPU usage, for workloads +using a large amount of anon memory. + +To-do List +========== +KVM Optimization +---------------- +Support shadow page table scanning. + +NUMA Optimization +----------------- +Support NUMA policies and per-node RSS counters. -- 2.31.1.295.g9ea45b61b8-goog From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============2065719642533781014==" MIME-Version: 1.0 From: Yu Zhao To: lkp@lists.01.org Subject: [PATCH v2 16/16] mm: multigenerational lru: documentation Date: Tue, 13 Apr 2021 00:56:33 -0600 Message-ID: <20210413065633.2782273-17-yuzhao@google.com> In-Reply-To: <20210413065633.2782273-1-yuzhao@google.com> List-Id: --===============2065719642533781014== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Add Documentation/vm/multigen_lru.rst. 
Signed-off-by: Yu Zhao --- Documentation/vm/index.rst | 1 + Documentation/vm/multigen_lru.rst | 192 ++++++++++++++++++++++++++++++ 2 files changed, 193 insertions(+) create mode 100644 Documentation/vm/multigen_lru.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index eff5fbd492d0..c353b3f55924 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -17,6 +17,7 @@ various features of the Linux memory management = swap_numa zswap + multigen_lru = Kernel developers MM documentation =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_= lru.rst new file mode 100644 index 000000000000..cf772aeca317 --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,192 @@ +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +Multigenerational LRU +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +Quick Start +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +Build Options +------------- +:Required: Set ``CONFIG_LRU_GEN=3Dy``. + +:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support + a maximum of ``X`` generations. + +:Optional: Change ``CONFIG_TIERS_PER_GEN`` to a number ``Y`` to support + a maximum of ``Y`` tiers per generation. + +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=3Dy`` to turn the feature on by + default. + +Runtime Options +--------------- +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the + feature was not turned on by default. + +:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N`` + to spread pages out across ``N+1`` generations. ``N`` should be less + than ``X``. Larger values make the background aging more aggressive. + +:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature. + This file has the following output: + +:: + + memcg memcg_id memcg_path + node node_id + min_gen birth_time anon_size file_size + ... 
+ max_gen birth_time anon_size file_size + +Given a memcg and a node, ``min_gen`` is the oldest generation +(number) and ``max_gen`` is the youngest. Birth time is in +milliseconds. The sizes of anon and file types are in pages. + +Recipes +------- +:Android on ARMv8.1+: ``X=3D4``, ``N=3D0`` + +:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of + ``ARM64_HW_AFDBM`` + +:Laptops running Chrome on x86_64: ``X=3D7``, ``N=3D2`` + +:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]`` + to ``/sys/kernel/debug/lru_gen`` to account referenced pages to + generation ``max_gen`` and create the next generation ``max_gen+1``. + ``gen`` should be equal to ``max_gen``. A swap file and a non-zero + ``swappiness`` are required to scan anon type. If swapping is not + desired, set ``vm.swappiness`` to ``0``. + +:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness] + [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict + generations less than or equal to ``gen``. ``gen`` should be less + than ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active + generations and therefore protected from the eviction. Use + ``nr_to_reclaim`` to limit the number of pages to be evicted. + Multiple command lines are supported, so does concatenation with + delimiters ``,`` and ``;``. + +Framework +=3D=3D=3D=3D=3D=3D=3D=3D=3D +For each ``lruvec``, evictable pages are divided into multiple +generations. The youngest generation number is stored in ``max_seq`` +for both anon and file types as they are aged on an equal footing. The +oldest generation numbers are stored in ``min_seq[2]`` separately for +anon and file types as clean file pages can be evicted regardless of +swap and write-back constraints. Generation numbers are truncated into +``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into +``page->flags``. The sliding window technique is used to prevent +truncated generation numbers from overlapping. 
Each truncated +generation number is an index to an array of per-type and per-zone +lists. Evictable pages are added to the per-zone lists indexed by +``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``), +depending on whether they are being faulted in. + +Each generation is then divided into multiple tiers. Tiers represent +levels of usage from file descriptors only. Pages accessed N times via +file descriptors belong to tier order_base_2(N). In contrast to moving +across generations which requires the lru lock, moving across tiers +only involves an atomic operation on ``page->flags`` and therefore has +a negligible cost. + +The workflow comprises two conceptually independent functions: the +aging and the eviction. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, the aging +scans page tables for referenced pages of this ``lruvec``. Upon +finding one, the aging updates its generation number to ``max_seq``. +After each round of scan, the aging increments ``max_seq``. + +The aging maintains either a system-wide ``mm_struct`` list or +per-memcg ``mm_struct`` lists, and it only scans page tables of +processes that have been scheduled since the last scan. Since scans +are differential with respect to referenced pages, the cost is roughly +proportional to their number. + +The aging is due when both of ``min_seq[2]`` reaches ``max_seq-1``, +assuming both anon and file types are reclaimable. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, the +eviction scans the pages on the per-zone lists indexed by either of +``min_seq[2]``. It first tries to select a type based on the values of +``min_seq[2]``. When anon and file types are both available from the +same generation, it selects the one that has a lower refault rate. + +During a scan, the eviction sorts pages according to their generation +numbers, if the aging has found them referenced. 
+It also moves pages
+from the tiers that have higher refault rates than tier 0 to the next
+generation.
+
+When it finds that all the per-zone lists of the selected type are empty,
+the eviction increments the ``min_seq[2]`` entry indexed by that type.
+
+Rationale
+=========
+Limitations of Current Implementation
+-------------------------------------
+Notion of Active/Inactive
+~~~~~~~~~~~~~~~~~~~~~~~~~
+For servers equipped with hundreds of gigabytes of memory, the
+active/inactive granularity is too coarse to be useful for job
+scheduling. False active/inactive rates are relatively high, and thus
+the assumed savings may not materialize.
+
+For phones and laptops, executable pages are frequently evicted
+despite the fact that there are many less recently used anon pages.
+Major faults on executable pages cause ``janks`` (slow UI renderings)
+and negatively impact user experience.
+
+For ``lruvec``\s from different memcgs or nodes, comparisons are
+impossible due to the lack of a common frame of reference.
+
+Incremental Scans via ``rmap``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each incremental scan picks up where the last scan left off and
+stops after it has found a handful of unreferenced pages. For
+workloads using a large amount of anon memory, incremental scans lose
+their advantage under sustained memory pressure due to high ratios of
+the number of scanned pages to the number of reclaimed pages. On top
+of that, the ``rmap`` has poor memory locality due to its complex data
+structures. The combined effects typically result in high
+CPU usage in the reclaim path.
+
+Benefits of Multigenerational LRU
+---------------------------------
+Notion of Generation Numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The notion of generation numbers introduces a quantitative approach to
+memory overcommit. A larger number of pages can be spread out across
+configurable generations, and thus they have relatively low false
+active/inactive rates.
+Each generation includes all pages that have
+been referenced since the last generation.
+
+Given an ``lruvec``, scans and the selections between anon and file
+types are all based on generation numbers, which are simple and yet
+effective. For different ``lruvec``\s, comparisons are still possible
+based on the birth times of generations.
+
+Differential Scans via Page Tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each differential scan discovers all pages that have been referenced
+since the last scan. Specifically, it walks the ``mm_struct`` list
+associated with an ``lruvec`` to scan page tables of processes that
+have been scheduled since the last scan. The cost of each differential
+scan is roughly proportional to the number of referenced pages it
+discovers. Unless address spaces are extremely sparse, page tables
+usually have better memory locality than the ``rmap``. The end result
+is generally a significant reduction in CPU usage for workloads
+using a large amount of anon memory.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table scanning.
+
+NUMA Optimization
+-----------------
+Support NUMA policies and per-node RSS counters.
-- 
2.31.1.295.g9ea45b61b8-goog
--===============2065719642533781014==--