From: Shakeel Butt <shakeelb@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Linux MM <linux-mm@kvack.org>,
	 Cgroups <cgroups@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	 Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH 00/11] mm: fix page aging across multiple cgroups
Date: Wed, 6 Nov 2019 18:50:25 -0800
Message-ID: <CALvZod7821vuP_KcOKZkzKu-6b_kzDPrximi3E-Ld95fd=zbMg@mail.gmail.com>
In-Reply-To: <20190603210746.15800-1-hannes@cmpxchg.org>

On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> When applications are put into unconfigured cgroups for memory
> accounting purposes, the cgrouping itself should not change the
> behavior of the page reclaim code. We expect the VM to reclaim the
> coldest pages in the system. But right now the VM can reclaim hot
> pages in one cgroup while there is eligible cold cache in others.
>
> This is because one part of the reclaim algorithm isn't truly cgroup
> hierarchy aware: the inactive/active list balancing. That is the part
> that is supposed to protect hot cache data from one-off streaming IO.
>
> The recursive cgroup reclaim scheme will scan and rotate the physical
> LRU lists of each eligible cgroup at the same rate in a round-robin
> fashion, thereby establishing a relative order among the pages of all
> those cgroups. However, the inactive/active balancing decisions are
> made locally within each cgroup, so when a cgroup is running low on
> cold pages, its hot pages will get reclaimed - even when sibling
> cgroups have plenty of cold cache eligible in the same reclaim run.
>
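Just to check my understanding of where the locality bites, here is a
toy user-space model of that per-cgroup decision (my simplification
with a flat 1:1 inactive:active ratio, not the kernel code):

  #include <stdbool.h>
  #include <stdio.h>

  struct lruvec { long inactive, active; };

  /* Pre-series shape: each cgroup consults only its own lists
   * when deciding whether to deactivate. */
  static bool inactive_is_low_local(const struct lruvec *v)
  {
      return v->inactive < v->active;
  }

  int main(void)
  {
      struct lruvec a = { .inactive = 100,    .active = 12800 }; /* hot set */
      struct lruvec b = { .inactive = 500000, .active = 0 };     /* stream */

      /* A sees only itself, so it starts eating its own hot pages
       * even though B is full of cold, reclaimable cache: */
      printf("A deactivates: %d\n", inactive_is_low_local(&a)); /* 1 */
      printf("B deactivates: %d\n", inactive_is_low_local(&b)); /* 0 */
      return 0;
  }
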
> For example:
>
>    [root@ham ~]# head -n1 /proc/meminfo
>    MemTotal:        1016336 kB
>
>    [root@ham ~]# ./reclaimtest2.sh
>    Establishing 50M active files in cgroup A...
>    Hot pages cached: 12800/12800 workingset-a
>    Linearly scanning through 18G of file data in cgroup B:
>    real    0m4.269s
>    user    0m0.051s
>    sys     0m4.182s
>    Hot pages cached: 134/12800 workingset-a
>

Can you share reclaimtest2.sh as well? Maybe it could become a
selftest to monitor/test future changes.
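
In case it helps, this is roughly the shape I had in mind: a purely
hypothetical skeleton poking the cgroup2 interface files, with the
mount point, group name, and workload all assumed rather than taken
from your script:

  #include <errno.h>
  #include <stdio.h>
  #include <sys/stat.h>

  int main(void)
  {
      /* Assumed cgroup2 mount point and group name: */
      const char *cg = "/sys/fs/cgroup/workingset-a";
      char path[128], val[64];
      FILE *f;

      if (mkdir(cg, 0755) && errno != EEXIST) {
          perror("mkdir");
          return 1;
      }

      /* ... move a task in via cgroup.procs, fault in ~50M of
       * file cache, then run the streaming read in a sibling
       * group (both steps omitted here) ... */

      /* Read back how much is still charged to the group: */
      snprintf(path, sizeof(path), "%s/memory.current", cg);
      f = fopen(path, "r");
      if (f && fgets(val, sizeof(val), f))
          printf("still charged to %s: %s", cg, val);
      if (f)
          fclose(f);
      return 0;
  }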


> The streaming IO in B, which doesn't benefit from caching at all,
> pushes out most of the workingset in A.
>
> Solution
>
> This series fixes the problem by elevating inactive/active balancing
> decisions to the toplevel of the reclaim run. This is either a cgroup
> that hit its limit, or straight-up global reclaim if there is physical
> memory pressure. From there, it takes a recursive view of the cgroup
> subtree to decide whether page deactivation is necessary.
>
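And if I am reading the series right, the same toy model with the
decision hoisted to the reclaim root reproduces your second run
(again my simplification, not the patch code):

  #include <stdbool.h>
  #include <stdio.h>

  struct lruvec { long inactive, active; };

  /* Post-series shape: sum the lists over every cgroup in the
   * reclaim subtree and decide once, at the top. */
  static bool inactive_is_low_recursive(const struct lruvec *vs, int n)
  {
      long inactive = 0, active = 0;

      for (int i = 0; i < n; i++) {
          inactive += vs[i].inactive;
          active += vs[i].active;
      }
      return inactive < active;
  }

  int main(void)
  {
      /* Same A/B setup as your example: hot workingset in A,
       * a large streaming scan in B. */
      struct lruvec subtree[] = {
          { .inactive = 100,    .active = 12800 }, /* A */
          { .inactive = 500000, .active = 0 },     /* B */
      };

      /* Plenty of cold cache subtree-wide, so no deactivation,
       * and A's hot pages are spared: */
      printf("deactivate: %d\n", inactive_is_low_recursive(subtree, 2));
      return 0;
  }
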
> In the test above, the VM will then recognize that cgroup B has plenty
> of eligible cold cache, and that the hot pages in A can be spared:
>
>    [root@ham ~]# ./reclaimtest2.sh
>    Establishing 50M active files in cgroup A...
>    Hot pages cached: 12800/12800 workingset-a
>    Linearly scanning through 18G of file data in cgroup B:
>    real    0m4.244s
>    user    0m0.064s
>    sys     0m4.177s
>    Hot pages cached: 12800/12800 workingset-a
>
> Implementation
>
> Whether active pages can be deactivated or not is influenced by two
> factors: the inactive list dropping below a minimum size relative to
> the active list, and the occurrence of refaults.
>
> After some cleanups and preparations, this patch series first moves
> refault detection to the reclaim root, then enforces the minimum
> inactive size based on a recursive view of the cgroup tree's LRUs.
>
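For anyone following along, my recollection of the minimum-size
heuristic being made recursive here (ratio as in mm/vmscan.c around
v5.2, quoted from memory, so approximate; observed refaults
additionally disable active-list protection entirely):

  #include <math.h>
  #include <stdio.h>

  /* The active list may grow relative to inactive as the total
   * grows: for file pages, roughly int_sqrt(10 * size_in_GB),
   * with a minimum ratio of 1. Compile with -lm. */
  static long inactive_ratio(long total_pages, int page_shift)
  {
      long gb = total_pages >> (30 - page_shift);

      return gb ? (long)sqrt(10.0 * gb) : 1;
  }

  int main(void)
  {
      /* With 4K pages: deactivation starts once
       * inactive * ratio < active. */
      printf("ratio at 1G:  %ld\n", inactive_ratio(1L << 18, 12)); /* 3 */
      printf("ratio at 16G: %ld\n", inactive_ratio(1L << 22, 12)); /* 12 */
      return 0;
  }

The series' change, as I understand it, is that these sizes are now
computed over the whole subtree rather than one cgroup's lruvec.
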
> History
>
> Note that this actually never worked correctly in Linux cgroups. In
> the past it worked for global reclaim and leaf limit reclaim only (we
> used to have two physical LRU linkages per page), but it never worked
> for intermediate limit reclaim over multiple leaf cgroups.
>
> We're noticing this now because 1) we're putting everything into
> cgroups for accounting, not just the things we want to control, and
> 2) we're moving away from leaf limits that invoke reclaim on
> individual cgroups, toward large tree reclaim that is triggered by
> high-level limits or physical memory pressure and shaped by local
> protections such as memory.low and memory.min.
>
> Requirements
>
> These changes are based on the fast recursive memcg stats merged in
> 5.2-rc1. The patches are against v5.2-rc2-mmots-2019-05-29-20-56-12
> plus the page cache fix in https://lkml.org/lkml/2019/5/24/813.
>
>  include/linux/memcontrol.h |  37 +--
>  include/linux/mmzone.h     |  30 +-
>  include/linux/swap.h       |   2 +-
>  mm/memcontrol.c            |   6 +-
>  mm/page_alloc.c            |   2 +-
>  mm/vmscan.c                | 667 ++++++++++++++++++++++---------------------
>  mm/workingset.c            |  74 +++--
>  7 files changed, 437 insertions(+), 381 deletions(-)
>
>

