Linux-mm Archive on
From: Johannes Weiner <>
To: Andrew Morton <>
Cc: Andrey Ryabinin <>,
	Suren Baghdasaryan <>,
	Shakeel Butt <>,
	Rik van Riel <>, Michal Hocko <>
Subject: [PATCH 0/3] mm: fix page aging across multiple cgroups
Date: Thu,  7 Nov 2019 12:53:31 -0800

When applications are put into unconfigured cgroups for memory
accounting purposes, the cgrouping itself should not change the
behavior of the page reclaim code. We expect the VM to reclaim the
coldest pages in the system. But right now the VM can reclaim hot
pages in one cgroup while there is eligible cold cache in others.

This is because one part of the reclaim algorithm isn't truly cgroup
hierarchy aware: the inactive/active list balancing. That is the part
that is supposed to protect hot cache data from one-off streaming IO.

The recursive cgroup reclaim scheme will scan and rotate the physical
LRU lists of each eligible cgroup at the same rate in a round-robin
fashion, thereby establishing a relative order among the pages of all
those cgroups. However, the inactive/active balancing decisions are
made locally within each cgroup, so when a cgroup is running low on
cold pages, its hot pages will get reclaimed - even when sibling
cgroups have plenty of cold cache eligible in the same reclaim run.

For example:

   [root@ham ~]# head -n1 /proc/meminfo 
   MemTotal:        1016336 kB

   [root@ham ~]# ./ 
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.269s
   user    0m0.051s
   sys     0m4.182s
   Hot pages cached: 134/12800 workingset-a

The streaming IO in B, which doesn't benefit from caching at all,
pushes out most of the workingset in A.


This series fixes the problem by elevating inactive/active balancing
decisions to the toplevel of the reclaim run. This is either a cgroup
that hit its limit, or straight-up global reclaim if there is physical
memory pressure. From there, it takes a recursive view of the cgroup
subtree to decide whether page deactivation is necessary.

In the test above, the VM will then recognize that cgroup B has plenty
of eligible cold cache, and that the hot pages in A can be spared:

   [root@ham ~]# ./ 
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.244s
   user    0m0.064s
   sys     0m4.177s
   Hot pages cached: 12800/12800 workingset-a
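
To illustrate the difference between the two policies, here is a toy
model (not kernel code). It assumes a bare "inactive smaller than
active" check in place of the real ratio logic. The Cgroup class, the
function names, and the 200000 cold pages given to B are hypothetical;
only the 12800 hot pages in A come from the test above.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    name: str
    inactive: int  # cold file pages (inactive LRU), in pages
    active: int    # hot file pages (active LRU), in pages


def inactive_is_low(inactive: int, active: int) -> bool:
    # Simplifying assumption: a plain 1:1 comparison.
    return inactive < active


def deactivate_local(groups):
    """Old behavior in this model: each cgroup only looks at its own lists."""
    return {g.name: inactive_is_low(g.inactive, g.active) for g in groups}


def deactivate_at_root(groups):
    """New behavior in this model: one decision from the summed subtree lists."""
    low = inactive_is_low(
        sum(g.inactive for g in groups),
        sum(g.active for g in groups),
    )
    return {g.name: low for g in groups}


groups = [
    Cgroup("A", inactive=0, active=12800),    # hot workingset, no cold pages
    Cgroup("B", inactive=200000, active=0),   # streaming IO, all cold cache
]

print(deactivate_local(groups))    # {'A': True, 'B': False}
print(deactivate_at_root(groups))  # {'A': False, 'B': False}
```

With local decisions, A is told to deactivate its hot pages because it
has no cold pages of its own. With a single decision at the reclaim
root, the cold cache in B counts for the whole subtree and A is left
alone.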


Whether active pages can be deactivated or not is influenced by two
factors: the inactive list dropping below a minimum size relative to
the active list, and the occurrence of refaults.

This patch series first moves refault detection to the reclaim root,
then enforces the minimum inactive size based on a recursive view of
the cgroup tree's LRUs.
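
As a rough sketch of how the two factors might combine, the model
below allows deactivation when refaults were observed or when the
inactive list is too small relative to the active list. The
size-dependent ratio (integer square root of ten times the list size
in GiB), the 4 KiB page size, and all names are assumptions made for
illustration, not a copy of the code in this series.

```python
import math

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def inactive_ratio(inactive: int, active: int) -> int:
    """Target active:inactive ratio; grows with total list size (assumed heuristic)."""
    gb = (inactive + active) >> (30 - PAGE_SHIFT)  # list size in whole GiB
    return math.isqrt(10 * gb) if gb else 1


def may_deactivate(inactive: int, active: int, refaults_seen: bool) -> bool:
    """Combine both factors. Sizes are in pages, summed over the subtree."""
    if refaults_seen:
        # Refaults suggest the inactive list is too short to catch reuse.
        return True
    return inactive * inactive_ratio(inactive, active) < active


# Small lists (< 1 GiB total): ratio is 1, so plain size comparison applies.
print(may_deactivate(1000, 2000, refaults_seen=False))  # True
print(may_deactivate(2000, 1000, refaults_seen=False))  # False
# Refaults override the size check.
print(may_deactivate(2000, 1000, refaults_seen=True))   # True
```

In this model both inputs would be gathered at the reclaim root, so a
cgroup with no cold pages of its own does not by itself trigger
deactivation.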


Note that this actually never worked correctly in Linux cgroups. In
the past it worked for global reclaim and leaf limit reclaim only (we
used to have two physical LRU linkages per page), but it never worked
for intermediate limit reclaim over multiple leaf cgroups.

We're noticing this now because 1) we're putting everything into
cgroups for accounting, not just the things we want to control and 2)
we're moving away from leaf limits that invoke reclaim on individual
cgroups, toward large tree reclaim, triggered by high-level limits, or
physical memory pressure that is influenced by local protections such
as memory.low and memory.min instead.


These changes are based on v5.4-rc6-mmotm-2019-11-05-20-44.

 include/linux/memcontrol.h |   5 +
 include/linux/mmzone.h     |   4 +-
 include/linux/swap.h       |   2 +-
 mm/vmscan.c                | 269 +++++++++++++++++++++++++------------------
 mm/workingset.c            |  72 +++++++++---
 5 files changed, 223 insertions(+), 129 deletions(-)

Thread overview: 20+ messages
2019-11-07 20:53 Johannes Weiner [this message]
2019-11-07 20:53 ` [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level Johannes Weiner
2019-11-10 22:02   ` Suren Baghdasaryan
2019-11-10 22:09   ` Khadarnimcaan Khadarnimcaan
2019-11-07 20:53 ` [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root Johannes Weiner
2019-11-11  2:01   ` Suren Baghdasaryan
2019-11-12 17:45     ` Johannes Weiner
2019-11-12 18:45       ` Suren Baghdasaryan
2019-11-12 18:59         ` Johannes Weiner
2019-11-12 20:35           ` Suren Baghdasaryan
2019-11-14 23:47   ` Shakeel Butt
2019-11-15 16:07     ` Johannes Weiner
2019-11-15 16:52       ` Shakeel Butt
2019-11-07 20:53 ` [PATCH 3/3] mm: vmscan: enforce inactive:active ratio " Johannes Weiner
2019-11-11  2:15   ` Suren Baghdasaryan
2019-11-12 18:00     ` Johannes Weiner
2019-11-12 19:13       ` Suren Baghdasaryan
2019-11-12 20:34         ` Suren Baghdasaryan
2019-11-15  0:29   ` Shakeel Butt
2019-11-27 22:16     ` Shakeel Butt
