linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>, Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>, Roman Gushchin <guro@fb.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 0/7]: mm: memcontrol: switch to rstat
Date: Tue,  2 Feb 2021 13:47:39 -0500	[thread overview]
Message-ID: <20210202184746.119084-1-hannes@cmpxchg.org> (raw)

This series converts memcg stats tracking to the streamlined rstat
infrastructure provided by the cgroup core code. rstat is already used
by the CPU controller and the IO controller. This change is motivated
by recent accuracy problems in memcg's custom stats code, as well as
the benefits of sharing common infra with other controllers.

The current memcg implementation does batched tree aggregation on the
write side: local stat changes are cached in per-cpu counters, which
are then propagated upward in batches when a threshold (32 pages) is
exceeded. This is cheap, but the error introduced by the lazy upward
propagation adds up: 32 pages times CPUs times cgroups in the subtree.
We've had complaints from service owners that the stats do not
reliably track and react to allocation behavior as expected, sometimes
swallowing the results of entire test applications.

The original memcg stat implementation used to do tree aggregation
exclusively on the read side: local stats would only ever be tracked
in per-cpu counters, and a memory.stat read would iterate the entire
subtree and sum those counters up. This didn't keep up with the times:

- Cgroup trees are much bigger now. We switched to lazily-freed
  cgroups, where deleted groups would hang around until their
  remaining page cache has been reclaimed. This can result in large
  subtrees that are expensive to walk, while most of the groups are
  idle and their statistics don't change much anymore.

- Automated monitoring increased. With the proliferation of userspace
  oom killing, proactive reclaim, and higher-resolution logging of
  workload trends in general, top-level stat files are polled at least
  once a second in many deployments.

- The lifetime of cgroups got shorter. Where most cgroup setups in the
  past would have a few large policy-oriented cgroups for everything
  running on the system, newer cgroup deployments tend to create one
  group per application - which gets deleted again as the processes
  exit. An aggregation scheme that doesn't retain child data inside
  the parents loses event history of the subtree.

Rstat addresses all three of those concerns through intelligent,
persistent read-side aggregation. As statistics change at the local
level, rstat tracks - on a per-cpu basis - only those parts of a
subtree that have changes pending and require aggregation. The actual
aggregation occurs on the colder read side - which can now skip over
(potentially large) numbers of recently idle cgroups.

---

A kernel build test confirms that the cost is comparable. Two kernels
are built simultaneously in a nested tree with several idle siblings:

root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                             `- build-b (defconfig, make -j16)
                                             `- idle-1
                                             `- ...
                                             `- idle-9

During the builds, kernelbuild/memory.stat is read once a second.

A perf diff shows that the changes in cycle distribution is
minimal. Top 10 kernel symbols:

     0.09%     +0.08%  [kernel.kallsyms]                       [k] __mod_memcg_lruvec_state
     0.00%     +0.06%  [kernel.kallsyms]                       [k] cgroup_rstat_updated
     0.08%     -0.05%  [kernel.kallsyms]                       [k] __mod_memcg_state.part.0
     0.16%     -0.04%  [kernel.kallsyms]                       [k] release_pages
     0.00%     +0.03%  [kernel.kallsyms]                       [k] __count_memcg_events
     0.01%     +0.03%  [kernel.kallsyms]                       [k] mem_cgroup_charge_statistics.constprop.0
     0.10%     -0.02%  [kernel.kallsyms]                       [k] get_mem_cgroup_from_mm
     0.05%     -0.02%  [kernel.kallsyms]                       [k] mem_cgroup_update_lru_size
     0.57%     +0.01%  [kernel.kallsyms]                       [k] asm_exc_page_fault


---

And of course, the on-demand aggregated stats are now fully accurate
again:

$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
  grep -e inactive_file /sys/fs/cgroup/memory.stat

vanilla:                              patched:
nr_inactive_file 1574105088           nr_inactive_file 1027801088
   inactive_file 1577410560              inactive_file 1027801088

---

 block/blk-cgroup.c         |  14 +-
 include/linux/memcontrol.h | 119 ++++++----------
 kernel/cgroup/cgroup.c     |  34 +++--
 kernel/cgroup/rstat.c      |  62 +++++----
 mm/memcontrol.c            | 320 +++++++++++++++++++++----------------------
 5 files changed, 266 insertions(+), 283 deletions(-)

Based on v5.11-rc5-mm1.




             reply	other threads:[~2021-02-02 18:47 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-02 18:47 Johannes Weiner [this message]
2021-02-02 18:47 ` [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
2021-02-02 22:23   ` Shakeel Butt
2021-02-02 23:07   ` Roman Gushchin
2021-02-03  2:28     ` Roman Gushchin
2021-02-04 19:29       ` Johannes Weiner
2021-02-04 19:34         ` Roman Gushchin
2021-02-05 17:50           ` Johannes Weiner
2021-02-04 13:28   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 2/7] mm: memcontrol: kill mem_cgroup_nodeinfo() Johannes Weiner
2021-02-02 22:24   ` Shakeel Butt
2021-02-02 23:13   ` Roman Gushchin
2021-02-04 13:29   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 3/7] mm: memcontrol: privatize memcg_page_state query functions Johannes Weiner
2021-02-02 22:26   ` Shakeel Butt
2021-02-02 23:17   ` Roman Gushchin
2021-02-04 13:30   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 4/7] cgroup: rstat: support cgroup1 Johannes Weiner
2021-02-03  1:16   ` Roman Gushchin
2021-02-04 13:39   ` Michal Hocko
2021-02-04 16:01     ` Johannes Weiner
2021-02-04 16:42       ` Michal Hocko
2021-02-02 18:47 ` [PATCH 5/7] cgroup: rstat: punt root-level optimization to individual controllers Johannes Weiner
2021-02-02 18:47 ` [PATCH 6/7] mm: memcontrol: switch to rstat Johannes Weiner
2021-02-03  1:47   ` Roman Gushchin
2021-02-04 16:26     ` Johannes Weiner
2021-02-04 18:45       ` Roman Gushchin
2021-02-04 20:05         ` Johannes Weiner
2021-02-04 14:19   ` Michal Hocko
2021-02-04 16:15     ` Johannes Weiner
2021-02-04 16:44       ` Michal Hocko
2021-02-04 20:28         ` Johannes Weiner
2021-02-05 15:05   ` Michal Hocko
2021-02-05 16:34     ` Johannes Weiner
2021-02-08 14:07       ` Michal Hocko
2021-02-02 18:47 ` [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
2021-02-03  2:25   ` Roman Gushchin
2021-02-04 21:44     ` Johannes Weiner
2021-02-04 21:47       ` Roman Gushchin
2021-02-05 15:17   ` Michal Hocko
2021-02-05 17:10     ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210202184746.119084-1-hannes@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=guro@fb.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).