[PATCH 0/8] mm: memcontrol: switch to rstat v2

* [PATCH 0/8] mm: memcontrol: switch to rstat v2
@ 2021-02-05 18:27 Johannes Weiner
  2021-02-05 18:27 ` [PATCH 1/8] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:27 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

This is version 2 of the memcg rstat patches. Updates since v1:

- added cgroup selftest output (see test section below) (thanks Roman)
- updated cgroup selftest to match new kernel implementation
- added Fixes: tag to 'mm: memcontrol: fix cpuhotplug statistics flushing' (Shakeel)
- collected review & ack tags
- added rstat overview to 'mm: memcontrol: switch to rstat' changelog (Michal)
- simplified memcg_flush_lruvec_page_state() and removed cpu==-1 case (Michal)

---

This series converts memcg stats tracking to the streamlined rstat
infrastructure provided by the cgroup core code. rstat is already used
by the CPU controller and the IO controller. This change is motivated
by recent accuracy problems in memcg's custom stats code, as well as
the benefits of sharing common infra with other controllers.

The current memcg implementation does batched tree aggregation on the
write side: local stat changes are cached in per-cpu counters, which
are then propagated upward in batches when a threshold (32 pages) is
exceeded. This is cheap, but the error introduced by the lazy upward
propagation adds up: 32 pages times CPUs times cgroups in the subtree.
We've had complaints from service owners that the stats do not
reliably track and react to allocation behavior as expected, sometimes
swallowing the results of entire test applications.

The original memcg stat implementation used to do tree aggregation
exclusively on the read side: local stats would only ever be tracked
in per-cpu counters, and a memory.stat read would iterate the entire
subtree and sum those counters up. This didn't keep up with the times:

- Cgroup trees are much bigger now. We switched to lazily-freed
  cgroups, where deleted groups would hang around until their
  remaining page cache has been reclaimed. This can result in large
  subtrees that are expensive to walk, while most of the groups are
  idle and their statistics don't change much anymore.

- Automated monitoring increased. With the proliferation of userspace
  oom killing, proactive reclaim, and higher-resolution logging of
  workload trends in general, top-level stat files are polled at least
  once a second in many deployments.

- The lifetime of cgroups got shorter. Where most cgroup setups in the
  past would have a few large policy-oriented cgroups for everything
  running on the system, newer cgroup deployments tend to create one
  group per application - which gets deleted again as the processes
  exit. An aggregation scheme that doesn't retain child data inside
  the parents loses event history of the subtree.

Rstat addresses all three of those concerns through intelligent,
persistent read-side aggregation. As statistics change at the local
level, rstat tracks - on a per-cpu basis - only those parts of a
subtree that have changes pending and require aggregation. The actual
aggregation occurs on the colder read side - which can now skip over
(potentially large) numbers of recently idle cgroups.

---

The test_kmem cgroup selftest is currently failing due to excessive
cumulative vmstat drift from 100 subgroups:

    ok 1 test_kmem_basic
    memory.current = 8810496
    slab + anon + file + kernel_stack = 17074568
    slab = 6101384
    anon = 946176
    file = 0
    kernel_stack = 10027008
    not ok 2 test_kmem_memcg_deletion
    ok 3 test_kmem_proc_kpagecgroup
    ok 4 test_kmem_kernel_stacks
    ok 5 test_kmem_dead_cgroups
    ok 6 test_percpu_basic

As you can see, memory.stat items far exceed memory.current. The
kernel stack alone is bigger than all of charged memory. That's
because the memory of the test has been uncharged from memory.current,
but the negative vmstat deltas are still sitting in the percpu caches.

The test at this time isn't even counting percpu, pagetables etc. yet,
which would further contribute to the error. The last patch in the
series updates the test to include them - as well as match error
expectations on the new implementation to catch future regressions.

With all patches applied, the (updated, more rigorous) test succeeds:

    ok 1 test_kmem_basic
    ok 2 test_kmem_memcg_deletion
    ok 3 test_kmem_proc_kpagecgroup
    ok 4 test_kmem_kernel_stacks
    ok 5 test_kmem_dead_cgroups
    ok 6 test_percpu_basic

---

A kernel build test confirms that overhead is comparable. Two kernels
are built simultaneously in a nested tree with several idle siblings:

root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                             `- build-b (defconfig, make -j16)
                                             `- idle-1
                                             `- ...
                                             `- idle-9

During the builds, kernelbuild/memory.stat is read once a second.

A perf diff shows that the changes in cycle distribution is
minimal. Top 10 kernel symbols:

     0.09%     +0.08%  [kernel.kallsyms]                       [k] __mod_memcg_lruvec_state
     0.00%     +0.06%  [kernel.kallsyms]                       [k] cgroup_rstat_updated
     0.08%     -0.05%  [kernel.kallsyms]                       [k] __mod_memcg_state.part.0
     0.16%     -0.04%  [kernel.kallsyms]                       [k] release_pages
     0.00%     +0.03%  [kernel.kallsyms]                       [k] __count_memcg_events
     0.01%     +0.03%  [kernel.kallsyms]                       [k] mem_cgroup_charge_statistics.constprop.0
     0.10%     -0.02%  [kernel.kallsyms]                       [k] get_mem_cgroup_from_mm
     0.05%     -0.02%  [kernel.kallsyms]                       [k] mem_cgroup_update_lru_size
     0.57%     +0.01%  [kernel.kallsyms]                       [k] asm_exc_page_fault

---

The on-demand aggregated stats are now fully accurate:

$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
  grep -e inactive_file /sys/fs/cgroup/memory.stat

vanilla:                              patched:
nr_inactive_file 1574105088           nr_inactive_file 1027801088
   inactive_file 1577410560              inactive_file 1027801088

---

 block/blk-cgroup.c                         |  14 +-
 include/linux/memcontrol.h                 | 119 ++++-------
 kernel/cgroup/cgroup.c                     |  34 +--
 kernel/cgroup/rstat.c                      |  62 +++---
 mm/memcontrol.c                            | 306 +++++++++++++--------------
 tools/testing/selftests/cgroup/test_kmem.c |  22 +-
 6 files changed, 266 insertions(+), 291 deletions(-)

Based on v5.11-rc5-mm1.

^ permalink raw reply	[flat|nested] 22+ messages in thread