From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>, Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>, Roman Gushchin <guro@fb.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 0/7]: mm: memcontrol: switch to rstat
Date: Tue, 2 Feb 2021 13:47:39 -0500	[thread overview]
Message-ID: <20210202184746.119084-1-hannes@cmpxchg.org> (raw)

This series converts memcg stats tracking to the streamlined rstat
infrastructure provided by the cgroup core code. rstat is already used
by the CPU controller and the IO controller.

This change is motivated by recent accuracy problems in memcg's custom
stats code, as well as the benefits of sharing common infra with other
controllers.

The current memcg implementation does batched tree aggregation on the
write side: local stat changes are cached in per-cpu counters, which
are then propagated upward in batches when a threshold (32 pages) is
exceeded. This is cheap, but the error introduced by the lazy upward
propagation adds up: 32 pages times CPUs times cgroups in the subtree.
We've had complaints from service owners that the stats do not
reliably track and react to allocation behavior as expected, sometimes
swallowing the results of entire test applications.

The original memcg stat implementation used to do tree aggregation
exclusively on the read side: local stats would only ever be tracked
in per-cpu counters, and a memory.stat read would iterate the entire
subtree and sum those counters up. This didn't keep up with the times:

- Cgroup trees are much bigger now. We switched to lazily-freed
  cgroups, where deleted groups would hang around until their
  remaining page cache has been reclaimed. This can result in large
  subtrees that are expensive to walk, while most of the groups are
  idle and their statistics don't change much anymore.

- Automated monitoring increased.
  With the proliferation of userspace oom killing, proactive reclaim,
  and higher-resolution logging of workload trends in general,
  top-level stat files are polled at least once a second in many
  deployments.

- The lifetime of cgroups got shorter. Where most cgroup setups in the
  past would have a few large policy-oriented cgroups for everything
  running on the system, newer cgroup deployments tend to create one
  group per application - which gets deleted again as the processes
  exit. An aggregation scheme that doesn't retain child data inside
  the parents loses event history of the subtree.

Rstat addresses all three of those concerns through intelligent,
persistent read-side aggregation. As statistics change at the local
level, rstat tracks - on a per-cpu basis - only those parts of a
subtree that have changes pending and require aggregation. The actual
aggregation occurs on the colder read side - which can now skip over
(potentially large) numbers of recently idle cgroups.

---

A kernel build test confirms that the cost is comparable. Two kernels
are built simultaneously in a nested tree with several idle siblings:

root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                              `- build-b (defconfig, make -j16)
                                              `- idle-1
                                              `- ...
                                              `- idle-9

During the builds, kernelbuild/memory.stat is read once a second. A
perf diff shows that the changes in cycle distribution are minimal.
Top 10 kernel symbols:

     0.09%     +0.08%  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
     0.00%     +0.06%  [kernel.kallsyms]  [k] cgroup_rstat_updated
     0.08%     -0.05%  [kernel.kallsyms]  [k] __mod_memcg_state.part.0
     0.16%     -0.04%  [kernel.kallsyms]  [k] release_pages
     0.00%     +0.03%  [kernel.kallsyms]  [k] __count_memcg_events
     0.01%     +0.03%  [kernel.kallsyms]  [k] mem_cgroup_charge_statistics.constprop.0
     0.10%     -0.02%  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
     0.05%     -0.02%  [kernel.kallsyms]  [k] mem_cgroup_update_lru_size
     0.57%     +0.01%  [kernel.kallsyms]  [k] asm_exc_page_fault

---

And of course, the on-demand aggregated stats are now fully accurate
again:

    $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
      grep -e inactive_file /sys/fs/cgroup/memory.stat

    vanilla:                               patched:
    nr_inactive_file 1574105088            nr_inactive_file 1027801088
    inactive_file 1577410560               inactive_file 1027801088

---

 block/blk-cgroup.c         |  14 +-
 include/linux/memcontrol.h | 119 ++++++----------
 kernel/cgroup/cgroup.c     |  34 +++--
 kernel/cgroup/rstat.c      |  62 +++++----
 mm/memcontrol.c            | 320 +++++++++++++++++++++----------------------
 5 files changed, 266 insertions(+), 283 deletions(-)

Based on v5.11-rc5-mm1.
Thread overview: 82+ messages (duplicate cross-post entries omitted)

2021-02-02 18:47 Johannes Weiner [this message]
2021-02-02 18:47 ` [PATCH 0/7]: mm: memcontrol: switch to rstat Johannes Weiner
2021-02-02 18:47 ` [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
2021-02-02 22:23   ` Shakeel Butt
2021-02-02 23:07   ` Roman Gushchin
2021-02-03  2:28   ` Roman Gushchin
2021-02-04 19:29     ` Johannes Weiner
2021-02-04 19:34       ` Roman Gushchin
2021-02-05 17:50         ` Johannes Weiner
2021-02-04 13:28   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 2/7] mm: memcontrol: kill mem_cgroup_nodeinfo() Johannes Weiner
2021-02-02 22:24   ` Shakeel Butt
2021-02-02 23:13   ` Roman Gushchin
2021-02-04 13:29   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 3/7] mm: memcontrol: privatize memcg_page_state query functions Johannes Weiner
2021-02-02 22:26   ` Shakeel Butt
2021-02-02 23:17   ` Roman Gushchin
2021-02-04 13:30   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 4/7] cgroup: rstat: support cgroup1 Johannes Weiner
2021-02-03  1:16   ` Roman Gushchin
2021-02-04 13:39   ` Michal Hocko
2021-02-04 16:01     ` Johannes Weiner
2021-02-04 16:42       ` Michal Hocko
2021-02-02 18:47 ` [PATCH 5/7] cgroup: rstat: punt root-level optimization to individual controllers Johannes Weiner
2021-02-02 18:47 ` [PATCH 6/7] mm: memcontrol: switch to rstat Johannes Weiner
2021-02-03  1:47   ` Roman Gushchin
2021-02-04 16:26     ` Johannes Weiner
2021-02-04 18:45       ` Roman Gushchin
2021-02-04 20:05         ` Johannes Weiner
2021-02-04 14:19   ` Michal Hocko
2021-02-04 16:15     ` Johannes Weiner
2021-02-04 16:44       ` Michal Hocko
2021-02-04 20:28         ` Johannes Weiner
2021-02-05 15:05           ` Michal Hocko
2021-02-05 16:34             ` Johannes Weiner
2021-02-08 14:07               ` Michal Hocko
2021-02-02 18:47 ` [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
2021-02-03  2:25   ` Roman Gushchin
2021-02-04 21:44     ` Johannes Weiner
2021-02-04 21:47       ` Roman Gushchin
2021-02-05 15:17   ` Michal Hocko
2021-02-05 17:10     ` Johannes Weiner